Tag archive: string

Removing a list of characters from a string

Question: Removing a list of characters from a string

I want to remove characters from a string in Python:

string.replace(',', '').replace("!", '').replace(":", '').replace(";", '')...

But I have many characters to remove. I thought about using a list:

list = [',', '!', '.', ';'...]

But how can I use the list to replace the characters in the string?


Answer 0

If you're using Python 2 and your inputs are byte strings (not unicode objects), by far the fastest method is str.translate:

>>> chars_to_remove = ['.', '!', '?']
>>> subj = 'A.B!C?'
>>> subj.translate(None, ''.join(chars_to_remove))
'ABC'

Otherwise, there are the following options to consider:

A. Iterate the subject char by char, omit unwanted characters and join the resulting list:

>>> sc = set(chars_to_remove)
>>> ''.join([c for c in subj if c not in sc])
'ABC'

(Note that the generator version ''.join(c for c ...) will be less efficient).

B. Create a regular expression on the fly and re.sub with an empty string:

>>> import re
>>> rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
>>> re.sub(rx, '', subj)
'ABC'

(re.escape ensures that characters like ^ or ] won’t break the regular expression).
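
As a rough Python 3 sketch of option B (the function name remove_chars and the sample inputs are illustrative assumptions, not part of the original answer), the character class can be guarded against an empty list, since an empty class [] is not a valid pattern:

```python
import re

def remove_chars(subject, chars):
    """Remove every character listed in chars from subject using one regex."""
    if not chars:
        # '[]' would be an invalid (unterminated) character class
        return subject
    # re.escape keeps characters such as ^, ] and \ literal inside the class
    pattern = '[' + re.escape(''.join(chars)) + ']'
    return re.sub(pattern, '', subject)

print(remove_chars('a^b]c\\d', ['^', ']', '\\']))  # abcd
print(remove_chars('A.B!C?', []))                  # A.B!C?
```

Without re.escape, a leading ^ would negate the class and a ] would close it early.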

C. Use the mapping variant of translate:

>>> chars_to_remove = [u'δ', u'Γ', u'ж']
>>> subj = u'AжBδCΓ'
>>> dd = {ord(c):None for c in chars_to_remove}
>>> subj.translate(dd)
u'ABC'

Full testing code and timings:

#coding=utf8

import re

def remove_chars_iter(subj, chars):
    sc = set(chars)
    return ''.join([c for c in subj if c not in sc])

def remove_chars_re(subj, chars):
    return re.sub('[' + re.escape(''.join(chars)) + ']', '', subj)

def remove_chars_re_unicode(subj, chars):
    return re.sub(u'(?u)[' + re.escape(''.join(chars)) + ']', '', subj)

def remove_chars_translate_bytes(subj, chars):
    return subj.translate(None, ''.join(chars))

def remove_chars_translate_unicode(subj, chars):
    d = {ord(c):None for c in chars}
    return subj.translate(d)

import timeit, sys

def profile(f):
    assert f(subj, chars_to_remove) == test
    t = timeit.timeit(lambda: f(subj, chars_to_remove), number=1000)
    print ('{0:.3f} {1}'.format(t, f.__name__))

print (sys.version)
PYTHON2 = sys.version_info[0] == 2

print ('\n"plain" string:\n')

chars_to_remove = ['.', '!', '?']
subj = 'A.B!C?' * 1000
test = 'ABC' * 1000

profile(remove_chars_iter)
profile(remove_chars_re)

if PYTHON2:
    profile(remove_chars_translate_bytes)
else:
    profile(remove_chars_translate_unicode)

print ('\nunicode string:\n')

if PYTHON2:
    chars_to_remove = [u'δ', u'Γ', u'ж']
    subj = u'AжBδCΓ'
else:
    chars_to_remove = ['δ', 'Γ', 'ж']
    subj = 'AжBδCΓ'

subj = subj * 1000
test = 'ABC' * 1000

profile(remove_chars_iter)

if PYTHON2:
    profile(remove_chars_re_unicode)
else:
    profile(remove_chars_re)

profile(remove_chars_translate_unicode)

Results:

2.7.5 (default, Mar  9 2014, 22:15:05) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)]

"plain" string:

0.637 remove_chars_iter
0.649 remove_chars_re
0.010 remove_chars_translate_bytes

unicode string:

0.866 remove_chars_iter
0.680 remove_chars_re_unicode
1.373 remove_chars_translate_unicode

---

3.4.2 (v3.4.2:ab2c023a9432, Oct  5 2014, 20:42:22) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]

"plain" string:

0.512 remove_chars_iter
0.574 remove_chars_re
0.765 remove_chars_translate_unicode

unicode string:

0.817 remove_chars_iter
0.686 remove_chars_re
0.876 remove_chars_translate_unicode

(As a side note, the figure for remove_chars_translate_bytes might give us a clue why the industry was reluctant to adopt Unicode for such a long time).


Answer 1

You can use str.translate() (this two-argument form works on Python 2 byte strings only):

s.translate(None, ",!.;")

Example:

>>> s = "asjo,fdjk;djaso,oio!kod.kjods;dkps"
>>> s.translate(None, ",!.;")
'asjofdjkdjasooiokodkjodsdkps'

Answer 2

You can use the translate method.

s.translate(None, '!.;,')

Answer 3

''.join(c for c in myString if c not in badTokens)

Answer 4

If you are using Python 3 and looking for the translate solution: the method has changed and now takes one parameter instead of two.

That parameter is a table (it can be a dictionary) where each key is the Unicode ordinal (int) of the character to find and the value is the replacement: a Unicode ordinal, a string to map the key to, or None to delete the character.

Here is a usage example:

>>> list = [',', '!', '.', ';']
>>> s = "This is, my! str,ing."
>>> s.translate({ord(x): '' for x in list})
'This is my string'
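
To make the kinds of table values concrete, here is a small sketch; the mapping and the sample sentence are made up for illustration. A value of None deletes the character, an ordinal replaces it with one character, and a string replaces it with the whole string:

```python
# Hypothetical mapping showing the three kinds of values translate() accepts
table = {
    ord(','): None,       # None deletes the character
    ord('!'): ord('?'),   # an ordinal replaces it with that character
    ord('.'): '...',      # a string replaces it with the whole string
}

result = "Hi, there! Bye.".translate(table)
print(result)  # Hi there? Bye...
```

For pure deletion, str.maketrans('', '', ',!.;') builds an equivalent table with None values.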

Answer 5

Another approach using regex:

''.join(re.split(r'[.;!?,]', s))

Answer 6

Why not a simple loop?

for i in replace_list:
    string = string.replace(i, '')

Also, avoid naming lists 'list'. It shadows the built-in list type.


Answer 7

You could use something like this:

def replace_all(text, dic):
    # dic maps each substring to its replacement
    for i, j in dic.items():
        text = text.replace(i, j)
    return text

This code is not my own; it comes from here. It's a great article and discusses doing this in depth.


Answer 8

Also an interesting topic: removing accents from a Unicode string by converting each character to its standard unaccented form:

What is the best way to remove accents in a python unicode string?

Code extract from the topic:

import unicodedata

def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])
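
A short Python 3 usage sketch of the same idea (the sample strings are assumptions chosen for illustration). NFKD normalization splits an accented letter into its base letter plus a combining mark, and the filter then drops the marks:

```python
import unicodedata

def remove_accents(input_str):
    # NFKD splits an accented character into its base character plus combining marks
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    # Keep only the characters that are not combining marks
    return ''.join(c for c in nfkd_form if not unicodedata.combining(c))

print(remove_accents('café déjà vu'))  # cafe deja vu
```

Characters that have no decomposition, such as 'ø', pass through unchanged, so this is not a general transliteration.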

Answer 9

Perhaps a more modern and functional way to achieve what you wish:

>>> subj = 'A.B!C?'
>>> list = set([',', '!', '.', ';', '?'])
>>> filter(lambda x: x not in list, subj)
'ABC'

Please note that for this particular purpose it's quite an overkill, but once you need more complex conditions, filter comes in handy.
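
The session above returns a string only on Python 2. On Python 3, filter() returns a lazy iterator, so the result has to be joined back into a string. A minimal sketch, with the set renamed to unwanted (my choice) so it does not shadow the built-in list:

```python
subj = 'A.B!C?'
unwanted = {',', '!', '.', ';', '?'}

# On Python 3, filter() yields the kept characters lazily, so join them
result = ''.join(filter(lambda ch: ch not in unwanted, subj))
print(result)  # ABC
```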


Answer 10

Simple way:

import re
text = 'this is string !    >><< (foo---> bar) @-tuna-#   sandwich-%-is-$-* good'

# condense multiple whitespace characters into one space
text = ' '.join(text.split())

# replace each space with a dash
text = text.replace(" ", "-")

# take out any char that matches the regex
text = re.sub('[!@#$%^&*()_+<>]', '', text)

output:

this-is-string--foo----bar--tuna---sandwich--is---good


Answer 11

How about this one-liner (on Python 3, reduce must be imported from functools):

from functools import reduce
reduce(lambda x,y : x.replace(y,"") ,[',', '!', '.', ';'],";Test , ,  !Stri!ng ..")

Answer 12

I think this is simple enough and will do!

unwanted = [",",",","!",";",":"] #the list goes on.....

theString = "dlkaj;lkdjf'adklfaj;lsd'fa'dfj;alkdjf" #is an example string;
newString="" #the unwanted character free string
for i in range(len(theString)):
    if theString[i] in unwanted:
        newString += "" #concatenate an empty string.
    else:
        newString += theString[i]

This is one way to do it. But if you are tired of keeping a list of characters that you want to remove, you can instead filter by the ordinal of each character you iterate over. The ordinal is the ASCII value of that character: the digit 0 has ordinal 48 and lowercase z has ordinal 122. Note that this range also keeps some punctuation, such as ':' (58) and ';' (59):

theString = "lkdsjf;alkd8a'asdjf;lkaheoialkdjf;ad"
newString = ""
for i in range(len(theString)):
    if ord(theString[i]) < 48 or ord(theString[i]) > 122: #ord() => ascii num.
        newString += ""
    else:
        newString += theString[i]

Answer 13

These days I am diving into Scheme, and now I think I am good at recursion and eval. HAHAHA. Just sharing some other ways:

First, eval it:

print(eval('string%s' % (''.join(['.replace("%s","")' % i for i in replace_list]))))

Second, recurse it:

def repn(string, replace_list):
    # note: pop() empties the caller's replace_list as a side effect
    if replace_list == []:
        return string
    else:
        return repn(string.replace(replace_list.pop(), ""), replace_list)

print(repn(string, replace_list))

Hey, don't downvote. I just want to share some new ideas.


Answer 14

I am thinking about a solution for this. First I would turn the input string into a list. Then I would replace the unwanted items of the list. Then, using join, I would return the list as a string. The code can be like this:

def the_replacer(text):
    test = []
    for m in range(len(text)):
        test.append(text[m])
        if test[m]==','\
        or test[m]=='!'\
        or test[m]=='.'\
        or test[m]=='\''\
        or test[m]==';':
    #....
            test[m]=''
    return ''.join(test)

This would remove any of the listed characters from the string. What do you think about that?


Answer 15

Here is a more_itertools approach:

import more_itertools as mit


s = "A.B!C?D_E@F#"
blacklist = ".!?_@#"

"".join(mit.flatten(mit.split_at(s, pred=lambda x: x in set(blacklist))))
# 'ABCDEF'

Here we split upon items found in the blacklist, flatten the results and join the string.


Answer 16

Python 3, single line list comprehension implementation.

from string import ascii_lowercase # 'abcdefghijklmnopqrstuvwxyz'
def remove_chars(input_string, removable):
  return ''.join([_ for _ in input_string if _ not in removable])

print(remove_chars(input_string="Stack Overflow", removable=ascii_lowercase))
# prints: S O

Answer 17

Remove *%,&@! from below string:

s = "this is my string,  and i will * remove * these ** %% "
new_string = s.translate(s.maketrans('','','*%,&@!'))
print(new_string)

# output: this is my string  and i will  remove  these  

How to check if a string contains an element from a list in Python

Question: How to check if a string contains an element from a list in Python

I have something like this:

extensionsToCheck = ['.pdf', '.doc', '.xls']

for extension in extensionsToCheck:
    if extension in url_string:
        print(url_string)

I am wondering what would be the more elegant way to do this in Python (without using the for loop)? I was thinking of something like this (like from C/C++), but it didn’t work:

if ('.pdf' or '.doc' or '.xls') in url_string:
    print(url_string)

Edit: I’m kinda forced to explain how this is different to the question below which is marked as potential duplicate (so it doesn’t get closed I guess).

The difference is, I wanted to check if a string is part of some list of strings whereas the other question is checking whether a string from a list of strings is a substring of another string. Similar, but not quite the same and semantics matter when you’re looking for an answer online IMHO. These two questions are actually looking to solve the opposite problem of one another. The solution for both turns out to be the same though.


Answer 0

Use a generator together with any, which short-circuits on the first True:

if any(ext in url_string for ext in extensionsToCheck):
    print(url_string)

EDIT: I see this answer has been accepted by the OP. Though my solution may be a “good enough” solution to his particular problem, and is a good general way to check if any strings in a list are found in another string, keep in mind that this is all that this solution does. It does not care WHERE the string is found, e.g. at the end of the string. If this is important, as is often the case with URLs, you should look at the answer of @Wladimir Palant, or you risk getting false positives.
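
The false-positive risk can be seen in a small sketch. The helper name contains_any and the example.com URLs are hypothetical, chosen only to illustrate the point:

```python
extensions_to_check = ['.pdf', '.doc', '.xls']

def contains_any(text, needles):
    # any() stops at the first needle that is found in text
    return any(needle in text for needle in needles)

print(contains_any('http://example.com/report.pdf', extensions_to_check))  # True
print(contains_any('http://example.com/index.html', extensions_to_check))  # False
# False positive: '.doc' occurs in the host name, not as the file extension
print(contains_any('http://my.docs.example.com/index.html', extensions_to_check))  # True
```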


Answer 1

extensionsToCheck = ('.pdf', '.doc', '.xls')

'test.doc'.endswith(extensionsToCheck)   # returns True

'test.jpg'.endswith(extensionsToCheck)   # returns False
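
One detail worth spelling out: str.endswith accepts a single string or a tuple of strings, but not a list. If the extensions arrive as a list, as in the question, a sketch like the following converts them first (the helper name has_extension is an assumption for illustration):

```python
extensions_to_check = ['.pdf', '.doc', '.xls']  # list input, as in the question

def has_extension(name, extensions):
    # str.endswith accepts a str or a tuple of str, so convert other iterables
    return name.endswith(tuple(extensions))

print(has_extension('test.doc', extensions_to_check))  # True
print(has_extension('test.jpg', extensions_to_check))  # False
```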

Answer 2

It is better to parse the URL properly – this way you can handle http://.../file.doc?foo and http://.../foo.doc/file.exe correctly.

from urlparse import urlparse
import os
path = urlparse(url_string).path
ext = os.path.splitext(path)[1]
if ext in extensionsToCheck:
  print(url_string)
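
The code above uses the Python 2 module name. On Python 3, urlparse lives in urllib.parse. A sketch of the same approach follows; the helper name url_extension and the example.com URLs are stand-ins for the elided URLs in the answer:

```python
import os
from urllib.parse import urlparse  # Python 3 location of urlparse

extensions_to_check = ('.pdf', '.doc', '.xls')

def url_extension(url_string):
    # Only the path component is inspected; query string and fragment are ignored
    path = urlparse(url_string).path
    return os.path.splitext(path)[1]

print(url_extension('http://example.com/file.doc?foo'))      # .doc
print(url_extension('http://example.com/foo.doc/file.exe'))  # .exe
print(url_extension('http://example.com/file.doc?foo') in extensions_to_check)  # True
```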

Answer 3

Use a list comprehension if you want a single-line solution. The following code returns a list containing url_string once for each of the extensions .doc, .pdf and .xls found in it, or an empty list when it contains none of them.

print([url_string for extension in extensionsToCheck if extension in url_string])

NOTE: This only checks whether url_string contains an extension; it is not useful when you want to extract the exact word matching the extension.


Answer 4

Check if it matches this regex:

'(\.pdf$|\.doc$|\.xls$)'

Note: if your extensions are not at the end of the URL, remove the $ characters, but this weakens the check slightly.
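
One way to apply such a pattern, as a sketch: the factored form below should match the same strings as the pattern in the answer, and the helper name and example.com URLs are assumptions. re.search is used rather than re.match, because re.match only matches at the start of the string:

```python
import re

# Same alternatives as the answer's pattern, with the shared '\.' and '$' factored out
EXTENSION_RE = re.compile(r'\.(pdf|doc|xls)$')

def has_known_extension(url_string):
    # search() scans the whole string; $ restricts the match to its end
    return EXTENSION_RE.search(url_string) is not None

print(has_known_extension('http://example.com/report.pdf'))       # True
print(has_known_extension('http://example.com/report.pdf.html'))  # False
```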


Answer 5

This is a variant of the list comprehension answer given by @psun.

By switching the output value, you can actually extract the matching pattern from the list comprehension (something not possible with the any() approach by @Lauritz-v-Thaulow)

extensionsToCheck = ['.pdf', '.doc', '.xls']
url_string = 'http://.../foo.doc'

print([extension for extension in extensionsToCheck if extension in url_string])

['.doc']

You can furthermore insert a regular expression if you want to collect additional information once the matched pattern is known (this could be useful when the list of allowed patterns is too long to write into a single regex pattern). Note that re must be imported, and re.escape keeps the dot in the extension literal:

print([re.search(r'(\w+)' + re.escape(extension), url_string).group(0) for extension in extensionsToCheck if extension in url_string])

['foo.doc']


How do I format a string using a dictionary in python-3.x?

Question: How do I format a string using a dictionary in python-3.x?

I am a big fan of using dictionaries to format strings. It helps me read the string format I am using as well as let me take advantage of existing dictionaries. For example:

class MyClass:
    def __init__(self):
        self.title = 'Title'

a = MyClass()
print 'The title is %(title)s' % a.__dict__

path = '/path/to/a/file'
print 'You put your file here: %(path)s' % locals()

However I cannot figure out the python 3.x syntax for doing the same (or if that is even possible). I would like to do the following

# Fails, KeyError 'latitude'
geopoint = {'latitude':41.123,'longitude':71.091}
print '{latitude} {longitude}'.format(geopoint)

# Succeeds
print '{latitude} {longitude}'.format(latitude=41.123,longitude=71.091)

Answer 0

Since the question is specific to Python 3, here’s using the new f-string syntax, available since Python 3.6:

>>> geopoint = {'latitude':41.123,'longitude':71.091}
>>> print(f'{geopoint["latitude"]} {geopoint["longitude"]}')
41.123 71.091

Note the outer single quotes and inner double quotes (you could also do it the other way around).


Answer 1

Is this good for you?

geopoint = {'latitude':41.123,'longitude':71.091}
print('{latitude} {longitude}'.format(**geopoint))

Answer 2

To unpack a dictionary into keyword arguments, use **. Also, new-style formatting supports referring to attributes of objects and items of mappings:

'{0[latitude]} {0[longitude]}'.format(geopoint)
'The title is {0.title}'.format(a) # the a from your first example

Answer 3

As Python 3.0 and 3.1 are EOL’ed and no one uses them, you can and should use str.format_map(mapping) (Python 3.2+):

Similar to str.format(**mapping), except that mapping is used directly and not copied to a dict. This is useful if for example mapping is a dict subclass.

What this means is that you can use for example a defaultdict that would set (and return) a default value for keys that are missing:

>>> from collections import defaultdict
>>> vals = defaultdict(lambda: '<unset>', {'bar': 'baz'})
>>> 'foo is {foo} and bar is {bar}'.format_map(vals)
'foo is <unset> and bar is baz'
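
Because format_map looks keys up on the mapping itself, any dict subclass with a __missing__ method works too. The sketch below uses a hypothetical KeepMissing class (not from the answer) that leaves unknown placeholders in place instead of substituting a default:

```python
class KeepMissing(dict):
    """Hypothetical dict subclass that leaves unknown placeholders untouched."""
    def __missing__(self, key):
        # Called by dict.__getitem__ when key is absent
        return '{' + key + '}'

template = 'foo is {foo} and bar is {bar}'
result = template.format_map(KeepMissing(bar='baz'))
print(result)  # foo is {foo} and bar is baz
```

Unlike defaultdict, this does not insert the missing key into the mapping. With template.format(**mapping) the subclass behavior is lost, because the keyword arguments are copied into a plain dict.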

Even if the mapping provided is a dict, not a subclass, this would probably still be slightly faster.

The difference is not big though, given

>>> d = dict(foo='x', bar='y', baz='z')

then

>>> 'foo is {foo}, bar is {bar} and baz is {baz}'.format_map(d)

is about 10 ns (2 %) faster than

>>> 'foo is {foo}, bar is {bar} and baz is {baz}'.format(**d)

on my Python 3.4.3. The difference would probably be larger as more keys are in the dictionary.


Note that the format language is much more flexible than that, though: replacement fields can contain indexed expressions, attribute accesses and so on, so you can format a whole object, or two of them:

>>> p1 = {'latitude':41.123,'longitude':71.091}
>>> p2 = {'latitude':56.456,'longitude':23.456}
>>> '{0[latitude]} {0[longitude]} - {1[latitude]} {1[longitude]}'.format(p1, p2)
'41.123 71.091 - 56.456 23.456'

Starting from 3.6 you can use the interpolated strings too:

>>> f'lat:{p1["latitude"]} lng:{p1["longitude"]}'
'lat:41.123 lng:71.091'

You just need to remember to use the other quote characters within the nested quotes. Another upside of this approach is that it is much faster than calling a formatting method.


Answer 4

print("{latitude} {longitude}".format(**geopoint))

Answer 5

The Python 2 syntax works in Python 3 as well:

>>> class MyClass:
...     def __init__(self):
...         self.title = 'Title'
... 
>>> a = MyClass()
>>> print('The title is %(title)s' % a.__dict__)
The title is Title
>>> 
>>> path = '/path/to/a/file'
>>> print('You put your file here: %(path)s' % locals())
You put your file here: /path/to/a/file

Answer 6

geopoint = {'latitude':41.123,'longitude':71.091}

# working examples.
print(f'{geopoint["latitude"]} {geopoint["longitude"]}') # from above answer
print('{geopoint[latitude]} {geopoint[longitude]}'.format(geopoint=geopoint)) # alternate for format method  (including dict name in string).
print('%(latitude)s %(longitude)s'%geopoint) # thanks @tcll

Answer 7

Most answers formatted only the values of the dict.

If you want to also format the key into the string you can use dict.items():

geopoint = {'latitude':41.123,'longitude':71.091}
print("{} {}".format(*geopoint.items()))

Output:

('latitude', 41.123) ('longitude', 71.091)

If you want to format in an arbitrary way, that is, not showing the key-values like tuples:

from functools import reduce
print("{} is {} and {} is {}".format(*reduce((lambda x, y: x + y), [list(item) for item in geopoint.items()])))

Output:

latitude is 41.123 and longitude is 71.091
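
A possibly simpler alternative to the reduce call, offered as a sketch rather than as the answer's own method: format each key/value pair on its own and join the pieces. This also works for any number of entries, and relies on dicts preserving insertion order (guaranteed since Python 3.7):

```python
geopoint = {'latitude': 41.123, 'longitude': 71.091}

# Format each key/value pair, then join the pieces
result = ' and '.join('{} is {}'.format(key, value) for key, value in geopoint.items())
print(result)  # latitude is 41.123 and longitude is 71.091
```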


Python __str__ versus __unicode__

Question: Python __str__ versus __unicode__

Is there a Python convention for when you should implement __str__() versus __unicode__()? I've seen classes override __unicode__() more frequently than __str__() but it doesn't appear to be consistent. Are there specific rules when it is better to implement one versus the other? Is it necessary/good practice to implement both?


Answer 0

__str__() is the old method — it returns bytes. __unicode__() is the new, preferred method — it returns characters. The names are a bit confusing, but in 2.x we’re stuck with them for compatibility reasons. Generally, you should put all your string formatting in __unicode__(), and create a stub __str__() method:

def __str__(self):
    return unicode(self).encode('utf-8')

In 3.0, str contains characters, so the same methods are named __bytes__() and __str__(). These behave as expected.
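
For comparison, a minimal Python 3 sketch of the same pairing; the Greeting class and its sample text are invented for illustration. __str__ returns text, and __bytes__ returns its encoded form:

```python
class Greeting:
    def __init__(self, text):
        self.text = text

    def __str__(self):
        # Python 3: __str__ returns text (str)
        return self.text

    def __bytes__(self):
        # Python 3: __bytes__ returns the encoded form (bytes)
        return str(self).encode('utf-8')

g = Greeting('héllo')
print(str(g))    # héllo
print(bytes(g))  # b'h\xc3\xa9llo'
```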


Answer 1

If I didn’t especially care about micro-optimizing stringification for a given class I’d always implement __unicode__ only, as it’s more general. When I do care about such minute performance issues (which is the exception, not the rule), having __str__ only (when I can prove there never will be non-ASCII characters in the stringified output) or both (when both are possible), might help.

These I think are solid principles, but in practice it’s very common to KNOW there will be nothing but ASCII characters without doing effort to prove it (e.g. the stringified form only has digits, punctuation, and maybe a short ASCII name;-) in which case it’s quite typical to move on directly to the “just __str__” approach (but if a programming team I worked with proposed a local guideline to avoid that, I’d be +1 on the proposal, as it’s easy to err in these matters AND “premature optimization is the root of all evil in programming”;-).


Answer 2

With the world getting smaller, chances are that any string you encounter will contain Unicode eventually. So for any new apps, you should at least provide __unicode__(). Whether you also override __str__() is then just a matter of taste.


Answer 3

If you are working in both python2 and python3 in Django, I recommend the python_2_unicode_compatible decorator:

Django provides a simple way to define __str__() and __unicode__() methods that work on Python 2 and 3: you must define a __str__() method returning text and apply the python_2_unicode_compatible() decorator.

As noted in earlier comments on another answer, some versions of future.utils also support this decorator. On my system, I needed to install a newer future module for python2 and install future for python3. After that, here is a functional example:

#! /usr/bin/env python

from future.utils import python_2_unicode_compatible
from sys import version_info

@python_2_unicode_compatible
class SomeClass():
    def __str__(self):
        return "Called __str__"


if __name__ == "__main__":
    some_inst = SomeClass()
    print(some_inst)
    if (version_info > (3,0)):
        print("Python 3 does not support unicode()")
    else:
        print(unicode(some_inst))

Here is example output (where venv2/venv3 are virtualenv instances):

~/tmp$ ./venv3/bin/python3 demo_python_2_unicode_compatible.py 
Called __str__
Python 3 does not support unicode()

~/tmp$ ./venv2/bin/python2 demo_python_2_unicode_compatible.py 
Called __str__
Called __str__

Answer 4

Python 2: Implement __str__() only, and return a unicode.

When __unicode__() is omitted and someone calls unicode(o) or u"%s"%o, Python calls o.__str__() and converts to unicode using the system encoding. (See documentation of __unicode__().)

The opposite is not true. If you implement __unicode__() but not __str__(), then when someone calls str(o) or "%s"%o, Python returns repr(o).


Rationale

Why would it work to return a unicode from __str__()?
If __str__() returns a unicode, Python automatically converts it to str using the system encoding.

What’s the benefit?
① It frees you from worrying about what the system encoding is (i.e., locale.getpreferredencoding(…)). Personally, I find that messy, and I think it’s something the system should take care of anyway. ② If you are careful, your code may come out cross-compatible with Python 3, in which __str__() returns unicode.

Isn’t it deceptive to return a unicode from a function called __str__()?
A little. However, you might be already doing it. If you have from __future__ import unicode_literals at the top of your file, there’s a good chance you’re returning a unicode without even knowing it.

What about Python 3?
Python 3 does not use __unicode__(). However, if you implement __str__() so that it returns unicode under either Python 2 or Python 3, then that part of your code will be cross-compatible.

What if I want unicode(o) to be substantively different from str()?
Implement both __str__() (possibly returning str) and __unicode__(). I imagine this would be rare, but you might want substantively different output (e.g., ASCII versions of special characters, like ":)" for u"☺").

I realize some may find this controversial.


Answer 5

It’s worth pointing out to those unfamiliar with the __unicode__ function some of the default behaviors surrounding it back in Python 2.x, especially when defined side by side with __str__.

class A :
    def __init__(self) :
        self.x = 123
        self.y = 23.3

    #def __str__(self) :
    #    return "STR      {}      {}".format( self.x , self.y)
    def __unicode__(self) :
        return u"UNICODE  {}      {}".format( self.x , self.y)

a1 = A()
a2 = A()

print( "__repr__ checks")
print( a1 )
print( a2 )

print( "\n__str__ vs __unicode__ checks")
print( str( a1 ))
print( unicode(a1))
print( "{}".format( a1 ))
print( u"{}".format( a1 ))

yields the following console output…

__repr__ checks
<__main__.A instance at 0x103f063f8>
<__main__.A instance at 0x103f06440>

__str__ vs __unicode__ checks
<__main__.A instance at 0x103f063f8>
UNICODE 123      23.3
<__main__.A instance at 0x103f063f8>
UNICODE 123      23.3

Now when I uncomment the __str__ method:

__repr__ checks
STR      123      23.3
STR      123      23.3

__str__ vs __unicode__ checks
STR      123      23.3
UNICODE  123      23.3
STR      123      23.3
UNICODE  123      23.3

How do I search and replace text in a file?

Question: How do I search and replace text in a file?

How do I search and replace text in a file using Python 3?

Here is my code:

import os
import sys
import fileinput

print ("Text to search for:")
textToSearch = input( "> " )

print ("Text to replace it with:")
textToReplace = input( "> " )

print ("File to perform Search-Replace on:")
fileToSearch  = input( "> " )
#fileToSearch = 'D:\dummy1.txt'

tempFile = open( fileToSearch, 'r+' )

for line in fileinput.input( fileToSearch ):
    if textToSearch in line :
        print('Match Found')
    else:
        print('Match Not Found!!')
    tempFile.write( line.replace( textToSearch, textToReplace ) )
tempFile.close()


input( '\n\n Press Enter to exit...' )

Input file:

hi this is abcd hi this is abcd
This is dummy text file.
This is how search and replace works abcd

When I search and replace ‘ram’ by ‘abcd’ in the above input file, it works like a charm. But when I do it the other way round, i.e. replacing ‘abcd’ by ‘ram’, some junk characters are left at the end.

Replacing ‘abcd’ by ‘ram’

hi this is ram hi this is ram
This is dummy text file.
This is how search and replace works rambcd

Answer 0

fileinput already supports inplace editing. It redirects stdout to the file in this case:

#!/usr/bin/env python3
import fileinput

with fileinput.FileInput(filename, inplace=True, backup='.bak') as file:
    for line in file:
        print(line.replace(text_to_search, replacement_text), end='')
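
The snippet above leaves filename, text_to_search and replacement_text undefined. A self-contained sketch might look like this; the function name replace_in_file and the throwaway demo file are assumptions made for the example:

```python
import fileinput
import os
import tempfile


def replace_in_file(filename, text_to_search, replacement_text):
    """Replace every occurrence of text_to_search in filename, in place."""
    # inplace=True redirects print() into the file; backup='.bak' keeps the
    # original content in filename + '.bak'.
    with fileinput.input(files=[filename], inplace=True, backup=".bak") as f:
        for line in f:
            print(line.replace(text_to_search, replacement_text), end="")


# Demo on a throwaway file (path and contents are made up).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "dummy1.txt")
with open(demo_path, "w") as fh:
    fh.write("hi this is abcd\nsearch and replace works abcd\n")

replace_in_file(demo_path, "abcd", "ram")

with open(demo_path) as fh:
    result = fh.read()
```

Because each line is rewritten as a whole, a shorter replacement leaves no stray characters behind, and the .bak copy gives you something to fall back on.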

Answer 1

As pointed out by michaelb958, you cannot replace in place with data of a different length because this will put the rest of the sections out of place. I disagree with the other posters suggesting you read from one file and write to another. Instead, I would read the file into memory, fix the data up, and then write it out to the same file in a separate step.

# Read in the file
with open('file.txt', 'r') as file :
  filedata = file.read()

# Replace the target string
filedata = filedata.replace('ram', 'abcd')

# Write the file out again
with open('file.txt', 'w') as file:
  file.write(filedata)

This works unless you’ve got a massive file that is too big to load into memory in one go, or you are concerned about potential data loss if the process is interrupted during the second step, in which you write data to the file.


Answer 2

As Jack Aidley had posted and J.F. Sebastian pointed out, this code will not work (with file = open(...) is not valid syntax, and the result of filedata.replace(...) is discarded):

# Read in the file
filedata = None
with file = open('file.txt', 'r') :
  filedata = file.read()

# Replace the target string
filedata.replace('ram', 'abcd')

# Write the file out again
with file = open('file.txt', 'w') :
  file.write(filedata)

But this code WILL work (I’ve tested it):

f = open(filein,'r')
filedata = f.read()
f.close()

newdata = filedata.replace("old data","new data")

f = open(fileout,'w')
f.write(newdata)
f.close()

Using this method, filein and fileout can be the same file, because Python (3.3 here) truncates the file when it is opened for writing, and by then its content has already been read into memory.


Answer 3

You can do the replacement like this

f1 = open('file1.txt', 'r')
f2 = open('file2.txt', 'w')
for line in f1:
    f2.write(line.replace('old_text', 'new_text'))
f1.close()
f2.close()

Answer 4

You can also use pathlib.

from pathlib import Path  # standard library since Python 3.4 (pathlib2 is the backport for older versions)
path = Path(file_to_search)
text = path.read_text()
text = text.replace(text_to_search, replacement_text)
path.write_text(text)

Answer 5

With a single with block, you can search and replace your text:

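with open('file.txt','r+') as f:
    filedata = f.read()
    filedata = filedata.replace('abc','xyz')
    f.seek(0)      # go back to the start; read() left the position at the end
    f.truncate()   # drop the old content from the current position on
    f.write(filedata)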

Answer 6

Your problem stems from reading from and writing to the same file. Rather than opening fileToSearch for writing, open an actual temporary file and then after you’re done and have closed tempFile, use os.rename to move the new file over fileToSearch.
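
The temporary-file approach described above could be sketched roughly as follows. The function name replace_via_temp_file is made up, and os.replace is used instead of os.rename because it also overwrites an existing destination on Windows (Python 3.3+):

```python
import os
import tempfile


def replace_via_temp_file(path, old, new):
    """Write the replaced text to a temporary file, then move it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so the final move stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp, open(path, "r") as src:
            for line in src:
                tmp.write(line.replace(old, new))
        # Swap the finished file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the half-written temporary file and re-raise.
        os.remove(tmp_path)
        raise
```

Since the original is only replaced once the new file is complete, an interruption midway leaves the original untouched. Note that this sketch does not copy the original file's permissions to the new file.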


Answer 7

(pip install python-util)

from pyutil import filereplace

filereplace("somefile.txt","abcd","ram")

The second parameter (the thing to be replaced, e.g. “abcd”) can also be a regex.
All occurrences will be replaced.


Answer 8

My variant, one word at a time on the entire file.

I read it into memory.

import os
import sys

def replace_word(infile,old_word,new_word):
    if not os.path.isfile(infile):
        print ("Error on replace_word, not a regular file: "+infile)
        sys.exit(1)

    # Read the whole file into memory, then write the replaced text back.
    with open(infile,'r') as f1:
        text=f1.read()
    with open(infile,'w') as f2:
        f2.write(text.replace(old_word,new_word))

Answer 9

I have done this:

#!/usr/bin/env python3

import fileinput
import os

Dir = input ("Source directory: ")
os.chdir(Dir)

Filelist = os.listdir()
print('File list: ',Filelist)

NomeFile = input ("Insert file name: ")

CarOr = input ("Text to search: ")

CarNew = input ("New text: ")

with fileinput.FileInput(NomeFile, inplace=True, backup='.bak') as file:
    for line in file:
        print(line.replace(CarOr, CarNew), end='')


Answer 10

I modified Jayram Singh’s post slightly in order to replace every instance of a ‘!’ character with a number that increments with each instance. I thought it might be helpful to someone who wants to modify a character that occurs more than once per line and number the occurrences. Hope that helps someone. PS: I’m very new at coding, so apologies if my post is inappropriate in any way, but this worked for me.

f1 = open('file1.txt', 'r')
f2 = open('file2.txt', 'w')
n = 1  

# if word=='!'replace w/ [n] & increment n; else append same word to     
# file2

for line in f1:
    for word in line:
        if word == '!':
            f2.write(word.replace('!', f'[{n}]'))
            n += 1
        else:
            f2.write(word)
f1.close()
f2.close()

Answer 11

def word_replace(filename,old,new):
    c=0
    with open(filename,'r+',encoding ='utf-8') as f:
        a=f.read()
        b=a.split()
        for i in range(0,len(b)):
            if b[i]==old:
                c=c+1
        old=old.center(len(old)+2)
        new=new.center(len(new)+2)
        d=a.replace(old,new,c)
        f.truncate(0)
        f.seek(0)
        f.write(d)
    print('All words have been replaced!!!')

Answer 12

Like so:

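def find_and_replace(file, word, replacement):
  with open(file, 'r+') as f:
    text = f.read()
    f.seek(0)  # rewind; otherwise the new text is appended after the old
    f.write(text.replace(word, replacement))
    f.truncate()  # cut off leftovers if the new text is shorter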

Answer 13

def findReplace(find, replace):
    import os
    src = os.path.join(os.getcwd(), os.pardir)
    for path, dirs, files in os.walk(os.path.abspath(src)):
        for name in files:
            if name.endswith('.py'):
                filepath = os.path.join(path, name)
                with open(filepath) as f:
                    s = f.read()
                s = s.replace(find, replace)
                with open(filepath, "w") as f:
                    f.write(s)

How do I check whether a string in Python is ASCII?

Question: How do I check whether a string in Python is ASCII?

I want to check whether a string is in ASCII or not.

I am aware of ord(), however when I try ord('é'), I get TypeError: ord() expected a character, but string of length 2 found. I understood it is caused by the way I built Python (as explained in ord()'s documentation).

Is there another way to check?


Answer 0

def is_ascii(s):
    return all(ord(c) < 128 for c in s)

Answer 1

I think you are not asking the right question–

A string in python has no property corresponding to ‘ascii’, utf-8, or any other encoding. The source of your string (whether you read it from a file, input from a keyboard, etc.) may have encoded a unicode string in ascii to produce your string, but that’s where you need to go for an answer.

Perhaps the question you can ask is: “Is this string the result of encoding a unicode string in ascii?” — This you can answer by trying:

try:
    mystring.decode('ascii')
except UnicodeDecodeError:
    print "it was not a ascii-encoded unicode string"
else:
    print "It may have been an ascii-encoded unicode string"

Answer 2

In Python 3, we can encode the string as UTF-8, then check whether the length stays the same. If so, then the original string is ASCII.

def isascii(s):
    """Check if the characters in string s are in ASCII, U+0-U+7F."""
    return len(s) == len(s.encode())

To check, pass the test string:

>>> isascii("♥O◘♦♥O◘♦")
False
>>> isascii("Python")
True

Answer 3

New in Python 3.7 (bpo32677)

No more tiresome/inefficient ASCII checks on strings: the new built-in str/bytes/bytearray method .isascii() checks whether the string is ASCII.

print("is this ascii?".isascii())
# True
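
If the code also has to run on interpreters older than 3.7, one possible sketch is to try the built-in method first and fall back to a manual check. The helper name is_ascii is an assumption for this example:

```python
def is_ascii(s):
    """Return True if s (str, bytes or bytearray) contains only ASCII."""
    try:
        # Available on str, bytes and bytearray since Python 3.7.
        return s.isascii()
    except AttributeError:
        # Fallback for older Python 3 versions: iterating a str yields
        # characters, iterating bytes yields integers.
        return all((ord(c) if isinstance(c, str) else c) < 128 for c in s)
```

As with the built-in method, an empty string counts as ASCII.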

Answer 4

Ran into something like this recently – for future reference

import chardet

encoding = chardet.detect(string)
if encoding['encoding'] == 'ascii':
    print 'string is in ascii'

which you could use with:

string_ascii = string.decode(encoding['encoding']).encode('ascii')

Answer 5

Vincent Marchetti has the right idea, but str.decode no longer exists in Python 3, where str is already Unicode text. In Python 3 you can make the same test with str.encode:

try:
    mystring.encode('ascii')
except UnicodeEncodeError:
    pass  # string is not ascii
else:
    pass  # string is ascii

Note the exception you want to catch has also changed from UnicodeDecodeError to UnicodeEncodeError.


Answer 6

Your question is incorrect; the error you see is not a result of how you built python, but of a confusion between byte strings and unicode strings.

Byte strings (e.g. "foo" or 'bar', in Python 2 syntax) are sequences of octets; numbers from 0-255. Unicode strings (e.g. u"foo" or u'bar') are sequences of unicode code points; numbers from 0-1114111 (0x10FFFF). But you appear to be interested in the character é, which (in your terminal) is a multi-byte sequence that represents a single character.

Instead of ord(u'é'), try this:

>>> [ord(x) for x in u'é']

That tells you which sequence of code points “é” represents. It may give you [233], or it may give you [101, 769].

Instead of chr() to reverse this, there is unichr():

>>> unichr(233)
u'\xe9'

This character may actually be represented as either a single or multiple unicode “code points”, which themselves represent either graphemes or characters. It’s either “e with an acute accent” (code point 233), or “e” (code point 101) followed by “an acute accent on the previous character” (code point 769, i.e. U+0301). So this exact same character may be presented as the Python data structure u'e\u0301' or u'\u00e9'.

Most of the time you shouldn’t have to care about this, but it can become an issue if you are iterating over a unicode string, as iteration works by code point, not by decomposable character. In other words, len(u'e\u0301') == 2 and len(u'\u00e9') == 1. If this matters to you, you can convert between composed and decomposed forms by using unicodedata.normalize.
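
A small sketch of that conversion, written for Python 3 (where the u prefix is optional) rather than the Python 2 used in this answer:

```python
import unicodedata

decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE

nfc = unicodedata.normalize("NFC", decomposed)  # compose into one code point
nfd = unicodedata.normalize("NFD", composed)    # decompose into two
```

The two spellings compare unequal as raw strings, so normalizing both sides to the same form before comparing or counting characters avoids surprises.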

The Unicode Glossary can be a helpful guide to understanding some of these issues, by pointing out how each specific term refers to a different part of the representation of text, which is far more complicated than many programmers realize.


Answer 7

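How about doing this? Note that string.ascii_letters contains only a-z and A-Z, so this check also rejects digits, whitespace, and punctuation.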

import string

def isAscii(s):
    for c in s:
        if c not in string.ascii_letters:
            return False
    return True

Answer 8

I found this question while trying to determine how to use/encode/decode a string whose encoding I wasn’t sure of (and how to escape/convert special characters in that string).

My first step should have been to check the type of the string. I didn’t realize that I could get good data about its formatting from type(s). This answer was very helpful and got to the real root of my issues.

If you’re getting a rude and persistent

UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xc3 in position 263: ordinal not in range(128)

particularly when you’re ENCODING, make sure you’re not trying to unicode() a string that already IS unicode- for some terrible reason, you get ascii codec errors. (See also the Python Kitchen recipe, and the Python docs tutorials for better understanding of how terrible this can be.)

Eventually I determined that what I wanted to do was this:

escaped_string = unicode(original_string.encode('ascii','xmlcharrefreplace'))

Also helpful in debugging was setting the default coding in my file to utf-8 (put this at the beginning of your python file):

# -*- coding: utf-8 -*-

That allows you to test special characters ('àéç') without having to use their unicode escapes (u'\xe0\xe9\xe7').

>>> specials='àéç'
>>> specials.decode('latin-1').encode('ascii','xmlcharrefreplace')
'&#224;&#233;&#231;'

Answer 9

To improve on Alexander’s solution, from Python 2.6 on (and in Python 3.x) you can use the helper module curses.ascii and its curses.ascii.isascii() function, or various others: https://docs.python.org/2.6/library/curses.ascii.html

from curses import ascii

def isascii(s):
    return all(ascii.isascii(c) for c in s)

Answer 10

You could use the regular expression library which accepts the Posix standard [[:ASCII:]] definition.


Answer 11

A string (str-type) in Python 2 is a series of bytes. There is no way of telling just from looking at the string whether this series of bytes represents an ASCII string, a string in an 8-bit charset like ISO-8859-1, or a string encoded with UTF-8 or UTF-16 or whatever.

However if you know the encoding used, then you can decode the str into a unicode string and then use a regular expression (or a loop) to check if it contains characters outside of the range you are concerned about.


Answer 12

Like @RogerDahl’s answer but it’s more efficient to short-circuit by negating the character class and using search instead of find_all or match.

>>> import re
>>> re.search('[^\x00-\x7F]', 'Did you catch that \x00?') is not None
False
>>> re.search('[^\x00-\x7F]', 'Did you catch that \xFF?') is not None
True

I imagine a regular expression is well-optimized for this.


Answer 13

import re

def is_ascii(s):
    return bool(re.match(r'[\x00-\x7F]+$', s))

To include an empty string as ASCII, change the + to *.


Answer 14

To prevent your code from crashing, you may want to use a try-except to catch TypeErrors:

>>> ord("¶")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found

For example

def is_ascii(s):
    try:
        return all(ord(c) < 128 for c in s)
    except TypeError:
        return False

Answer 15

I use the following to determine if the string is ascii or unicode:

>>> print 'test string'.__class__.__name__
str
>>> print u'test string'.__class__.__name__
unicode
>>> 

Then just use a conditional block to define the function:

def is_ascii(input):
    if input.__class__.__name__ == "str":
        return True
    return False

How do I convert an int to a hex string?

Question: How do I convert an int to a hex string?

I want to take an integer (that will be <= 255), to a hex string representation

e.g.: I want to pass in 65 and get out '\x41', or 255 and get '\xff'.

I’ve tried doing this with the struct.pack('c',65), but that chokes on anything above 9 since it wants to take in a single character string.


Answer 0

You are looking for the chr function.

You seem to be mixing decimal representations of integers and hex representations of integers, so it’s not entirely clear what you need. Based on the description you gave, I think one of these snippets shows what you want.

>>> chr(0x65) == '\x65'
True


>>> hex(65)
'0x41'
>>> chr(65) == '\x41'
True

Note that this is quite different from a string containing an integer as hex. If that is what you want, use the hex builtin.
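
The session above is Python 2, where chr() returns a one-byte str. As a hedged sketch of the Python 3 equivalents, where text and bytes are separate types (the variable names are made up for this example):

```python
n = 65

as_text_char = chr(n)              # one-character str: 'A', i.e. '\x41'
as_single_byte = bytes([n])        # one-element bytes object: b'A', i.e. b'\x41'
as_hex_digits = format(n, "02x")   # the hex digits as text: '41'
```

Which of the three is wanted depends on whether the goal is a character, a raw byte to send or store, or a printable hex representation.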


Answer 1

This will convert an integer to a 2 digit hex string with the 0x prefix:

strHex = "0x%0.2X" % 255

Answer 2

What about hex()?

hex(255)  # 0xff

If you really want to have \ in front you can do:

print '\\' + hex(255)[1:]

Answer 3

Try:

"0x%x" % 255 # => 0xff

or

"0x%X" % 255 # => 0xFF

The Python documentation says: “keep this under your pillow”: http://docs.python.org/library/index.html


Answer 4

Let me add this one, because sometimes you just want the single-digit representation (the format code can be lowercase 'x' or uppercase 'X'; the choice determines whether the output letters are lowercase or uppercase):

'{:x}'.format(15)
> f

And now with the new f'' format strings you can do:

f'{15:x}'
> f

To add 0 padding you can use 0>n:

f'{2034:0>4X}'
> 07F2

NOTE: the initial ‘f’ in f'{15:x}' is to signify a format string
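The format codes above can be combined with a width and a prefix flag. A small sketch, assuming Python 3.6 or later for f-strings (the variable names are only illustrative):

```python
value = 255
small = 5

plain = f"{value:x}"       # lowercase digits, no padding
upper = f"{value:X}"       # uppercase digits
padded = f"{small:02x}"    # zero-padded to two digits
prefixed = f"{small:#04x}" # '#' adds 0x; the width of 4 counts the prefix

print(plain, upper, padded, prefixed)
```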


Answer 5

If you want to pack a struct with a value <= 255 (one unsigned byte, uint8_t) and end up with a string of one character, you’re probably looking for the format B instead of c. The c format converts a character to a string (not too useful by itself), while B converts an integer.

struct.pack('B', 65)

(And yes, 65 is \x41, not \x65.)

The struct module will also conveniently handle endianness for communication or other uses.
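To make the difference between the two format codes and the byte-order prefixes concrete, here is a short sketch, assuming Python 3, where struct.pack returns bytes and the c format expects a one-byte bytes object:

```python
import struct

one_byte = struct.pack("B", 65)        # integer in, one byte out
one_char = struct.pack("c", b"A")      # 'c' takes a bytes object of length 1
little = struct.pack("<H", 0x1234)     # '<' selects little-endian
big = struct.pack(">H", 0x1234)        # '>' selects big-endian

print(one_byte, one_char, little, big)
```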


Answer 6

Note that for large values, hex() still works (some other answers don’t):

x = hex(349593196107334030177678842158399357)
print(x)

Python 2: 0x4354467b746f6f5f736d616c6c3f7dL
Python 3: 0x4354467b746f6f5f736d616c6c3f7d

For a decrypted RSA message, one could do the following:

import binascii

hexadecimals = hex(349593196107334030177678842158399357)

print(binascii.unhexlify(hexadecimals[2:-1])) # python 2
print(binascii.unhexlify(hexadecimals[2:])) # python 3
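As an alternative sketch, assuming Python 3, int.to_bytes and int.from_bytes convert between large integers and bytes without slicing the hex string. The sample message below is simply the ASCII text that the hex digits in this answer spell out:

```python
message = b"CTF{too_small?}"

# bytes -> int -> hex digits
number = int.from_bytes(message, byteorder="big")
hex_digits = format(number, "x")

# int -> bytes: round the bit length up to whole bytes
length = (number.bit_length() + 7) // 8
recovered = number.to_bytes(length, byteorder="big")

print(hex_digits)
print(recovered)
```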

Answer 7

This worked best for me

"0x%02X" % 5  # => 0x05
"0x%02X" % 17 # => 0x11

Change the 2 if you want a wider number (2 means two printed hex characters); for example, 3 gives you the following:

"0x%03X" % 5  # => 0x005
"0x%03X" % 17 # => 0x011

Answer 8

I wanted a random integer converted into a six-digit hex string with a # at the beginning. To get this I used

"#%06x" % random.randint(0, 0xFFFFFF)

Answer 9

With format(), as per format-examples, we can do:

>>> # format also supports binary numbers
>>> "int: {0:d};  hex: {0:x};  oct: {0:o};  bin: {0:b}".format(42)
'int: 42;  hex: 2a;  oct: 52;  bin: 101010'
>>> # with 0x, 0o, or 0b as prefix:
>>> "int: {0:d};  hex: {0:#x};  oct: {0:#o};  bin: {0:#b}".format(42)
'int: 42;  hex: 0x2a;  oct: 0o52;  bin: 0b101010'

Answer 10

(int_variable).to_bytes(bytes_length, byteorder='big'|'little').hex()

For example:

>>> (434).to_bytes(4, byteorder='big').hex()
'000001b2'
>>> (434).to_bytes(4, byteorder='little').hex()
'b2010000'

Answer 11

You can also convert a number written in any base to hex. This one-liner is simple to use:

hex(int(n,x)).replace("0x","")

Here n is a string holding your number and x is the base of that number. First convert it to an integer and then to hex; hex() puts 0x at the start of its result, so replace removes it.
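A small variation on the same idea, sketched under the assumption that a helper function is wanted (to_hex is a made-up name): format with the x code produces the digits without a 0x prefix, so no replace step is needed.

```python
def to_hex(digits, base):
    """Convert a number written as text in `base` to lowercase hex digits."""
    return format(int(digits, base), "x")


print(to_hex("777", 8))    # octal input
print(to_hex("1010", 2))   # binary input
print(to_hex("255", 10))   # decimal input
```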


Answer 12

As an alternative representation you could use

[in] '%s' % hex(15)
[out]'0xf'

What exactly does the .join() method do?

Question: What exactly does the .join() method do?

I’m pretty new to Python and am completely confused by .join() which I have read is the preferred method for concatenating strings.

I tried:

strid = repr(595)
print array.array('c', random.sample(string.ascii_letters, 20 - len(strid)))
    .tostring().join(strid)

and got something like:

5wlfgALGbXOahekxSs9wlfgALGbXOahekxSs5

Why does it work like this? Shouldn’t the 595 just be automatically appended?


Answer 0

Look carefully at your output:

5wlfgALGbXOahekxSs9wlfgALGbXOahekxSs5
^                 ^                 ^

I’ve highlighted the “5”, “9”, “5” of your original string. The Python join() method is a string method, and takes a list of things to join with the string. A simpler example might help explain:

>>> ",".join(["a", "b", "c"])
'a,b,c'

The “,” is inserted between each element of the given list. In your case, your “list” is the string representation “595”, which is treated as the list [“5”, “9”, “5”].

It appears that you’re looking for + instead:

print array.array('c', random.sample(string.ascii_letters, 20 - len(strid)))
.tostring() + strid
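The code above is Python 2. As a rough Python 3 sketch of what the question seems to be after (random letters padded out to 20 characters, followed by the id), join with an empty separator glues the sampled letters together and + appends the id. The variable names padding and result are only illustrative:

```python
import random
import string

strid = repr(595)  # the text '595'

# random.sample returns a list of single characters; "".join glues them together
padding = "".join(random.sample(string.ascii_letters, 20 - len(strid)))
result = padding + strid

print(result)
```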

Answer 1

join takes an iterable thing as an argument. Usually it’s a list. The problem in your case is that a string is itself iterable, giving out each character in turn. Your code breaks down to this:

"wlfgALGbXOahekxSs".join("595")

which acts the same as this:

"wlfgALGbXOahekxSs".join(["5", "9", "5"])

and so produces your string:

"5wlfgALGbXOahekxSs9wlfgALGbXOahekxSs5"

Strings as iterables is one of the most confusing beginning issues with Python.
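A short sketch of that behaviour, using a shorter separator than the question's so the effect is easier to see:

```python
separator = "-"

from_string = separator.join("595")            # the string is iterated character by character
from_list = separator.join(["5", "9", "5"])    # the equivalent explicit list
whole = separator.join(["595"])                # a one-element list: nothing to separate

print(from_string, from_list, whole)
```

Wrapping the string in a list, as in the last line, is what keeps it from being split into characters.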


Answer 2

To append a string, just concatenate it with the + sign.

E.g.

>>> a = "Hello, "
>>> b = "world"
>>> str = a + b
>>> print str
Hello, world

join connects strings together with a separator. The separator is what you place right before the join. E.g.

>>> "-".join([a,b])
'Hello, -world'

Join takes a list of strings as a parameter.


Answer 3

join() is for concatenating all list elements. For concatenating just two strings “+” would make more sense:

strid = repr(595)
print array.array('c', random.sample(string.ascii_letters, 20 - len(strid)))
    .tostring() + strid

Answer 4

To expand a bit more on what others are saying, if you wanted to use join to simply concatenate your two strings, you would do this:

strid = repr(595)
print ''.join([array.array('c', random.sample(string.ascii_letters, 20 - len(strid)))
    .tostring(), strid])

Answer 5

There is a good explanation of why it is costly to use + for concatenating a large number of strings here

The plus operator is a perfectly fine solution for concatenating two Python strings. But if you keep adding more than two strings (n > 25), you might want to consider something else.

The ''.join([a, b, c]) trick is a performance optimization.
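The two approaches can be sketched side by side. Both functions below are hypothetical helpers that return the same text; the difference is that repeated + may copy the growing string on each step, while join builds the result in one pass. No timings are claimed here, since they depend on the interpreter and the input:

```python
def build_with_plus(parts):
    """Concatenate by repeatedly appending to a growing string."""
    result = ""
    for part in parts:
        result += part
    return result


def build_with_join(parts):
    """Concatenate in a single call."""
    return "".join(parts)


pieces = [str(i) for i in range(30)]
print(build_with_plus(pieces) == build_with_join(pieces))
```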


Answer 6

Given this as input,

li = ['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
s = ";".join(li)
print(s)

Python prints this as output:

server=mpilgrim;uid=sa;database=master;pwd=secret

Answer 7

words = ["my", "name", "is", "kourosh"]
" ".join(words)

Given this input, the join method puts a space between the words and converts the list to a single string.

This is the Python output:

'my name is kourosh'

Answer 8

"".join can be used to combine the characters in a list into a single string variable:

>>> myList = list("Hello World")
>>> myString = "".join(myList)
>>> print(myList)
['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd']
>>> print(myString)
Hello World

What is the difference between a string and a byte string?

Question: What is the difference between a string and a byte string?

I am working with a library which returns a byte string and I need to convert this to a string.

Although I’m not sure what the difference is – if any.


Answer 0

Assuming Python 3 (in Python 2, this difference is a little less well-defined) – a string is a sequence of characters, ie unicode codepoints; these are an abstract concept, and can’t be directly stored on disk. A byte string is a sequence of, unsurprisingly, bytes – things that can be stored on disk. The mapping between them is an encoding – there are quite a lot of these (and infinitely many are possible) – and you need to know which applies in the particular case in order to do the conversion, since a different encoding may map the same bytes to a different string:

>>> b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'.decode('utf-16')
'蓏콯캁澽苏'
>>> b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'.decode('utf-8')
'τoρνoς'

Once you know which one to use, you can use the .decode() method of the byte string to get the right character string from it as above. For completeness, the .encode() method of a character string goes the opposite way:

>>> 'τoρνoς'.encode('utf-8')
b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'
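To see how much the choice of encoding matters, here is a small sketch, assuming Python 3, that decodes the same bytes twice. Latin-1 is used as the second encoding only because it maps every byte to some character, so it never raises:

```python
text = "\u03c4o\u03c1\u03bdo\u03c2"   # the same Greek word as above, written with escapes
data = text.encode("utf-8")

as_utf8 = data.decode("utf-8")        # the right encoding recovers the text
as_latin1 = data.decode("latin-1")    # the wrong one yields one character per byte

print(len(text), len(data), len(as_latin1))
print(as_utf8 == text, as_latin1 == text)
```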

Answer 1

The only thing that a computer can store is bytes.

To store anything in a computer, you must first encode it, i.e. convert it to bytes. For example:

  • If you want to store music, you must first encode it using MP3, WAV, etc.
  • If you want to store a picture, you must first encode it using PNG, JPEG, etc.
  • If you want to store text, you must first encode it using ASCII, UTF-8, etc.

MP3, WAV, PNG, JPEG, ASCII and UTF-8 are examples of encodings. An encoding is a format to represent audio, images, text, etc in bytes.

In Python, a byte string is just that: a sequence of bytes. It isn’t human-readable. Under the hood, everything must be converted to a byte string before it can be stored in a computer.

On the other hand, a character string, often just called a “string”, is a sequence of characters. It is human-readable. A character string can’t be directly stored in a computer, it has to be encoded first (converted into a byte string). There are multiple encodings through which a character string can be converted into a byte string, such as ASCII and UTF-8.

'I am a string'.encode('ASCII')

The above Python code will encode the string 'I am a string' using the encoding ASCII. The result of the above code will be a byte string. If you print it, Python will represent it as b'I am a string'. Remember, however, that byte strings aren’t human-readable, it’s just that Python decodes them from ASCII when you print them. In Python, a byte string is represented by a b, followed by the byte string’s ASCII representation.

A byte string can be decoded back into a character string, if you know the encoding that was used to encode it.

b'I am a string'.decode('ASCII')

The above code will return the original string 'I am a string'.

Encoding and decoding are inverse operations. Everything must be encoded before it can be written to disk, and it must be decoded before it can be read by a human.
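A brief sketch of that round trip, assuming Python 3. The accented word is an added example showing that ASCII cannot encode every character, while UTF-8 can:

```python
text = "I am a string"
encoded = text.encode("ascii")       # str -> bytes
decoded = encoded.decode("ascii")    # bytes -> str

accented = "caf\u00e9"
utf8_bytes = accented.encode("utf-8")
try:
    accented.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    # ASCII has no byte for the accented letter
    ascii_ok = False

print(encoded, decoded, utf8_bytes, ascii_ok)
```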


Answer 2

Note: I will focus my answer on Python 3, since Python 2 is very close to the end of its life.

In Python 3

bytes consists of sequences of 8-bit unsigned values, while str consists of sequences of Unicode code points that represent textual characters from human languages.

>>> # bytes
>>> b = b'h\x65llo'
>>> type(b)
<class 'bytes'>
>>> list(b)
[104, 101, 108, 108, 111]
>>> print(b)
b'hello'
>>>
>>> # str
>>> s = 'nai\u0308ve'
>>> type(s)
<class 'str'>
>>> list(s)
['n', 'a', 'i', '̈', 'v', 'e']
>>> print(s)
naïve

Even though bytes and str seem to work the same way, their instances are not compatible with each other, i.e., bytes and str instances can’t be used together with operators like > and +. In addition, keep in mind that comparing bytes and str instances for equality, i.e. using ==, will always evaluate to False, even when they contain exactly the same characters.

>>> # concatenation
>>> b'hi' + b'bye' # this is possible
b'hibye'
>>> 'hi' + 'bye' # this is also possible
'hibye'
>>> b'hi' + 'bye' # this will fail
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't concat str to bytes
>>> 'hi' + b'bye' # this will also fail
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "bytes") to str
>>>
>>> # comparison
>>> b'red' > b'blue' # this is possible
True
>>> 'red'> 'blue' # this is also possible
True
>>> b'red' > 'blue' # you can't compare bytes with str
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '>' not supported between instances of 'bytes' and 'str'
>>> 'red' > b'blue' # you can't compare str with bytes
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '>' not supported between instances of 'str' and 'bytes'
>>> b'blue' == 'red' # equality between str and bytes always evaluates to False
False
>>> b'blue' == 'blue' # equality between str and bytes always evaluates to False
False

Another issue with bytes and str arises when working with files opened with the built-in open function. On one hand, if you want to read or write binary data from/to a file, always open the file in a binary mode like 'rb' or 'wb'. On the other hand, if you want to read or write Unicode data from/to a file, be aware of your computer’s default encoding, and pass the encoding parameter if necessary to avoid surprises.
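A minimal sketch of the two modes, assuming Python 3 and a throwaway file in a temporary directory (the path and file name are made up for the example):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Text mode with an explicit encoding: str in, str out
with open(path, "w", encoding="utf-8") as f:
    f.write("na\u00efve")

# Binary mode: the raw encoded bytes, no decoding applied
with open(path, "rb") as f:
    raw = f.read()

with open(path, "r", encoding="utf-8") as f:
    text = f.read()

print(raw, text)
```

Passing encoding explicitly keeps the result independent of the machine's default encoding.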

In Python 2

str consists of sequences of 8-bit values, while unicode consists of sequences of Unicode characters. One thing to keep in mind is that str and unicode can be used together with operators if str only consists of 7-bit ASCII characters.

It might be useful to use helper functions to convert between str and unicode in Python 2, and between bytes and str in Python 3.
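One possible shape for such helpers in Python 3 is sketched below. The names to_str and to_bytes and the UTF-8 default are assumptions made for the example, not part of any library:

```python
def to_str(value, encoding="utf-8"):
    """Return `value` as str, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Return `value` as bytes, encoding it first if it is str."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


print(to_str(b"hello"), to_bytes("hello"))
```

Calling these at the boundaries of a program lets the rest of the code work with a single type.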


Answer 3

From What is Unicode:

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one.

……

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

So when a computer represents a string, it identifies each character by its unique Unicode number, and these numbers are stored in memory. But you can’t write the string to disk or send it over a network directly as Unicode numbers, because those numbers are just plain integers. You first encode the string to a byte string, for example with UTF-8. UTF-8 is a character encoding capable of encoding all possible characters, and it stores characters as bytes. An encoded string can be used almost everywhere, because UTF-8 is supported nearly everywhere. When you open a text file encoded in UTF-8 from another system, your computer decodes it and displays the characters in it via their unique Unicode numbers. When a browser receives string data encoded in UTF-8 from the network, it decodes the data to a string (assuming the browser is using UTF-8) and displays the string.

In Python 3, you can convert between strings and byte strings:

>>> print('中文'.encode('utf-8'))
b'\xe4\xb8\xad\xe6\x96\x87'
>>> print(b'\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8'))
中文 

In short, a string is for displaying text to humans on a computer, and a byte string is for storing to disk and transmitting data.


Answer 4

Unicode is an agreed-upon format for the binary representation of characters and various kinds of formatting (e.g. lower case/upper case, new line, carriage return), and other “things” (e.g. emojis). A computer is no less capable of storing a unicode representation (a series of bits), whether in memory or in a file, than it is of storing an ascii representation (a different series of bits), or any other representation (series of bits).

For communication to take place, the parties to the communication must agree on what representation will be used.

Because unicode seeks to represent all the possible characters (and other “things”) used in inter-human and inter-computer communication, it requires a greater number of bits for the representation of many characters (or things) than other systems of representation that seek to represent a more limited set of characters/things. To “simplify,” and perhaps to accommodate historical usage, unicode representation is almost exclusively converted to some other system of representation (e.g. ascii) for the purpose of storing characters in files.

It is not the case that unicode cannot be used for storing characters in files, or transmitting them through any communications channel, simply that it is not.

The term “string,” is not precisely defined. “String,” in its common usage, refers to a set of characters/things. In a computer, those characters may be stored in any one of many different bit-by-bit representations. A “byte string” is a set of characters stored using a representation that uses eight bits (eight bits being referred to as a byte). Since, these days, computers use the unicode system (characters represented by a variable number of bytes) to store characters in memory, and byte strings (characters represented by single bytes) to store characters to files, a conversion must be used before characters represented in memory will be moved into storage in files.


Answer 5

Let’s have a simple one-character string 'š' and encode it into a sequence of bytes:

>>> 'š'.encode('utf-8')
b'\xc5\xa1'

For the purpose of this example let’s display the sequence of bytes in its binary form:

>>> bin(int(b'\xc5\xa1'.hex(), 16))
'0b1100010110100001'

Now it is generally not possible to decode the information back without knowing how it was encoded. Only if you know that the utf-8 text encoding was used, you can follow the algorithm for decoding utf-8 and acquire the original string:

11000101 10100001
   ^^^^^   ^^^^^^
   00101   100001

You can display the binary number 101100001 back as a string:

>>> chr(int('101100001', 2))
'š'
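The bit selection shown above can be written out as code. This sketch, assuming Python 3, handles only the two-byte UTF-8 form used in this example: the first byte contributes its low five bits and the second byte its low six bits.

```python
encoded = "\u0161".encode("utf-8")   # the two bytes from the example above

first, second = encoded              # iterating over bytes yields integers

# Drop the 110 and 10 marker bits, then join the payload bits
code_point = ((first & 0b00011111) << 6) | (second & 0b00111111)
decoded = chr(code_point)

print(bin(code_point), decoded)
```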

Answer 6

The Python language includes str and bytes as standard “Built-in Types”. In other words, they are both classes. I don’t think it’s worthwhile trying to rationalize why Python has been implemented this way.

Having said that, str and bytes are very similar to one another. Both share most of the same methods. The following methods are unique to the str class:

casefold
encode
format
format_map
isdecimal
isidentifier
isnumeric
isprintable

The following methods are unique to the bytes class:

decode
fromhex
hex
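The lists above can be recomputed for whichever Python 3 version is in use by comparing dir(str) with dir(bytes). This is a sketch: the exact contents vary between versions, so no full output is shown here.

```python
str_only = set(dir(str)) - set(dir(bytes))
bytes_only = set(dir(bytes)) - set(dir(str))

# Leave out double-underscore names to focus on the public methods
public_str_only = sorted(name for name in str_only if not name.startswith("_"))
public_bytes_only = sorted(name for name in bytes_only if not name.startswith("_"))

print(public_str_only)
print(public_bytes_only)
```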

Repeat string to certain length

Question: Repeat string to certain length

What is an efficient way to repeat a string to a certain length? Eg: repeat('abc', 7) -> 'abcabca'

Here is my current code:

def repeat(string, length):
    cur, old = 1, string
    while len(string) < length:
        string += old[cur-1]
        cur = (cur+1)%len(old)
    return string

Is there a better (more pythonic) way to do this? Maybe using list comprehension?


Answer 0

def repeat_to_length(string_to_expand, length):
   return (string_to_expand * ((length/len(string_to_expand))+1))[:length]

For Python 3:

def repeat_to_length(string_to_expand, length):
    return (string_to_expand * (length // len(string_to_expand) + 1))[:length]

Answer 1

Jason Scheirer’s answer is correct but could use some more exposition.

First off, to repeat a string an integer number of times, you can use overloaded multiplication:

>>> 'abc' * 7
'abcabcabcabcabcabcabc'

So, to repeat a string until it’s at least as long as the length you want, you calculate the appropriate number of repeats and put it on the right-hand side of that multiplication operator:

def repeat_to_at_least_length(s, wanted):
    return s * (wanted//len(s) + 1)

>>> repeat_to_at_least_length('abc', 7)
'abcabcabc'

Then, you can trim it to the exact length you want with an array slice:

def repeat_to_length(s, wanted):
    return (s * (wanted//len(s) + 1))[:wanted]

>>> repeat_to_length('abc', 7)
'abcabca'

Alternatively, as suggested in pillmod’s answer that probably nobody scrolls down far enough to notice anymore, you can use divmod to compute the number of full repetitions needed, and the number of extra characters, all at once:

def pillmod_repeat_to_length(s, wanted):
    a, b = divmod(wanted, len(s))
    return s * a + s[:b]

Which is better? Let’s benchmark it:

>>> import timeit
>>> timeit.repeat('scheirer_repeat_to_length("abcdefg", 129)', globals=globals())
[0.3964178159367293, 0.32557755894958973, 0.32851039397064596]
>>> timeit.repeat('pillmod_repeat_to_length("abcdefg", 129)', globals=globals())
[0.5276265419088304, 0.46511475392617285, 0.46291469305288047]

So, pillmod’s version is something like 40% slower, which is too bad, since personally I think it’s much more readable. There are several possible reasons for this, starting with its compiling to about 40% more bytecode instructions.

Note: these examples use the new-ish // operator for truncating integer division. This is often called a Python 3 feature, but according to PEP 238, it was introduced all the way back in Python 2.2. You only have to use it in Python 3 (or in modules that have from __future__ import division) but you can use it regardless.
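As a quick check that the two approaches are interchangeable, the sketch below restates both functions from this answer and compares their results over a range of lengths, including 0 and lengths shorter than the string:

```python
def repeat_to_length(s, wanted):
    # Over-repeat by one copy, then slice down to the exact length
    return (s * (wanted // len(s) + 1))[:wanted]


def pillmod_repeat_to_length(s, wanted):
    # a full copies plus the first b characters of one more copy
    a, b = divmod(wanted, len(s))
    return s * a + s[:b]


agree = all(
    repeat_to_length("abc", n) == pillmod_repeat_to_length("abc", n)
    for n in range(20)
)
print(agree, repeat_to_length("abc", 7))
```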


Answer 2

This is pretty pythonic:

newstring = 'abc'*5
print newstring[0:6]

Answer 3

def rep(s, m):
    a, b = divmod(m, len(s))
    return s * a + s[:b]

Answer 4

from itertools import cycle, islice
def srepeat(string, n):
   return ''.join(islice(cycle(string), n))

Answer 5

Perhaps not the most efficient solution, but certainly short & simple:

def repstr(string, length):
    return (string * length)[0:length]

repstr("foobar", 14)

This gives “foobarfoobarfo”. Note that if length < len(string), the output is simply the first length characters of string. For example:

repstr("foobar", 3)

Gives “foo”.

Edit: actually to my surprise, this is faster than the currently accepted solution (the ‘repeat_to_length’ function), at least on short strings:

from timeit import Timer
t1 = Timer("repstr('foofoo', 30)", 'from __main__ import repstr')
t2 = Timer("repeat_to_length('foofoo', 30)", 'from __main__ import repeat_to_length')
t1.timeit()  # gives ~0.35 secs
t2.timeit()  # gives ~0.43 secs

Presumably, if the string is long or length is very high (that is, if the wasteful string * length step builds a large intermediate string), it will perform poorly. In fact, we can modify the above to verify this:

from timeit import Timer
t1 = Timer("repstr('foofoo' * 10, 3000)", 'from __main__ import repstr')
t2 = Timer("repeat_to_length('foofoo' * 10, 3000)", 'from __main__ import repeat_to_length')
t1.timeit()  # gives ~18.85 secs
t2.timeit()  # gives ~1.13 secs
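The slowdown follows from the size of the intermediate string. As a rough sketch, the snippet below compares how many characters each approach builds before slicing; `repstr_lean` is a hypothetical variant introduced here for illustration (it is not the `repeat_to_length` function referenced above), and both functions assume a non-empty string.

```python
def repstr(string, length):
    # Version from this answer: builds len(string) * length characters
    # before slicing.
    return (string * length)[0:length]

def repstr_lean(string, length):
    # Hypothetical variant: repeat only enough whole copies to cover `length`.
    copies = length // len(string) + 1
    return (string * copies)[:length]

s, n = 'foofoo' * 10, 3000
built_by_repstr = len(s) * n                  # 180000 characters
built_by_lean = len(s) * (n // len(s) + 1)    # 3060 characters
print(built_by_repstr, built_by_lean)

# Both produce the same result; only the intermediate work differs.
assert repstr(s, n) == repstr_lean(s, n)
```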

Answer 6

How about string * (length // len(string)) + string[0:(length % len(string))] (using // so the repeat count stays an integer under Python 3)?


Answer 7

I use this:

def extend_string(s, l):
    return (s*l)[:l]

Answer 8

Not that there haven’t been enough answers to this question, but there is a repeat function in itertools; you just need to join its output:

from itertools import repeat

def rep(s, n):
    # repeat(s, n) yields s exactly n times; join concatenates the copies
    return ''.join(repeat(s, n))
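Note that `itertools.repeat(s, n)` yields `n` whole copies of `s`, not `n` characters. A minimal sketch of adapting it to a target length follows; `rep_to_length` is a hypothetical name introduced here, and it assumes `s` is non-empty.

```python
from itertools import repeat

def rep_to_length(s, length):
    # Hypothetical helper: take enough whole copies to cover `length`
    # (ceiling division), then trim the excess characters.
    copies = -(-length // len(s))
    return ''.join(repeat(s, copies))[:length]

print(rep_to_length('abc', 7))  # abcabca
```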

Answer 9

Yay recursion!

def trunc(s,l):
    if l > 0:
        return s[:l] + trunc(s, l - len(s))
    return ''

Won’t scale forever, but it’s fine for smaller strings. And it’s pretty.

I admit I just read the Little Schemer and I like recursion right now.
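To illustrate the scaling limit: the recursion uses one stack frame per copy of `s`, so a short `s` with a large `l` can exceed the interpreter's recursion limit. The sketch below repeats `trunc` so it runs on its own and adds `trunc_iter`, a hypothetical loop-based equivalent that is not part of the original answer. Both assume `s` is non-empty.

```python
import sys

def trunc(s, l):
    # Recursive version from this answer, repeated so the sketch runs alone.
    if l > 0:
        return s[:l] + trunc(s, l - len(s))
    return ''

def trunc_iter(s, l):
    # Hypothetical loop-based equivalent: same slices, no call stack growth.
    parts = []
    while l > 0:
        parts.append(s[:l])
        l -= len(s)
    return ''.join(parts)

assert trunc('abc', 7) == trunc_iter('abc', 7) == 'abcabca'

# Ask for more frames than the interpreter allows.
deep = sys.getrecursionlimit() * 2
try:
    trunc('a', deep)
except RuntimeError:  # RecursionError is a RuntimeError subclass in Python 3.5+
    print('recursion limit hit')

# The loop-based version handles the same request.
print(len(trunc_iter('a', deep)) == deep)
```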


Answer 10

This is one way to do it using a list comprehension, though it’s increasingly wasteful as the length of the rpt string increases.

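def repeat(rpt, length):
    # build enough whole copies of rpt to cover length, then trim the excess
    # (the original range(0, len(rpt) % length) returned too few copies,
    # e.g. an empty string whenever length was a multiple of len(rpt))
    return ''.join([rpt for x in range(length // len(rpt) + 1)])[:length]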

Answer 11

Another FP approach:

def repeat_string(string_to_repeat, repetitions):
    return ''.join([ string_to_repeat for n in range(repetitions)])

Answer 12

def extended_string (word, length) :

    extra_long_word = word * (length//len(word) + 1)
    required_string = extra_long_word[:length]
    return required_string

print(extended_string("abc", 7))
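As a quick sanity check, the sketch below exercises the same logic on a few boundary lengths. The function is repeated in condensed form so the check is self-contained, and the chosen cases are illustrative assumptions rather than part of the original answer.

```python
def extended_string(word, length):
    # Same logic as the answer above: one extra copy guarantees coverage,
    # then the slice trims to the requested length.
    return (word * (length // len(word) + 1))[:length]

# Hypothetical spot checks: zero length, shorter than the word,
# an exact multiple, and a partial final copy.
cases = {0: '', 2: 'ab', 6: 'abcabc', 7: 'abcabca'}
for length, expected in cases.items():
    assert extended_string('abc', length) == expected
```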