标签归档:string

如何确定子字符串是否在其他字符串中

问题:如何确定子字符串是否在其他字符串中

我有一个子字符串:

substring = "please help me out"

我还有另一个字符串:

string = "please help me out so that I could solve this"

如何查找是否substringstring使用Python 的子集?

I have a sub-string:

substring = "please help me out"

I have another string:

string = "please help me out so that I could solve this"

How do I find if substring is a subset of string using Python?


回答 0

insubstring in string

>>> substring = "please help me out"
>>> string = "please help me out so that I could solve this"
>>> substring in string
True

with in: substring in string:

>>> substring = "please help me out"
>>> string = "please help me out so that I could solve this"
>>> substring in string
True

回答 1

foo = "blahblahblah"
bar = "somethingblahblahblahmeep"
if foo in bar:
    # do something

(顺便说一句-尝试不命名变量string,因为存在一个具有相同名称的Python标准库。如果在大型项目中这样做,可能会使人们感到困惑,因此避免这种冲突是一种很好的习惯。)

foo = "blahblahblah"
bar = "somethingblahblahblahmeep"
if foo in bar:
    # do something

(By the way – try to not name a variable string, since there’s a Python standard library with the same name. You might confuse people if you do that in a large project, so avoiding collisions like that is a good habit to get into.)


回答 2

如果您要寻找的不是True / False,那么最适合使用re模块,例如:

import re
search="please help me out"
fullstring="please help me out so that I could solve this"
s = re.search(search,fullstring)
print(s.group())

s.group() 将返回字符串“请帮助我”。

If you’re looking for more than a True/False, you’d be best suited to use the re module, like:

import re
search="please help me out"
fullstring="please help me out so that I could solve this"
s = re.search(search,fullstring)
print(s.group())

s.group() will return the string “please help me out”.


回答 3

我想我会在您不希望使用Python的内置函数inor的情况下进行技术面试时添加此内容find,这很可怕,但是确实发生了:

string = "Samantha"
word = "man"

def find_sub_string(word, string):
  len_word = len(word)  #returns 3

  for i in range(len(string)-1):
    if string[i: i + len_word] == word:
  return True

  else:
    return False

Thought I would add this in case you are looking at how to do this for a technical interview where they don’t want you to use Python’s built-in function in or find, which is horrible, but does happen:

string = "Samantha"
word = "man"

def find_sub_string(word, string):
  len_word = len(word)  #returns 3

  for i in range(len(string)-1):
    if string[i: i + len_word] == word:
  return True

  else:
    return False

回答 4

人们提到了string.find()string.index(),并string.indexOf()在评论中进行了总结(根据Python文档):

首先没有string.indexOf()方法。Deviljho发布的链接显示这是一个JavaScript函数。

其次,string.find()string.index()实际上返回子字符串的索引。唯一的区别是它们如何处理子字符串没有发现的情况:string.find()回报-1string.index()引发ValueError

People mentioned string.find(), string.index(), and string.indexOf() in the comments, and I summarize them here (according to the Python Documentation):

First of all there is not a string.indexOf() method. The link posted by Deviljho shows this is a JavaScript function.

Second the string.find() and string.index() actually return the index of a substring. The only difference is how they handle the substring not found situation: string.find() returns -1 while string.index() raises an ValueError.


回答 5

您也可以尝试find()方法。它确定字符串str是出现在字符串中还是出现在字符串的子字符串中。

str1 = "please help me out so that I could solve this"
str2 = "please help me out"

if (str1.find(str2)>=0):
  print("True")
else:
  print ("False")

You can also try find() method. It determines if string str occurs in string, or in a substring of string.

str1 = "please help me out so that I could solve this"
str2 = "please help me out"

if (str1.find(str2)>=0):
  print("True")
else:
  print ("False")

回答 6

In [7]: substring = "please help me out"

In [8]: string = "please help me out so that I could solve this"

In [9]: substring in string
Out[9]: True
In [7]: substring = "please help me out"

In [8]: string = "please help me out so that I could solve this"

In [9]: substring in string
Out[9]: True

回答 7

def find_substring():
    s = 'bobobnnnnbobmmmbosssbob'
    cnt = 0
    for i in range(len(s)):
        if s[i:i+3] == 'bob':
            cnt += 1
    print 'bob found: ' + str(cnt)
    return cnt

def main():
    print(find_substring())

main()
def find_substring():
    s = 'bobobnnnnbobmmmbosssbob'
    cnt = 0
    for i in range(len(s)):
        if s[i:i+3] == 'bob':
            cnt += 1
    print 'bob found: ' + str(cnt)
    return cnt

def main():
    print(find_substring())

main()

回答 8

也可以用这种方法

if substring in string:
    print(string + '\n Yes located at:'.format(string.find(substring)))

Can also use this method

if substring in string:
    print(string + '\n Yes located at:'.format(string.find(substring)))

回答 9

在此处输入图片说明

除了使用find()之外,一种简单的方法是如上所述使用’in’。

如果’str’中存在’substring’,则将执行part,否则执行part。

enter image description here

Instead Of using find(), One of the easy way is the Use of ‘in’ as above.

if ‘substring’ is present in ‘str’ then if part will execute otherwise else part will execute.


如何在Python中拆分和解析字符串?

问题:如何在Python中拆分和解析字符串?

我正在尝试在python中拆分此字符串: 2.7.0_bf4fda703454

我想在下划线处拆分该字符串,_以便可以使用左侧的值。

I am trying to split this string in python: 2.7.0_bf4fda703454

I want to split that string on the underscore _ so that I can use the value on the left side.


回答 0

"2.7.0_bf4fda703454".split("_") 给出字符串列表:

In [1]: "2.7.0_bf4fda703454".split("_")
Out[1]: ['2.7.0', 'bf4fda703454']

这会在每个下划线处拆分字符串。如果要在第一次拆分后停止它,请使用"2.7.0_bf4fda703454".split("_", 1)

如果您知道字符串中包含下划线,那么您甚至可以将LHS和RHS解压缩为单独的变量:

In [8]: lhs, rhs = "2.7.0_bf4fda703454".split("_", 1)

In [9]: lhs
Out[9]: '2.7.0'

In [10]: rhs
Out[10]: 'bf4fda703454'

另一种方法是使用partition()。用法与上一个示例类似,不同之处在于它返回三个组件而不是两个。主要优点是,如果字符串不包含分隔符,则此方法不会失败。

"2.7.0_bf4fda703454".split("_") gives a list of strings:

In [1]: "2.7.0_bf4fda703454".split("_")
Out[1]: ['2.7.0', 'bf4fda703454']

This splits the string at every underscore. If you want it to stop after the first split, use "2.7.0_bf4fda703454".split("_", 1).

If you know for a fact that the string contains an underscore, you can even unpack the LHS and RHS into separate variables:

In [8]: lhs, rhs = "2.7.0_bf4fda703454".split("_", 1)

In [9]: lhs
Out[9]: '2.7.0'

In [10]: rhs
Out[10]: 'bf4fda703454'

An alternative is to use partition(). The usage is similar to the last example, except that it returns three components instead of two. The principal advantage is that this method doesn’t fail if the string doesn’t contain the separator.


回答 1

Python字符串解析演练

在空间上分割字符串,获取列表,显示其类型,然后打印出来:

el@apollo:~/foo$ python
>>> mystring = "What does the fox say?"

>>> mylist = mystring.split(" ")

>>> print type(mylist)
<type 'list'>

>>> print mylist
['What', 'does', 'the', 'fox', 'say?']

如果两个相邻的定界符相邻,则假定为空字符串:

el@apollo:~/foo$ python
>>> mystring = "its  so   fluffy   im gonna    DIE!!!"

>>> print mystring.split(" ")
['its', '', 'so', '', '', 'fluffy', '', '', 'im', 'gonna', '', '', '', 'DIE!!!']

在下划线处分割字符串,并获取列表中的第五项:

el@apollo:~/foo$ python
>>> mystring = "Time_to_fire_up_Kowalski's_Nuclear_reactor."

>>> mystring.split("_")[4]
"Kowalski's"

将多个空格合为一个

el@apollo:~/foo$ python
>>> mystring = 'collapse    these       spaces'

>>> mycollapsedstring = ' '.join(mystring.split())

>>> print mycollapsedstring.split(' ')
['collapse', 'these', 'spaces']

当您不向Python的split方法传递任何参数时,文档指出:“连续的空白行被视为单个分隔符,如果字符串的开头或结尾是空白,则结果在开头或结尾将不包含空字符串”。

抓住男孩们的帽子,解析一个正则表达式:

el@apollo:~/foo$ python
>>> mystring = 'zzzzzzabczzzzzzdefzzzzzzzzzghizzzzzzzzzzzz'
>>> import re
>>> mylist = re.split("[a-m]+", mystring)
>>> print mylist
['zzzzzz', 'zzzzzz', 'zzzzzzzzz', 'zzzzzzzzzzzz']

正则表达式“[AM] +”装置中的小写字母a通过m发生一次或多次被作为分隔符相匹配。re是要导入的库。

或者,如果您想一次将一项剁碎:

el@apollo:~/foo$ python
>>> mystring = "theres coffee in that nebula"

>>> mytuple = mystring.partition(" ")

>>> print type(mytuple)
<type 'tuple'>

>>> print mytuple
('theres', ' ', 'coffee in that nebula')

>>> print mytuple[0]
theres

>>> print mytuple[2]
coffee in that nebula

Python string parsing walkthrough

Split a string on space, get a list, show its type, print it out:

el@apollo:~/foo$ python
>>> mystring = "What does the fox say?"

>>> mylist = mystring.split(" ")

>>> print type(mylist)
<type 'list'>

>>> print mylist
['What', 'does', 'the', 'fox', 'say?']

If you have two delimiters next to each other, empty string is assumed:

el@apollo:~/foo$ python
>>> mystring = "its  so   fluffy   im gonna    DIE!!!"

>>> print mystring.split(" ")
['its', '', 'so', '', '', 'fluffy', '', '', 'im', 'gonna', '', '', '', 'DIE!!!']

Split a string on underscore and grab the 5th item in the list:

el@apollo:~/foo$ python
>>> mystring = "Time_to_fire_up_Kowalski's_Nuclear_reactor."

>>> mystring.split("_")[4]
"Kowalski's"

Collapse multiple spaces into one

el@apollo:~/foo$ python
>>> mystring = 'collapse    these       spaces'

>>> mycollapsedstring = ' '.join(mystring.split())

>>> print mycollapsedstring.split(' ')
['collapse', 'these', 'spaces']

When you pass no parameter to Python’s split method, the documentation states: “runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace”.

Hold onto your hats boys, parse on a regular expression:

el@apollo:~/foo$ python
>>> mystring = 'zzzzzzabczzzzzzdefzzzzzzzzzghizzzzzzzzzzzz'
>>> import re
>>> mylist = re.split("[a-m]+", mystring)
>>> print mylist
['zzzzzz', 'zzzzzz', 'zzzzzzzzz', 'zzzzzzzzzzzz']

The regular expression “[a-m]+” means the lowercase letters a through m that occur one or more times are matched as a delimiter. re is a library to be imported.

Or if you want to chomp the items one at a time:

el@apollo:~/foo$ python
>>> mystring = "theres coffee in that nebula"

>>> mytuple = mystring.partition(" ")

>>> print type(mytuple)
<type 'tuple'>

>>> print mytuple
('theres', ' ', 'coffee in that nebula')

>>> print mytuple[0]
theres

>>> print mytuple[2]
coffee in that nebula

回答 2

如果总是将LHS / RHS分开,则也可以使用partition字符串中内置的方法。它返回一个三元组,就(LHS, separator, RHS)好像找到了分隔符,(original_string, '', '')如果不存在分隔符:

>>> "2.7.0_bf4fda703454".partition('_')
('2.7.0', '_', 'bf4fda703454')

>>> "shazam".partition("_")
('shazam', '', '')

If it’s always going to be an even LHS/RHS split, you can also use the partition method that’s built into strings. It returns a 3-tuple as (LHS, separator, RHS) if the separator is found, and (original_string, '', '') if the separator wasn’t present:

>>> "2.7.0_bf4fda703454".partition('_')
('2.7.0', '_', 'bf4fda703454')

>>> "shazam".partition("_")
('shazam', '', '')

在python中将字符串转换为二进制

问题:在python中将字符串转换为二进制

我需要一种方法来获取python中字符串的二进制表示形式。例如

st = "hello world"
toBinary(st)

是否有一些巧妙的方法来做到这一点?

I am in need of a way to get the binary representation of a string in python. e.g.

st = "hello world"
toBinary(st)

Is there a module of some neat way of doing this?


回答 0

像这样吗

>>> st = "hello world"
>>> ' '.join(format(ord(x), 'b') for x in st)
'1101000 1100101 1101100 1101100 1101111 100000 1110111 1101111 1110010 1101100 1100100'

#using `bytearray`
>>> ' '.join(format(x, 'b') for x in bytearray(st, 'utf-8'))
'1101000 1100101 1101100 1101100 1101111 100000 1110111 1101111 1110010 1101100 1100100'

Something like this?

>>> st = "hello world"
>>> ' '.join(format(ord(x), 'b') for x in st)
'1101000 1100101 1101100 1101100 1101111 100000 1110111 1101111 1110010 1101100 1100100'

#using `bytearray`
>>> ' '.join(format(x, 'b') for x in bytearray(st, 'utf-8'))
'1101000 1100101 1101100 1101100 1101111 100000 1110111 1101111 1110010 1101100 1100100'

回答 1

作为一种更pythonic的方式,您可以先将字符串转换为字节数组,然后在其中使用binfunction map

>>> st = "hello world"
>>> map(bin,bytearray(st))
['0b1101000', '0b1100101', '0b1101100', '0b1101100', '0b1101111', '0b100000', '0b1110111', '0b1101111', '0b1110010', '0b1101100', '0b1100100']

或者您可以加入它:

>>> ' '.join(map(bin,bytearray(st)))
'0b1101000 0b1100101 0b1101100 0b1101100 0b1101111 0b100000 0b1110111 0b1101111 0b1110010 0b1101100 0b1100100'

请注意,在python3中,您需要为bytearrayfunction 指定编码:

>>> ' '.join(map(bin,bytearray(st,'utf8')))
'0b1101000 0b1100101 0b1101100 0b1101100 0b1101111 0b100000 0b1110111 0b1101111 0b1110010 0b1101100 0b1100100'

您也可以binascii在python 2中使用模块:

>>> import binascii
>>> bin(int(binascii.hexlify(st),16))
'0b110100001100101011011000110110001101111001000000111011101101111011100100110110001100100'

hexlify返回二进制数据的十六进制表示形式,然后可以通过将16指定为基数将其转换为int,然后使用转换为int bin

As a more pythonic way you can first convert your string to byte array then use bin function within map :

>>> st = "hello world"
>>> map(bin,bytearray(st))
['0b1101000', '0b1100101', '0b1101100', '0b1101100', '0b1101111', '0b100000', '0b1110111', '0b1101111', '0b1110010', '0b1101100', '0b1100100']

Or you can join it:

>>> ' '.join(map(bin,bytearray(st)))
'0b1101000 0b1100101 0b1101100 0b1101100 0b1101111 0b100000 0b1110111 0b1101111 0b1110010 0b1101100 0b1100100'

Note that in python3 you need to specify an encoding for bytearray function :

>>> ' '.join(map(bin,bytearray(st,'utf8')))
'0b1101000 0b1100101 0b1101100 0b1101100 0b1101111 0b100000 0b1110111 0b1101111 0b1110010 0b1101100 0b1100100'

You can also use binascii module in python 2:

>>> import binascii
>>> bin(int(binascii.hexlify(st),16))
'0b110100001100101011011000110110001101111001000000111011101101111011100100110110001100100'

hexlify return the hexadecimal representation of the binary data then you can convert to int by specifying 16 as its base then convert it to binary with bin.


回答 2

我们只需要对其编码。

'string'.encode('ascii')

We just need to encode it.

'string'.encode('ascii')

回答 3

您可以使用ord()内置函数访问字符串中字符的代码值。如果然后需要以二进制格式设置此格式,则该string.format()方法将完成此工作。

a = "test"
print(' '.join(format(ord(x), 'b') for x in a))

(感谢Ashwini Chaudhary发布了该代码段。)

尽管以上代码在Python 3中有效,但是如果您假设使用除UTF-8之外的任何其他编码,则此问题将变得更加复杂。在Python 2中,字符串是字节序列,默认情况下采用ASCII编码。在Python 3中,字符串被假定为Unicode,并且还有一个单独的bytes类型,其行为更像Python 2字符串。如果您希望采用UTF-8以外的任何其他编码,则需要指定编码。

然后,在Python 3中,您可以执行以下操作:

a = "test"
a_bytes = bytes(a, "ascii")
print(' '.join(["{0:b}".format(x) for x in a_bytes]))

对于简单的字母数字字符串,UTF-8和ascii编码之间的区别不会很明显,但是如果您要处理包含不在ascii字符集中的字符的文本,它将变得很重要。

You can access the code values for the characters in your string using the ord() built-in function. If you then need to format this in binary, the string.format() method will do the job.

a = "test"
print(' '.join(format(ord(x), 'b') for x in a))

(Thanks to Ashwini Chaudhary for posting that code snippet.)

While the above code works in Python 3, this matter gets more complicated if you’re assuming any encoding other than UTF-8. In Python 2, strings are byte sequences, and ASCII encoding is assumed by default. In Python 3, strings are assumed to be Unicode, and there’s a separate bytes type that acts more like a Python 2 string. If you wish to assume any encoding other than UTF-8, you’ll need to specify the encoding.

In Python 3, then, you can do something like this:

a = "test"
a_bytes = bytes(a, "ascii")
print(' '.join(["{0:b}".format(x) for x in a_bytes]))

The differences between UTF-8 and ascii encoding won’t be obvious for simple alphanumeric strings, but will become important if you’re processing text that includes characters not in the ascii character set.


回答 4

在Python 3.6及更高版本中,您可以使用f-string格式化结果。

str = "hello world"
print(" ".join(f"{ord(i):08b}" for i in str))

01101000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100
  • 冒号的左侧ord(i)是实际对象,其值将被格式化并插入到输出中。使用ord()可为您提供单个str字符的以10为底的代码点。

  • 冒号的右侧是格式说明符。08表示宽度8,填充0,b表示输出以2为底的数字(二进制​​)的符号。

In Python version 3.6 and above you can use f-string to format result.

str = "hello world"
print(" ".join(f"{ord(i):08b}" for i in str))

01101000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100
  • The left side of the colon, ord(i), is the actual object whose value will be formatted and inserted into the output. Using ord() gives you the base-10 code point for a single str character.

  • The right hand side of the colon is the format specifier. 08 means width 8, 0 padded, and the b functions as a sign to output the resulting number in base 2 (binary).


回答 5

这是对现有答案的更新,该答案已使用bytearray()并且无法再以这种方式工作:

>>> st = "hello world"
>>> map(bin, bytearray(st))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: string argument without an encoding

因为,如上面的链接所述,如果源是字符串,则 还必须提供编码

>>> map(bin, bytearray(st, encoding='utf-8'))
<map object at 0x7f14dfb1ff28>

This is an update for the existing answers which used bytearray() and can not work that way anymore:

>>> st = "hello world"
>>> map(bin, bytearray(st))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: string argument without an encoding

Because, as explained in the link above, if the source is a string, you must also give the encoding:

>>> map(bin, bytearray(st, encoding='utf-8'))
<map object at 0x7f14dfb1ff28>

回答 6

def method_a(sample_string):
    binary = ' '.join(format(ord(x), 'b') for x in sample_string)

def method_b(sample_string):
    binary = ' '.join(map(bin,bytearray(sample_string,encoding='utf-8')))


if __name__ == '__main__':

    from timeit import timeit

    sample_string = 'Convert this ascii strong to binary.'

    print(
        timeit(f'method_a("{sample_string}")',setup='from __main__ import method_a'),
        timeit(f'method_b("{sample_string}")',setup='from __main__ import method_b')
    )

# 9.564299999998184 2.943955828988692

method_b转换为字节数组的效率更高,因为它进行低级函数调用,而不是手动将每个字符转换为整数,然后将该整数转换为其二进制值。

def method_a(sample_string):
    binary = ' '.join(format(ord(x), 'b') for x in sample_string)

def method_b(sample_string):
    binary = ' '.join(map(bin,bytearray(sample_string,encoding='utf-8')))


if __name__ == '__main__':

    from timeit import timeit

    sample_string = 'Convert this ascii strong to binary.'

    print(
        timeit(f'method_a("{sample_string}")',setup='from __main__ import method_a'),
        timeit(f'method_b("{sample_string}")',setup='from __main__ import method_b')
    )

# 9.564299999998184 2.943955828988692

method_b is substantially more efficient at converting to a byte array because it makes low level function calls instead of manually transforming every character to an integer, and then converting that integer into its binary value.


回答 7

a = list(input("Enter a string\t: "))
def fun(a):
    c =' '.join(['0'*(8-len(bin(ord(i))[2:]))+(bin(ord(i))[2:]) for i in a])
    return c
print(fun(a))
a = list(input("Enter a string\t: "))
def fun(a):
    c =' '.join(['0'*(8-len(bin(ord(i))[2:]))+(bin(ord(i))[2:]) for i in a])
    return c
print(fun(a))

将pandas数据框中的列从int转换为string

问题:将pandas数据框中的列从int转换为string

我在pandas中有一个数据帧,其中包含int和str数据列。我想先串联数据框内的列。为此,我必须将int列转换为str。我尝试做如下:

mtrx['X.3'] = mtrx.to_string(columns = ['X.3'])

要么

mtrx['X.3'] = mtrx['X.3'].astype(str)

但是在两种情况下都无法正常工作,并且我收到一条错误消息:“无法连接’str’和’int’对象”。连接两str列效果很好。

I have a dataframe in pandas with mixed int and str data columns. I want to concatenate first the columns within the dataframe. To do that I have to convert an int column to str. I’ve tried to do as follows:

mtrx['X.3'] = mtrx.to_string(columns = ['X.3'])

or

mtrx['X.3'] = mtrx['X.3'].astype(str)

but in both cases it’s not working and I’m getting an error saying “cannot concatenate ‘str’ and ‘int’ objects”. Concatenating two str columns is working perfectly fine.


回答 0

In [16]: df = DataFrame(np.arange(10).reshape(5,2),columns=list('AB'))

In [17]: df
Out[17]: 
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9

In [18]: df.dtypes
Out[18]: 
A    int64
B    int64
dtype: object

转换系列

In [19]: df['A'].apply(str)
Out[19]: 
0    0
1    2
2    4
3    6
4    8
Name: A, dtype: object

In [20]: df['A'].apply(str)[0]
Out[20]: '0'

不要忘记将结果分配回去:

df['A'] = df['A'].apply(str)

转换整个框架

In [21]: df.applymap(str)
Out[21]: 
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9

In [22]: df.applymap(str).iloc[0,0]
Out[22]: '0'

df = df.applymap(str)
In [16]: df = DataFrame(np.arange(10).reshape(5,2),columns=list('AB'))

In [17]: df
Out[17]: 
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9

In [18]: df.dtypes
Out[18]: 
A    int64
B    int64
dtype: object

Convert a series

In [19]: df['A'].apply(str)
Out[19]: 
0    0
1    2
2    4
3    6
4    8
Name: A, dtype: object

In [20]: df['A'].apply(str)[0]
Out[20]: '0'

Don’t forget to assign the result back:

df['A'] = df['A'].apply(str)

Convert the whole frame

In [21]: df.applymap(str)
Out[21]: 
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9

In [22]: df.applymap(str).iloc[0,0]
Out[22]: '0'

df = df.applymap(str)

回答 1

更改DataFrame列的数据类型:

要诠释:

df.column_name = df.column_name.astype(np.int64)

要str:

df.column_name = df.column_name.astype(str)

Change data type of DataFrame column:

To int:

df.column_name = df.column_name.astype(np.int64)

To str:

df.column_name = df.column_name.astype(str)


回答 2

警告:给定的两个解决方案 astype()和apply()都不以nan或None形式保留NULL值。

import pandas as pd
import numpy as np

df = pd.DataFrame([None,'string',np.nan,42], index=[0,1,2,3], columns=['A'])

df1 = df['A'].astype(str)
df2 =  df['A'].apply(str)

print df.isnull()
print df1.isnull()
print df2.isnull()

我相信这是由to_string()的实现解决的

Warning: Both solutions given ( astype() and apply() ) do not preserve NULL values in either the nan or the None form.

import pandas as pd
import numpy as np

df = pd.DataFrame([None,'string',np.nan,42], index=[0,1,2,3], columns=['A'])

df1 = df['A'].astype(str)
df2 =  df['A'].apply(str)

print df.isnull()
print df1.isnull()
print df2.isnull()

I believe this is fixed by the implementation of to_string()


回答 3

使用以下代码:

df.column_name = df.column_name.astype('str')

Use the following code:

df.column_name = df.column_name.astype('str')

回答 4

仅供参考。

以上所有答案均适用于数据帧的情况。但是,如果您在创建/修改列时使用lambda,则此方法将不起作用,因为在那里将其视为int属性而不是pandas系列。您必须使用str(target_attribute)使其成为字符串。请参考以下示例。

def add_zero_in_prefix(df):
    if(df['Hour']<10):
        return '0' + str(df['Hour'])

data['str_hr'] = data.apply(add_zero_in_prefix, axis=1)

Just for an additional reference.

All of the above answers will work in case of a data frame. But if you are using lambda while creating / modify a column this won’t work, Because there it is considered as a int attribute instead of pandas series. You have to use str( target_attribute ) to make it as a string. Please refer the below example.

def add_zero_in_prefix(df):
    if(df['Hour']<10):
        return '0' + str(df['Hour'])

data['str_hr'] = data.apply(add_zero_in_prefix, axis=1)

将十六进制转换为二进制

问题:将十六进制转换为二进制

我有ABC123EFFF。

我想拥有00101010111100000100001111110111111111111111(例如,二进制表示,具有42位数字和前导零)。

怎么样?

I have ABC123EFFF.

I want to have 001010101111000001001000111110111111111111 (i.e. binary repr. with, say, 42 digits and leading zeroes).

How?


回答 0

为了解决左侧尾随零问题:


my_hexdata = "1a"

scale = 16 ## equals to hexadecimal

num_of_bits = 8

bin(int(my_hexdata, scale))[2:].zfill(num_of_bits)

它将给出00011010,而不是修剪后的版本。

For solving the left-side trailing zero problem:


my_hexdata = "1a"

scale = 16 ## equals to hexadecimal

num_of_bits = 8

bin(int(my_hexdata, scale))[2:].zfill(num_of_bits)

It will give 00011010 instead of the trimmed version.


回答 1

import binascii

binary_string = binascii.unhexlify(hex_string)

双歧杆菌

返回由指定为参数的十六进制字符串表示的二进制数据。

import binascii

binary_string = binascii.unhexlify(hex_string)

Read

binascii.unhexlify

Return the binary data represented by the hexadecimal string specified as the parameter.


回答 2

bin(int("abc123efff", 16))[2:]
bin(int("abc123efff", 16))[2:]

回答 3

将十六进制转换为二进制

我有ABC123EFFF。

我想拥有001010101111000001001000111110111111111111111(即具有42位数字和前导零的二进制表示)。

简短答案:

Python 3.6中的新f字符串允许您使用非常简洁的语法来执行此操作:

>>> f'{0xABC123EFFF:0>42b}'
'001010101111000001001000111110111111111111'

或将其与语义分开:

>>> number, pad, rjust, size, kind = 0xABC123EFFF, '0', '>', 42, 'b'
>>> f'{number:{pad}{rjust}{size}{kind}}'
'001010101111000001001000111110111111111111'

长答案:

您实际上要说的是,您有一个以十六进制表示形式的值,并且您想要以二进制形式表示一个等效值。

等效值是一个整数。但是您可以以字符串开头,并且要以二进制形式查看,必须以字符串结尾。

将十六进制转换为二进制,42位数字和前导零?

我们有几种直接的方法可以实现这一目标,而无需使用切片。

首先,在我们完全不能执行任何二进制操作之前,将其转换为int(我想这是字符串格式,而不是文字格式):

>>> integer = int('ABC123EFFF', 16)
>>> integer
737679765503

或者,我们可以使用以十六进制形式表示的整数文字:

>>> integer = 0xABC123EFFF
>>> integer
737679765503

现在我们需要用二进制表示形式来表示整数。

使用内置功能, format

然后传递到format

>>> format(integer, '0>42b')
'001010101111000001001000111110111111111111'

这使用格式规范的mini-language

分解一下,这是它的语法形式:

[[fill]align][sign][#][0][width][,][.precision][type]

为了使之成为满足我们需求的规范,我们只排除不需要的东西:

>>> spec = '{fill}{align}{width}{type}'.format(fill='0', align='>', width=42, type='b')
>>> spec
'0>42b'

然后将其传递给格式

>>> bin_representation = format(integer, spec)
>>> bin_representation
'001010101111000001001000111110111111111111'
>>> print(bin_representation)
001010101111000001001000111110111111111111

字符串格式化(模板化) str.format

我们可以在使用str.format方法的字符串中使用它:

>>> 'here is the binary form: {0:{spec}}'.format(integer, spec=spec)
'here is the binary form: 001010101111000001001000111110111111111111'

或者直接将规范放在原始字符串中:

>>> 'here is the binary form: {0:0>42b}'.format(integer)
'here is the binary form: 001010101111000001001000111110111111111111'

使用新的f字符串进行字符串格式化

让我们演示新的f字符串。它们使用相同的迷你语言格式设置规则:

>>> integer = 0xABC123EFFF
>>> length = 42
>>> f'{integer:0>{length}b}'
'001010101111000001001000111110111111111111'

现在,让我们将此功能放入鼓励重复使用性的功能中:

def bin_format(integer, length):
    return f'{integer:0>{length}b}'

现在:

>>> bin_format(0xABC123EFFF, 42)
'001010101111000001001000111110111111111111'    

在旁边

如果您实际上只是想将数据编码为内存或磁盘上的字节字符串,则可以使用int.to_bytes仅在Python 3中可用的方法:

>>> help(int.to_bytes)
to_bytes(...)
    int.to_bytes(length, byteorder, *, signed=False) -> bytes
...

由于42位除以每字节8位等于6个字节,因此:

>>> integer.to_bytes(6, 'big')
b'\x00\xab\xc1#\xef\xff'

Convert hex to binary

I have ABC123EFFF.

I want to have 001010101111000001001000111110111111111111 (i.e. binary repr. with, say, 42 digits and leading zeroes).

Short answer:

The new f-strings in Python 3.6 allow you to do this using very terse syntax:

>>> f'{0xABC123EFFF:0>42b}'
'001010101111000001001000111110111111111111'

or to break that up with the semantics:

>>> number, pad, rjust, size, kind = 0xABC123EFFF, '0', '>', 42, 'b'
>>> f'{number:{pad}{rjust}{size}{kind}}'
'001010101111000001001000111110111111111111'

Long answer:

What you are actually saying is that you have a value in a hexadecimal representation, and you want to represent an equivalent value in binary.

The value of equivalence is an integer. But you may begin with a string, and to view in binary, you must end with a string.

Convert hex to binary, 42 digits and leading zeros?

We have several direct ways to accomplish this goal, without hacks using slices.

First, before we can do any binary manipulation at all, convert to int (I presume this is in a string format, not as a literal):

>>> integer = int('ABC123EFFF', 16)
>>> integer
737679765503

alternatively we could use an integer literal as expressed in hexadecimal form:

>>> integer = 0xABC123EFFF
>>> integer
737679765503

Now we need to express our integer in a binary representation.

Use the builtin function, format

Then pass to format:

>>> format(integer, '0>42b')
'001010101111000001001000111110111111111111'

This uses the formatting specification’s mini-language.

To break that down, here’s the grammar form of it:

[[fill]align][sign][#][0][width][,][.precision][type]

To make that into a specification for our needs, we just exclude the things we don’t need:

>>> spec = '{fill}{align}{width}{type}'.format(fill='0', align='>', width=42, type='b')
>>> spec
'0>42b'

and just pass that to format

>>> bin_representation = format(integer, spec)
>>> bin_representation
'001010101111000001001000111110111111111111'
>>> print(bin_representation)
001010101111000001001000111110111111111111

String Formatting (Templating) with str.format

We can use that in a string using str.format method:

>>> 'here is the binary form: {0:{spec}}'.format(integer, spec=spec)
'here is the binary form: 001010101111000001001000111110111111111111'

Or just put the spec directly in the original string:

>>> 'here is the binary form: {0:0>42b}'.format(integer)
'here is the binary form: 001010101111000001001000111110111111111111'

String Formatting with the new f-strings

Let’s demonstrate the new f-strings. They use the same mini-language formatting rules:

>>> integer = 0xABC123EFFF
>>> length = 42
>>> f'{integer:0>{length}b}'
'001010101111000001001000111110111111111111'

Now let’s put this functionality into a function to encourage reusability:

def bin_format(integer, length):
    return f'{integer:0>{length}b}'

And now:

>>> bin_format(0xABC123EFFF, 42)
'001010101111000001001000111110111111111111'    

Aside

If you actually just wanted to encode the data as a string of bytes in memory or on disk, you can use the int.to_bytes method, which is only available in Python 3:

>>> help(int.to_bytes)
to_bytes(...)
    int.to_bytes(length, byteorder, *, signed=False) -> bytes
...

And since 42 bits divided by 8 bits per byte equals 6 bytes:

>>> integer.to_bytes(6, 'big')
b'\x00\xab\xc1#\xef\xff'

回答 4

>>> bin( 0xABC123EFFF )

‘0b1010101111000000000010001000111110111111111111’

>>> bin( 0xABC123EFFF )

‘0b1010101111000001001000111110111111111111’


回答 5

"{0:020b}".format(int('ABC123EFFF', 16))
"{0:020b}".format(int('ABC123EFFF', 16))

回答 6

这是一种使用位摆弄来生成二进制字符串的相当原始的方法。

要了解的关键是:

(n & (1 << i)) and 1

如果n的第i位被设置,它将生成0或1。


import binascii

def byte_to_binary(n):
    return ''.join(str((n & (1 << i)) and 1) for i in reversed(range(8)))

def hex_to_binary(h):
    return ''.join(byte_to_binary(ord(b)) for b in binascii.unhexlify(h))

print hex_to_binary('abc123efff')

>>> 1010101111000001001000111110111111111111

编辑:使用“新的”三元运算符:

(n & (1 << i)) and 1

会成为:

1 if n & (1 << i) or 0

(我不确定是哪个TBH的可读性)

Here’s a fairly raw way to do it using bit fiddling to generate the binary strings.

The key bit to understand is:

(n & (1 << i)) and 1

Which will generate either a 0 or 1 if the i’th bit of n is set.


import binascii

def byte_to_binary(n):
    return ''.join(str((n & (1 << i)) and 1) for i in reversed(range(8)))

def hex_to_binary(h):
    return ''.join(byte_to_binary(ord(b)) for b in binascii.unhexlify(h))

print hex_to_binary('abc123efff')

>>> 1010101111000001001000111110111111111111

Edit: using the “new” ternary operator this:

(n & (1 << i)) and 1

Would become:

1 if n & (1 << i) or 0

(Which TBH I’m not sure how readable that is)


回答 7

这与Glen Maynard的解决方案略有不同,我认为这是正确的解决方法。它只是添加了padding元素。


    def hextobin(self, hexval):
        '''
        Takes a string representation of hex data with
        arbitrary length and converts to string representation
        of binary.  Includes padding 0s
        '''
        thelen = len(hexval)*4
        binval = bin(int(hexval, 16))[2:]
        while ((len(binval)) < thelen):
            binval = '0' + binval
        return binval

把它拉出课堂。self, 如果您使用的是独立脚本,则只需取出。

This is a slight touch up to Glen Maynard’s solution, which I think is the right way to do it. It just adds the padding element.


    def hextobin(self, hexval):
        '''
        Takes a string representation of hex data with
        arbitrary length and converts to string representation
        of binary.  Includes padding 0s
        '''
        thelen = len(hexval)*4
        binval = bin(int(hexval, 16))[2:]
        while ((len(binval)) &lt thelen):
            binval = '0' + binval
        return binval

Pulled it out of a class. Just take out self, if you’re working in a stand-alone script.


回答 8

使用内置的format()函数int()函数 简单易懂。这是亚伦答案的简化版本

int()

int(string, base)

格式()

format(integer, # of bits)

# w/o 0b prefix
>> format(int("ABC123EFFF", 16), "040b")
1010101111000001001000111110111111111111

# with 0b prefix
>> format(int("ABC123EFFF", 16), "#042b")
0b1010101111000001001000111110111111111111

# w/o 0b prefix + 64bit
>> format(int("ABC123EFFF", 16), "064b")
0000000000000000000000001010101111000001001000111110111111111111

另请参阅此答案

Use Built-in format() function and int() function It’s simple and easy to understand. It’s little bit simplified version of Aaron answer

int()

int(string, base)

format()

format(integer, # of bits)

Example

# w/o 0b prefix
>> format(int("ABC123EFFF", 16), "040b")
1010101111000001001000111110111111111111

# with 0b prefix
>> format(int("ABC123EFFF", 16), "#042b")
0b1010101111000001001000111110111111111111

# w/o 0b prefix + 64bit
>> format(int("ABC123EFFF", 16), "064b")
0000000000000000000000001010101111000001001000111110111111111111

See also this answer


回答 9

将每个十六进制数字替换为相应的4个二进制数字:

1 - 0001
2 - 0010
...
a - 1010
b - 1011
...
f - 1111

Replace each hex digit with the corresponding 4 binary digits:

1 - 0001
2 - 0010
...
a - 1010
b - 1011
...
f - 1111

回答 10

十六进制->十进制然后是十进制->二进制

#decimal to binary 
def d2b(n):
    bStr = ''
    if n < 0: raise ValueError, "must be a positive integer"
    if n == 0: return '0'
    while n > 0:
        bStr = str(n % 2) + bStr
        n = n >> 1    
    return bStr

#hex to binary
def h2b(hex):
    return d2b(int(hex,16))

hex –> decimal then decimal –> binary

#decimal to binary 
def d2b(n):
    bStr = ''
    if n < 0: raise ValueError, "must be a positive integer"
    if n == 0: return '0'
    while n > 0:
        bStr = str(n % 2) + bStr
        n = n >> 1    
    return bStr

#hex to binary
def h2b(hex):
    return d2b(int(hex,16))

回答 11

其他方式:

import math

def hextobinary(hex_string):
    s = int(hex_string, 16) 
    num_digits = int(math.ceil(math.log(s) / math.log(2)))
    digit_lst = ['0'] * num_digits
    idx = num_digits
    while s > 0:
        idx -= 1
        if s % 2 == 1: digit_lst[idx] = '1'
        s = s / 2
    return ''.join(digit_lst)

print hextobinary('abc123efff')

Another way:

import math

def hextobinary(hex_string):
    s = int(hex_string, 16) 
    num_digits = int(math.ceil(math.log(s) / math.log(2)))
    digit_lst = ['0'] * num_digits
    idx = num_digits
    while s > 0:
        idx -= 1
        if s % 2 == 1: digit_lst[idx] = '1'
        s = s / 2
    return ''.join(digit_lst)

print hextobinary('abc123efff')

回答 12

我将填充位数的计算添加到Onedinkenedi的解决方案中。这是结果函数:

def hextobin(h):
  return bin(int(h, 16))[2:].zfill(len(h) * 4)

其中16是您要转换的基数(十六进制),4是表示每个数字需要多少位,或者以小数位对数2为底。

I added the calculation for the number of bits to fill to Onedinkenedi’s solution. Here is the resulting function:

def hextobin(h):
  return bin(int(h, 16))[2:].zfill(len(h) * 4)

Where 16 is the base you’re converting from (hexadecimal), and 4 is how many bits you need to represent each digit, or log base 2 of the scale.


回答 13

 def conversion():
    e=raw_input("enter hexadecimal no.:")
    e1=("a","b","c","d","e","f")
    e2=(10,11,12,13,14,15)
    e3=1
    e4=len(e)
    e5=()
    while e3<=e4:
        e5=e5+(e[e3-1],)
        e3=e3+1
    print e5
    e6=1
    e8=()
    while e6<=e4:
        e7=e5[e6-1]
        if e7=="A":
            e7=10
        if e7=="B":
            e7=11
        if e7=="C":
            e7=12
        if e7=="D":
            e7=13
        if e7=="E":
            e7=14
        if e7=="F":
            e7=15
        else:
            e7=int(e7)
        e8=e8+(e7,)
        e6=e6+1
    print e8

    e9=1
    e10=len(e8)
    e11=()
    while e9<=e10:
        e12=e8[e9-1]
        a1=e12
        a2=()
        a3=1 
        while a3<=1:
            a4=a1%2
            a2=a2+(a4,)
            a1=a1/2
            if a1<2:
                if a1==1:
                    a2=a2+(1,)
                if a1==0:
                    a2=a2+(0,)
                a3=a3+1
        a5=len(a2)
        a6=1
        a7=""
        a56=a5
        while a6<=a5:
            a7=a7+str(a2[a56-1])
            a6=a6+1
            a56=a56-1
        if a5<=3:
            if a5==1:
                a8="000"
                a7=a8+a7
            if a5==2:
                a8="00"
                a7=a8+a7
            if a5==3:
                a8="0"
                a7=a8+a7
        else:
            a7=a7
        print a7,
        e9=e9+1
 def conversion():
    e=raw_input("enter hexadecimal no.:")
    e1=("a","b","c","d","e","f")
    e2=(10,11,12,13,14,15)
    e3=1
    e4=len(e)
    e5=()
    while e3<=e4:
        e5=e5+(e[e3-1],)
        e3=e3+1
    print e5
    e6=1
    e8=()
    while e6<=e4:
        e7=e5[e6-1]
        if e7=="A":
            e7=10
        if e7=="B":
            e7=11
        if e7=="C":
            e7=12
        if e7=="D":
            e7=13
        if e7=="E":
            e7=14
        if e7=="F":
            e7=15
        else:
            e7=int(e7)
        e8=e8+(e7,)
        e6=e6+1
    print e8

    e9=1
    e10=len(e8)
    e11=()
    while e9<=e10:
        e12=e8[e9-1]
        a1=e12
        a2=()
        a3=1 
        while a3<=1:
            a4=a1%2
            a2=a2+(a4,)
            a1=a1/2
            if a1<2:
                if a1==1:
                    a2=a2+(1,)
                if a1==0:
                    a2=a2+(0,)
                a3=a3+1
        a5=len(a2)
        a6=1
        a7=""
        a56=a5
        while a6<=a5:
            a7=a7+str(a2[a56-1])
            a6=a6+1
            a56=a56-1
        if a5<=3:
            if a5==1:
                a8="000"
                a7=a8+a7
            if a5==2:
                a8="00"
                a7=a8+a7
            if a5==3:
                a8="0"
                a7=a8+a7
        else:
            a7=a7
        print a7,
        e9=e9+1

回答 14

我有一个短暂的希望,可以帮助:-)

input = 'ABC123EFFF'
for index, value in enumerate(input):
    print(value)
    print(bin(int(value,16)+16)[3:])

string = ''.join([bin(int(x,16)+16)[3:] for y,x in enumerate(input)])
print(string)

首先,我使用您的输入并枚举以获得每个符号。然后我将其转换为二进制文件,并从第3个位置修剪到最后。获得0的技巧是将输入的最大值相加->在这种情况下始终为16 :-)

缩写形式为join方法。请享用。

i have a short snipped hope that helps :-)

input = 'ABC123EFFF'
for index, value in enumerate(input):
    print(value)
    print(bin(int(value,16)+16)[3:])

string = ''.join([bin(int(x,16)+16)[3:] for y,x in enumerate(input)])
print(string)

first i use your input and enumerate it to get each symbol. then i convert it to binary and trim from 3th position to the end. The trick to get the 0 is to add the max value of the input -> in this case always 16 :-)

the short form ist the join method. Enjoy.


回答 15

# Python Program - Convert Hexadecimal to Binary
hexdec = input("Enter Hexadecimal string: ")
print(hexdec," in Binary = ", end="")    # end is by default "\n" which prints a new line
for _hex in hexdec:
    dec = int(_hex, 16)    # 16 means base-16 wich is hexadecimal
    print(bin(dec)[2:].rjust(4,"0"), end="")    # the [2:] skips 0b, and the 
# Python Program - Convert Hexadecimal to Binary
hexdec = input("Enter Hexadecimal string: ")
print(hexdec," in Binary = ", end="")    # end is by default "\n" which prints a new line
for _hex in hexdec:
    dec = int(_hex, 16)    # 16 means base-16 wich is hexadecimal
    print(bin(dec)[2:].rjust(4,"0"), end="")    # the [2:] skips 0b, and the 

回答 16

二进制版本的ABC123EFFF实际上是1010101111000001001001000111110111111111111

对于几乎所有应用程序,您都希望二进制版本的长度为4的倍数,且前导填充为0。

要在Python中获得此代码:

def hex_to_binary( hex_code ):
  bin_code = bin( hex_code )[2:]
  padding = (4-len(bin_code)%4)%4
  return '0'*padding + bin_code

范例1:

>>> hex_to_binary( 0xABC123EFFF )
'1010101111000001001000111110111111111111'

范例2:

>>> hex_to_binary( 0x7123 )
'0111000100100011'

请注意,这也适用于Micropython :)

The binary version of ABC123EFFF is actually 1010101111000001001000111110111111111111

For almost all applications you want the binary version to have a length that is a multiple of 4 with leading padding of 0s.

To get this in Python:

def hex_to_binary( hex_code ):
  bin_code = bin( hex_code )[2:]
  padding = (4-len(bin_code)%4)%4
  return '0'*padding + bin_code

Example 1:

>>> hex_to_binary( 0xABC123EFFF )
'1010101111000001001000111110111111111111'

Example 2:

>>> hex_to_binary( 0x7123 )
'0111000100100011'

Note that this also works in Micropython :)


回答 17

只需使用模块编码即可 (注意:我是该模块的作者)

您可以在那里将正十六进制转换为二进制。

  1. 使用pip安装
pip install coden
  1. 兑换
a_hexadecimal_number = "f1ff"
binary_output = coden.hex_to_bin(a_hexadecimal_number)

转换关键字为:

  • 十六进制的hexadeimal
  • 二进制
  • INT十进制
  • _to_-函数的转换关键字

因此,您还可以格式化:e。十六进制输出= bin_to_hex(a_binary_number)

Just use the module coden (note: I am the author of the module)

You can convert haxedecimal to binary there.

  1. Install using pip
pip install coden
  1. Convert
a_hexadecimal_number = "f1ff"
binary_output = coden.hex_to_bin(a_hexadecimal_number)

The converting Keywords are:

  • hex for hexadeimal
  • bin for binary
  • int for decimal
  • _to_ – the converting keyword for the function

So you can also format: e. hexadecimal_output = bin_to_hex(a_binary_number)


回答 18

HEX_TO_BINARY_CONVERSION_TABLE = {‘0’:’0000’,

                              '1': '0001',

                              '2': '0010',

                              '3': '0011',

                              '4': '0100',

                              '5': '0101',

                              '6': '0110',

                              '7': '0111',

                              '8': '1000',

                              '9': '1001',

                              'a': '1010',

                              'b': '1011',

                              'c': '1100',

                              'd': '1101',

                              'e': '1110',

                              'f': '1111'}

def hex_to_binary(hex_string):
    binary_string = ""
    for character in hex_string:
        binary_string += HEX_TO_BINARY_CONVERSION_TABLE[character]
    return binary_string

HEX_TO_BINARY_CONVERSION_TABLE = { ‘0’: ‘0000’,

                              '1': '0001',

                              '2': '0010',

                              '3': '0011',

                              '4': '0100',

                              '5': '0101',

                              '6': '0110',

                              '7': '0111',

                              '8': '1000',

                              '9': '1001',

                              'a': '1010',

                              'b': '1011',

                              'c': '1100',

                              'd': '1101',

                              'e': '1110',

                              'f': '1111'}

def hex_to_binary(hex_string):
    binary_string = ""
    for character in hex_string:
        binary_string += HEX_TO_BINARY_CONVERSION_TABLE[character]
    return binary_string

回答 19

a = raw_input('hex number\n')
length = len(a)
ab = bin(int(a, 16))[2:]
while len(ab)<(length * 4):
    ab = '0' + ab
print ab
a = raw_input('hex number\n')
length = len(a)
ab = bin(int(a, 16))[2:]
while len(ab)<(length * 4):
    ab = '0' + ab
print ab

回答 20

import binascii
hexa_input = input('Enter hex String to convert to Binary: ')
pad_bits=len(hexa_input)*4
Integer_output=int(hexa_input,16)
Binary_output= bin(Integer_output)[2:]. zfill(pad_bits)
print(Binary_output)
"""zfill(x) i.e. x no of 0 s to be padded left - Integers will overwrite 0 s
starting from right side but remaining 0 s will display till quantity x
[y:] where y is no of output chars which need to destroy starting from left"""
import binascii
hexa_input = input('Enter hex String to convert to Binary: ')
pad_bits=len(hexa_input)*4
Integer_output=int(hexa_input,16)
Binary_output= bin(Integer_output)[2:]. zfill(pad_bits)
print(Binary_output)
"""zfill(x) i.e. x no of 0 s to be padded left - Integers will overwrite 0 s
starting from right side but remaining 0 s will display till quantity x
[y:] where y is no of output chars which need to destroy starting from left"""

回答 21

no=raw_input("Enter your number in hexa decimal :")
def convert(a):
    if a=="0":
        c="0000"
    elif a=="1":
        c="0001"
    elif a=="2":
        c="0010"
    elif a=="3":
        c="0011"
    elif a=="4":
        c="0100"
    elif a=="5":
        c="0101"
    elif a=="6":
        c="0110"
    elif a=="7":
        c="0111"
    elif a=="8":
        c="1000"
    elif a=="9":
        c="1001"
    elif a=="A":
        c="1010"
    elif a=="B":
        c="1011"
    elif a=="C":
        c="1100"
    elif a=="D":
        c="1101"
    elif a=="E":
        c="1110"
    elif a=="F":
        c="1111"
    else:
        c="invalid"
    return c

a=len(no)
b=0
l=""
while b<a:
    l=l+convert(no[b])
    b+=1
print l
no=raw_input("Enter your number in hexa decimal :")
def convert(a):
    if a=="0":
        c="0000"
    elif a=="1":
        c="0001"
    elif a=="2":
        c="0010"
    elif a=="3":
        c="0011"
    elif a=="4":
        c="0100"
    elif a=="5":
        c="0101"
    elif a=="6":
        c="0110"
    elif a=="7":
        c="0111"
    elif a=="8":
        c="1000"
    elif a=="9":
        c="1001"
    elif a=="A":
        c="1010"
    elif a=="B":
        c="1011"
    elif a=="C":
        c="1100"
    elif a=="D":
        c="1101"
    elif a=="E":
        c="1110"
    elif a=="F":
        c="1111"
    else:
        c="invalid"
    return c

a=len(no)
b=0
l=""
while b<a:
    l=l+convert(no[b])
    b+=1
print l

查找字符串中子字符串的最后一次出现,将其替换

问题:查找字符串中子字符串的最后一次出现,将其替换

因此,我有一长串具有相同格式的字符串,并且我想找到最后一个“”。每个字符,然后将其替换为“。-”。我尝试使用rfind,但似乎无法正确利用它来执行此操作。

So I have a long list of strings in the same format, and I want to find the last “.” character in each one, and replace it with “. – “. I’ve tried using rfind, but I can’t seem to utilize it properly to do this.


回答 0

这应该做

old_string = "this is going to have a full stop. some written sstuff!"
k = old_string.rfind(".")
new_string = old_string[:k] + ". - " + old_string[k+1:]

This should do it

old_string = "this is going to have a full stop. some written sstuff!"
k = old_string.rfind(".")
new_string = old_string[:k] + ". - " + old_string[k+1:]

回答 1

要从右侧替换:

def replace_right(source, target, replacement, replacements=None):
    return replacement.join(source.rsplit(target, replacements))

正在使用:

>>> replace_right("asd.asd.asd.", ".", ". -", 1)
'asd.asd.asd. -'

To replace from the right:

def replace_right(source, target, replacement, replacements=None):
    return replacement.join(source.rsplit(target, replacements))

In use:

>>> replace_right("asd.asd.asd.", ".", ". -", 1)
'asd.asd.asd. -'

回答 2

我会使用正则表达式:

import re
new_list = [re.sub(r"\.(?=[^.]*$)", r". - ", s) for s in old_list]

I would use a regex:

import re
new_list = [re.sub(r"\.(?=[^.]*$)", r". - ", s) for s in old_list]

回答 3

一行代码是:

str=str[::-1].replace(".",".-",1)[::-1]

A one liner would be :

str=str[::-1].replace(".",".-",1)[::-1]


回答 4

您可以使用下面的函数来代替从右至右的单词的第一个出现。

def replace_from_right(text: str, original_text: str, new_text: str) -> str:
    """ Replace first occurrence of original_text by new_text. """
    return text[::-1].replace(original_text[::-1], new_text[::-1], 1)[::-1]

You can use the function below which replaces the first occurrence of the word from right.

def replace_from_right(text: str, original_text: str, new_text: str) -> str:
    """ Replace first occurrence of original_text by new_text. """
    return text[::-1].replace(original_text[::-1], new_text[::-1], 1)[::-1]

回答 5

a = "A long string with a . in the middle ending with ."

#如果要查找任何字符串的最后一次出现的索引,在我们的示例中,我们#将查找with的最后一次出现的索引

index = a.rfind("with") 

#结果将是44,因为索引从0开始。

a = "A long string with a . in the middle ending with ."

# if you want to find the index of the last occurrence of any string, In our case we #will find the index of the last occurrence of with

index = a.rfind("with") 

# the result will be 44, as index starts from 0.


回答 6

天真的方法:

a = "A long string with a . in the middle ending with ."
fchar = '.'
rchar = '. -'
a[::-1].replace(fchar, rchar[::-1], 1)[::-1]

Out[2]: 'A long string with a . in the middle ending with . -'

Aditya Sihag的回答只有一个rfind

pos = a.rfind('.')
a[:pos] + '. -' + a[pos+1:]

Naïve approach:

a = "A long string with a . in the middle ending with ."
fchar = '.'
rchar = '. -'
a[::-1].replace(fchar, rchar[::-1], 1)[::-1]

Out[2]: 'A long string with a . in the middle ending with . -'

Aditya Sihag’s answer with a single rfind:

pos = a.rfind('.')
a[:pos] + '. -' + a[pos+1:]

如何删除第一双和最后一个双引号?

问题:如何删除第一双和最后一个双引号?

我想从中删除双引号:

string = '"" " " ""\\1" " "" ""'

获得:

string = '" " " ""\\1" " "" "'

我试图用rstriplstripstrip('[^\"]|[\"$]')但没有奏效。

我怎样才能做到这一点?

I want to strip double quotes from:

string = '"" " " ""\\1" " "" ""'

to obtain:

string = '" " " ""\\1" " "" "'

I tried to use rstrip, lstrip and strip('[^\"]|[\"$]') but it did not work.

How can I do this?


回答 0

如果您要去除的引号总是像您所说的那样“第一和最后”,那么您可以简单地使用:

string = string[1:-1]

If the quotes you want to strip are always going to be “first and last” as you said, then you could simply use:

string = string[1:-1]


回答 1

如果您不能假设您处理的所有字符串都带有双引号,则可以使用如下所示的内容:

if string.startswith('"') and string.endswith('"'):
    string = string[1:-1]

编辑:

我确定您只是string在此处用作示例的变量名,并且在您的真实代码中它具有有用的名称,但是我不得不警告您string在标准库中有一个名为模块的模块。它不会自动加载,但是如果您曾经使用过,请import string确保您的变量不会使其黯然失色。

If you can’t assume that all the strings you process have double quotes you can use something like this:

if string.startswith('"') and string.endswith('"'):
    string = string[1:-1]

Edit:

I’m sure that you just used string as the variable name for exemplification here and in your real code it has a useful name, but I feel obliged to warn you that there is a module named string in the standard libraries. It’s not loaded automatically, but if you ever use import string make sure your variable doesn’t eclipse it.


回答 2

要删除第一个和最后符,并且仅在所讨论的字符为双引号的情况下,才分别删除:

import re

s = re.sub(r'^"|"$', '', s)

请注意,RE模式与您给定的RE模式不同,并且操作sub(用空格替换字符串)(“替代”)(strip是字符串方法,但与您的要求有所不同,如其他答案所示)。

To remove the first and last characters, and in each case do the removal only if the character in question is a double quote:

import re

s = re.sub(r'^"|"$', '', s)

Note that the RE pattern is different than the one you had given, and the operation is sub (“substitute”) with an empty replacement string (strip is a string method but does something pretty different from your requirements, as other answers have indicated).


回答 3

重要说明:我正在扩展问题/答案以去除单引号或双引号。我将问题解释为必须同时存在两个引号,并且两个引号都必须匹配才能执行删除操作。否则,字符串将保持不变。

要“取消引用”字符串表示形式,该字符串表示形式周围可能带有单引号或双引号(这是@tgray答案的扩展):

def dequote(s):
    """
    If a string has single or double quotes around it, remove them.
    Make sure the pair of quotes match.
    If a matching pair of quotes is not found, return the string unchanged.
    """
    if (s[0] == s[-1]) and s.startswith(("'", '"')):
        return s[1:-1]
    return s

说明:

startswith可以取一个元组,以匹配多个备选方案中的任何一个。的原因倍增括号(())是使得我们通过一个参数("'", '"')startswith(),以指定允许的前缀,而不是两个参数"'"'"',这将被解释为一个前缀和(无效的)的开始位置。

s[-1] 是字符串中的最后符。

测试:

print( dequote("\"he\"l'lo\"") )
print( dequote("'he\"l'lo'") )
print( dequote("he\"l'lo") )
print( dequote("'he\"l'lo\"") )

=>

he"l'lo
he"l'lo
he"l'lo
'he"l'lo"

(对我来说,正则表达式不是显而易见的,所以我没有尝试扩展@Alex的答案。)

IMPORTANT: I’m extending the question/answer to strip either single or double quotes. And I interpret the question to mean that BOTH quotes must be present, and matching, to perform the strip. Otherwise, the string is returned unchanged.

To “dequote” a string representation, that might have either single or double quotes around it (this is an extension of @tgray’s answer):

def dequote(s):
    """
    If a string has single or double quotes around it, remove them.
    Make sure the pair of quotes match.
    If a matching pair of quotes is not found, return the string unchanged.
    """
    if (s[0] == s[-1]) and s.startswith(("'", '"')):
        return s[1:-1]
    return s

Explanation:

startswith can take a tuple, to match any of several alternatives. The reason for the DOUBLED parentheses (( and )) is so that we pass ONE parameter ("'", '"') to startswith(), to specify the permitted prefixes, rather than TWO parameters "'" and '"', which would be interpreted as a prefix and an (invalid) start position.

s[-1] is the last character in the string.

Testing:

print( dequote("\"he\"l'lo\"") )
print( dequote("'he\"l'lo'") )
print( dequote("he\"l'lo") )
print( dequote("'he\"l'lo\"") )

=>

he"l'lo
he"l'lo
he"l'lo
'he"l'lo"

(For me, regex expressions are non-obvious to read, so I didn’t try to extend @Alex’s answer.)


回答 4

如果字符串始终如您所显示:

string[1:-1]

If string is always as you show:

string[1:-1]

回答 5

快完成了 引用http://docs.python.org/library/stdtypes.html?highlight=strip#str.strip

chars参数是一个字符串,指定要删除的字符集。

[…]

chars参数不是前缀或后缀;而是删除其值的所有组合:

因此,该参数不是正则表达式。

>>> string = '"" " " ""\\1" " "" ""'
>>> string.strip('"')
' " " ""\\1" " "" '
>>> 

注意,这并不是您所要求的,因为它在字符串的两端都使用了多个引号!

Almost done. Quoting from http://docs.python.org/library/stdtypes.html?highlight=strip#str.strip

The chars argument is a string specifying the set of characters to be removed.

[…]

The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped:

So the argument is not a regexp.

>>> string = '"" " " ""\\1" " "" ""'
>>> string.strip('"')
' " " ""\\1" " "" '
>>> 

Note, that this is not exactly what you requested, because it eats multiple quotes from both end of the string!


回答 6

如果您确定要删除的开头和结尾处有一个“,请执行以下操作:

string = string[1:len(string)-1]

要么

string = string[1:-1]

If you are sure there is a ” at the beginning and at the end, which you want to remove, just do:

string = string[1:len(string)-1]

or

string = string[1:-1]

回答 7

从字符串的开头和结尾删除确定的字符串。

s = '""Hello World""'
s.strip('""')

> 'Hello World'

Remove a determinated string from start and end from a string.

s = '""Hello World""'
s.strip('""')

> 'Hello World'

回答 8

我有一些代码需要去除单引号或双引号,而我不能简单地ast.literal_eval。

if len(arg) > 1 and arg[0] in ('"\'') and arg[-1] == arg[0]:
    arg = arg[1:-1]

这类似于ToolmakerSteve的答案,但是它允许长度为0的字符串,并且不能将单个字符"转换为空字符串。

I have some code that needs to strip single or double quotes, and I can’t simply ast.literal_eval it.

if len(arg) > 1 and arg[0] in ('"\'') and arg[-1] == arg[0]:
    arg = arg[1:-1]

This is similar to ToolmakerSteve’s answer, but it allows 0 length strings, and doesn’t turn the single character " into an empty string.


回答 9

在您的示例中,您可以使用地带,但必须提供空间

string = '"" " " ""\\1" " "" ""'
string.strip('" ')  # output '\\1'

注意输出中的\是字符串输出的标准python引号

您的变量的值为’\\ 1′

in your example you could use strip but you have to provide the space

string = '"" " " ""\\1" " "" ""'
string.strip('" ')  # output '\\1'

note the \’ in the output is the standard python quotes for string output

the value of your variable is ‘\\1’


回答 10

下面的函数将去除空的spces并返回不带引号的字符串。如果没有引号,它将返回相同的字符串(剥离)

def removeQuote(str):
str = str.strip()
if re.search("^[\'\"].*[\'\"]$",str):
    str = str[1:-1]
    print("Removed Quotes",str)
else:
    print("Same String",str)
return str

Below function will strip the empty spces and return the strings without quotes. If there are no quotes then it will return same string(stripped)

def removeQuote(str):
str = str.strip()
if re.search("^[\'\"].*[\'\"]$",str):
    str = str[1:-1]
    print("Removed Quotes",str)
else:
    print("Same String",str)
return str

回答 11

从开始Python 3.9,您可以使用removeprefixremovesuffix

'"" " " ""\\1" " "" ""'.removeprefix('"').removesuffix('"')
# '" " " ""\\1" " "" "'

Starting in Python 3.9, you can use removeprefix and removesuffix:

'"" " " ""\\1" " "" ""'.removeprefix('"').removesuffix('"')
# '" " " ""\\1" " "" "'

回答 12

在字符串中找到第一个和最后一个“的位置

>>> s = '"" " " ""\\1" " "" ""'
>>> l = s.find('"')
>>> r = s.rfind('"')

>>> s[l+1:r]
'" " " ""\\1" " "" "'

find the position of the first and the last ” in your string

>>> s = '"" " " ""\\1" " "" ""'
>>> l = s.find('"')
>>> r = s.rfind('"')

>>> s[l+1:r]
'" " " ""\\1" " "" "'

如何在字符串Python中获取:之前的所有内容

问题:如何在字符串Python中获取:之前的所有内容

我正在寻找一种方法来在:之前获取字符串中的所有字母,但是我不知道从哪里开始。我会使用正则表达式吗?如果可以,怎么办?

string = "Username: How are you today?"

有人可以给我示范我可以做什么吗?

I am looking for a way to get all of the letters in a string before a : but I have no idea on where to start. Would I use regex? If so how?

string = "Username: How are you today?"

Can someone show me a example on what I could do?


回答 0

只需使用该split功能。它返回一个列表,因此您可以保留第一个元素:

>>> s1.split(':')
['Username', ' How are you today?']
>>> s1.split(':')[0]
'Username'

Just use the split function. It returns a list, so you can keep the first element:

>>> s1.split(':')
['Username', ' How are you today?']
>>> s1.split(':')[0]
'Username'

回答 1

使用index

>>> string = "Username: How are you today?"
>>> string[:string.index(":")]
'Username'

该索引将给您以下位置 :在字符串中,然后可以对其进行切片。

如果要使用正则表达式:

>>> import re
>>> re.match("(.*?):",string).group()
'Username'                       

match 从字符串开头开始匹配。

你也可以使用 itertools.takewhile

>>> import itertools
>>> "".join(itertools.takewhile(lambda x: x!=":", string))
'Username'

Using index:

>>> string = "Username: How are you today?"
>>> string[:string.index(":")]
'Username'

The index will give you the position of : in string, then you can slice it.

If you want to use regex:

>>> import re
>>> re.match("(.*?):",string).group()
'Username'                       

match matches from the start of the string.

you can also use itertools.takewhile

>>> import itertools
>>> "".join(itertools.takewhile(lambda x: x!=":", string))
'Username'

回答 2

你不需要regex这个

>>> s = "Username: How are you today?"

您可以使用split方法拆分的字符串':'的字符

>>> s.split(':')
['Username', ' How are you today?']

并切出元素[0]以获得字符串的第一部分

>>> s.split(':')[0]
'Username'

You don’t need regex for this

>>> s = "Username: How are you today?"

You can use the split method to split the string on the ':' character

>>> s.split(':')
['Username', ' How are you today?']

And slice out element [0] to get the first part of the string

>>> s.split(':')[0]
'Username'

回答 3

我已经在Python 3.7.0(IPython)下对这些各种技术进行了基准测试。

TLDR

  • 最快(当分割符号 c已知):预编译的正则表达式。
  • 最快(否则): s.partition(c)[0]
  • 安全(即何时c可能不在s):分区,拆分。
  • 不安全:索引,正则表达式。

import string, random, re

SYMBOLS = string.ascii_uppercase + string.digits
SIZE = 100

def create_test_set(string_length):
    for _ in range(SIZE):
        random_string = ''.join(random.choices(SYMBOLS, k=string_length))
        yield (random.choice(random_string), random_string)

for string_length in (2**4, 2**8, 2**16, 2**32):
    print("\nString length:", string_length)
    print("  regex (compiled):", end=" ")
    test_set_for_regex = ((re.compile("(.*?)" + c).match, s) for (c, s) in test_set)
    %timeit [re_match(s).group() for (re_match, s) in test_set_for_regex]
    test_set = list(create_test_set(16))
    print("  partition:       ", end=" ")
    %timeit [s.partition(c)[0] for (c, s) in test_set]
    print("  index:           ", end=" ")
    %timeit [s[:s.index(c)] for (c, s) in test_set]
    print("  split (limited): ", end=" ")
    %timeit [s.split(c, 1)[0] for (c, s) in test_set]
    print("  split:           ", end=" ")
    %timeit [s.split(c)[0] for (c, s) in test_set]
    print("  regex:           ", end=" ")
    %timeit [re.match("(.*?)" + c, s).group() for (c, s) in test_set]

结果

String length: 16
  regex (compiled): 156 ns ± 4.41 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
  partition:        19.3 µs ± 430 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  index:            26.1 µs ± 341 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split (limited):  26.8 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split:            26.3 µs ± 835 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  regex:            128 µs ± 4.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

String length: 256
  regex (compiled): 167 ns ± 2.7 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
  partition:        20.9 µs ± 694 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  index:            28.6 µs ± 2.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split (limited):  27.4 µs ± 979 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split:            31.5 µs ± 4.86 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  regex:            148 µs ± 7.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

String length: 65536
  regex (compiled): 173 ns ± 3.95 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
  partition:        20.9 µs ± 613 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  index:            27.7 µs ± 515 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split (limited):  27.2 µs ± 796 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split:            26.5 µs ± 377 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  regex:            128 µs ± 1.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

String length: 4294967296
  regex (compiled): 165 ns ± 1.2 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
  partition:        19.9 µs ± 144 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  index:            27.7 µs ± 571 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split (limited):  26.1 µs ± 472 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split:            28.1 µs ± 1.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  regex:            137 µs ± 6.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

I have benchmarked these various technics under Python 3.7.0 (IPython).

TLDR

  • fastest (when the split symbol c is known): pre-compiled regex.
  • fastest (otherwise): s.partition(c)[0].
  • safe (i.e., when c may not be in s): partition, split.
  • unsafe: index, regex.

Code

import string, random, re

SYMBOLS = string.ascii_uppercase + string.digits
SIZE = 100

def create_test_set(string_length):
    for _ in range(SIZE):
        random_string = ''.join(random.choices(SYMBOLS, k=string_length))
        yield (random.choice(random_string), random_string)

for string_length in (2**4, 2**8, 2**16, 2**32):
    print("\nString length:", string_length)
    print("  regex (compiled):", end=" ")
    test_set_for_regex = ((re.compile("(.*?)" + c).match, s) for (c, s) in test_set)
    %timeit [re_match(s).group() for (re_match, s) in test_set_for_regex]
    test_set = list(create_test_set(16))
    print("  partition:       ", end=" ")
    %timeit [s.partition(c)[0] for (c, s) in test_set]
    print("  index:           ", end=" ")
    %timeit [s[:s.index(c)] for (c, s) in test_set]
    print("  split (limited): ", end=" ")
    %timeit [s.split(c, 1)[0] for (c, s) in test_set]
    print("  split:           ", end=" ")
    %timeit [s.split(c)[0] for (c, s) in test_set]
    print("  regex:           ", end=" ")
    %timeit [re.match("(.*?)" + c, s).group() for (c, s) in test_set]

Results

String length: 16
  regex (compiled): 156 ns ± 4.41 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
  partition:        19.3 µs ± 430 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  index:            26.1 µs ± 341 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split (limited):  26.8 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split:            26.3 µs ± 835 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  regex:            128 µs ± 4.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

String length: 256
  regex (compiled): 167 ns ± 2.7 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
  partition:        20.9 µs ± 694 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  index:            28.6 µs ± 2.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split (limited):  27.4 µs ± 979 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split:            31.5 µs ± 4.86 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  regex:            148 µs ± 7.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

String length: 65536
  regex (compiled): 173 ns ± 3.95 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
  partition:        20.9 µs ± 613 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  index:            27.7 µs ± 515 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split (limited):  27.2 µs ± 796 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split:            26.5 µs ± 377 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  regex:            128 µs ± 1.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

String length: 4294967296
  regex (compiled): 165 ns ± 1.2 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
  partition:        19.9 µs ± 144 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  index:            27.7 µs ± 571 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split (limited):  26.1 µs ± 472 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split:            28.1 µs ± 1.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  regex:            137 µs ± 6.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

回答 4

为此,partition()可能比split()更好,因为在没有定界符或更多定界符的情况下,它具有更好的可预测结果。

partition() may be better then split() for this purpose as it has the better predicable results for situations you have no delimiter or more delimiters.


Python str与unicode类型

问题:Python str与unicode类型

使用Python 2.7,我想知道使用type unicode代替真正的优势是什么str,因为它们似乎都能够容纳Unicode字符串。除了能够unicode使用转义字符在字符串中设置Unicode代码之外,还有什么特殊的原因\吗?:

使用以下命令执行模块:

# -*- coding: utf-8 -*-

a = 'á'
ua = u'á'
print a, ua

结果:á,á

编辑:

使用Python Shell进行更多测试:

>>> a = 'á'
>>> a
'\xc3\xa1'
>>> ua = u'á'
>>> ua
u'\xe1'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> ua
u'\xe1'

因此,该unicode字符串似乎是使用latin1而不是编码的utf-8,而原始字符串是使用utf-8?编码的 我现在更加困惑!:S

Working with Python 2.7, I’m wondering what real advantage there is in using the type unicode instead of str, as both of them seem to be able to hold Unicode strings. Is there any special reason apart from being able to set Unicode codes in unicode strings using the escape char \?:

Executing a module with:

# -*- coding: utf-8 -*-

a = 'á'
ua = u'á'
print a, ua

Results in: á, á

EDIT:

More testing using Python shell:

>>> a = 'á'
>>> a
'\xc3\xa1'
>>> ua = u'á'
>>> ua
u'\xe1'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> ua
u'\xe1'

So, the unicode string seems to be encoded using latin1 instead of utf-8 and the raw string is encoded using utf-8? I’m even more confused now! :S


回答 0

unicode用于处理文本。文本是一个代码点序列,可能大于一个字节。文本可以被编码在一个特定的编码来表示文本作为原始字节(例如utf-8latin-1…)。

注意,这unicode 是没有编码的!python使用的内部表示形式是实现细节,只要它能够表示所需的代码点,您就不必在意它。

相反,str在Python 2中是字节的简单序列。它不代表文字!

您可以将其unicode视为某些文本的一般表示形式,可以用许多不同的方式将其编码为通过表示的二进制数据序列str

注意:在Python 3中,unicode已重命名为,str并且bytes为普通字节序列提供了一种新类型。

您可以看到一些差异:

>>> len(u'à')  # a single code point
1
>>> len('à')   # by default utf-8 -> takes two bytes
2
>>> len(u'à'.encode('utf-8'))
2
>>> len(u'à'.encode('latin1'))  # in latin1 it takes one byte
1
>>> print u'à'.encode('utf-8')  # terminal encoding is utf-8
à
>>> print u'à'.encode('latin1') # it cannot understand the latin1 byte

请注意,使用时,str您可以控制特定编码表示形式的单个字节,而使用时unicode,则只能在代码点级别进行控制。例如,您可以执行以下操作:

>>> 'àèìòù'
'\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
>>> print 'àèìòù'.replace('\xa8', '')
à�ìòù

以前是有效的UTF-8,现在已经不复存在了。使用unicode字符串,您不能以结果字符串不是有效的unicode文本的方式进行操作。您可以删除代码点,将代码点替换为其他代码点等,但不能与内部表示混淆。

unicode is meant to handle text. Text is a sequence of code points which may be bigger than a single byte. Text can be encoded in a specific encoding to represent the text as raw bytes(e.g. utf-8, latin-1…).

Note that unicode is not encoded! The internal representation used by python is an implementation detail, and you shouldn’t care about it as long as it is able to represent the code points you want.

On the contrary str in Python 2 is a plain sequence of bytes. It does not represent text!

You can think of unicode as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str.

Note: In Python 3, unicode was renamed to str and there is a new bytes type for a plain sequence of bytes.

Some differences that you can see:

>>> len(u'à')  # a single code point
1
>>> len('à')   # by default utf-8 -> takes two bytes
2
>>> len(u'à'.encode('utf-8'))
2
>>> len(u'à'.encode('latin1'))  # in latin1 it takes one byte
1
>>> print u'à'.encode('utf-8')  # terminal encoding is utf-8
à
>>> print u'à'.encode('latin1') # it cannot understand the latin1 byte
�

Note that using str you have a lower-level control on the single bytes of a specific encoding representation, while using unicode you can only control at the code-point level. For example you can do:

>>> 'àèìòù'
'\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
>>> print 'àèìòù'.replace('\xa8', '')
à�ìòù

What before was valid UTF-8, isn’t anymore. Using a unicode string you cannot operate in such a way that the resulting string isn’t valid unicode text. You can remove a code point, replace a code point with a different code point etc. but you cannot mess with the internal representation.


回答 1

Unicode和编码是完全不同的,无关的东西。

统一码

为每个字符分配一个数字ID:

  • 0x41→A
  • 0xE1→á
  • 0x414→Д

因此,Unicode将数字0x41分配给A,将0xE1分配给á,将0x414分配给Д。

即使是我使用的小箭头也有其Unicode数字,即0x2192。甚至表情符号都有其Unicode数字,😂是0x1F602。

您可以在此表中查找所有字符的Unicode数字。特别是,你可以找到上面的前三个字符这里,箭头在这里,和表情符号在这里

这些由Unicode分配给所有字符的数字称为代码点

所有这些的目的是提供一种明确地引用每个字符的方法。例如,如果我在谈论😂,而不是说“你知道,这笑着哭的表情含泪”,我可以说Unicode代码点0x1F602。比较容易,对吧?

请注意,Unicode代码点通常使用前导格式U+,然后将十六进制数字值填充为至少4位数字。因此,以上示例为U + 0041,U + 00E1,U + 0414,U + 2192,U + 1F602。

Unicode代码点的范围从U + 0000到U + 10FFFF。那是1,114,112数字。这些数字中的2048个用于代理,因此,剩下1,112,064。这意味着,Unicode可以为1,112,064个不同的字符分配唯一的ID(代码点)。尚未将所有这些代码点都分配给一个字符,并且Unicode不断扩展(例如,当引入新的表情符号时)。

要记住的重要一点是,所有Unicode所做的就是为每个字符分配一个称为代码点的数字ID,以便于进行明确的引用。

编码方式

将字符映射到位模式。

这些位模式用于表示计算机内存或磁盘上的字符。

有许多不同的编码覆盖了字符的不同子集。在说英语的世界中,最常见的编码如下:

ASCII码

128个字符(代码点U + 0000到U + 007F)映射到长度为7的位模式。

例:

  • a→1100001(0x61)

您可以在此表中看到所有映射。

ISO 8859-1(又名Latin-1)

191个字符(代码点U + 0020到U + 007E和U + 00A0到U + 00FF)映射到长度为8的位模式。

例:

  • a→01100001(0x61)
  • á→11100001(0xE1)

您可以在此表中看到所有映射。

UTF-8

1,112,064个字符(所有现有的Unicode代码点)映射到长度为8、16、24或32位(即1、2、3或4个字节)的位模式。

例:

  • a→01100001(0x61)
  • á→11000011 10100001(0xC3 0xA1)
  • ≠→11100010 10001001 10100000(0xE2 0x89 0xA0)
  • 😂→11110000 10011111 10011000 10000010(0xF0 0x9F 0x98 0x82)

UTF-8将字符编码为位字符串的方式在此处得到了很好的描述。

Unicode和编码

通过上面的示例,可以清楚地了解Unicode是如何有用的。

例如,如果我是Latin-1,并且想解释一下á的编码,则无需说:

“我用aigu(或您将其称为上升条)编码为11100001”

但是我只能说:

“我将U + 00E1编码为11100001”

如果我是UTF-8,我可以说:

“我又将U + 00E1编码为11000011 10100001”

每个人都清楚知道我们指的是哪个角色。

现在到经常出现的混乱

的确,如果将编码的位模式解释为二进制数,则有时与此字符的Unicode代码点相同。

例如:

  • ASCII编码一个为1100001,您可以解释为十六进制数0x61,和的Unicode代码点一个U + 0061
  • Latin-1将á编码为11100001,可以将其解释为十六进制数字0xE1,而á的Unicode代码点是U + 00E1

当然,为了方便起见,已经对此进行了安排。但是您应该将其视为纯粹的巧合。用于表示内存中字符的位模式与该字符的Unicode代码点没有任何关联。

甚至没人说您必须将11100001之类的字符串解释为二进制数。只需将其视为Latin-1用来编码字符á的位序列即可。

回到您的问题

您的Python解释器使用的编码为UTF-8

这是您的示例中发生的事情:

例子1

以下代码以UTF-8编码字符á。这将产生位字符串11000011 10100001,该位字符串将保存在变量中a

>>> a = 'á'

当您查看的值时a,其内容11000011 10100001的格式设置为十六进制数字0xC3 0xA1,并输出为'\xc3\xa1'

>>> a
'\xc3\xa1'

例子2

下面的代码将á的Unicode代码点U + 00E1保存在变量中ua(我们不知道Python内部使用哪种数据格式在内存中表示代码点U + 00E1,这对我们来说并不重要):

>>> ua = u'á'

当您查看的值时ua,Python会告诉您它包含代码点U + 00E1:

>>> ua
u'\xe1'

例子3

以下代码使用UTF-8对Unicode代码点U + 00E1(表示字符á)进行编码,结果得到位模式1100001110100001。同样,对于输出,该位模式也表示为十六进制数字0xC3 0xA1:

>>> ua.encode('utf-8')
'\xc3\xa1'

例子4

下面的代码使用Latin-1对Unicode代码点U + 00E1(表示字符á)进行编码,从而得到位模式11100001。对于输出,该位模式表示为十六进制数字0xE1,巧合的是,其与初始字符相同。码点U + 00E1:

>>> ua.encode('latin1')
'\xe1'

Unicode对象ua和Latin-1编码之间没有关系。á的代码点为U + 00E1,而á的Latin-1编码为0xE1(如果将编码的位模式解释为二进制数)纯属巧合。

Unicode and encodings are completely different, unrelated things.

Unicode

Assigns a numeric ID to each character:

  • 0x41 → A
  • 0xE1 → á
  • 0x414 → Д

So, Unicode assigns the number 0x41 to A, 0xE1 to á, and 0x414 to Д.

Even the little arrow → I used has its Unicode number, it’s 0x2192. And even emojis have their Unicode numbers, 😂 is 0x1F602.

You can look up the Unicode numbers of all characters in this table. In particular, you can find the first three characters above here, the arrow here, and the emoji here.

These numbers assigned to all characters by Unicode are called code points.

The purpose of all this is to provide a means to unambiguously refer to a each character. For example, if I’m talking about 😂, instead of saying “you know, this laughing emoji with tears”, I can just say, Unicode code point 0x1F602. Easier, right?

Note that Unicode code points are usually formatted with a leading U+, then the hexadecimal numeric value padded to at least 4 digits. So, the above examples would be U+0041, U+00E1, U+0414, U+2192, U+1F602.

Unicode code points range from U+0000 to U+10FFFF. That is 1,114,112 numbers. 2048 of these numbers are used for surrogates, thus, there remain 1,112,064. This means, Unicode can assign a unique ID (code point) to 1,112,064 distinct characters. Not all of these code points are assigned to a character yet, and Unicode is extended continuously (for example, when new emojis are introduced).

The important thing to remember is that all Unicode does is to assign a numerical ID, called code point, to each character for easy and unambiguous reference.

Encodings

Map characters to bit patterns.

These bit patterns are used to represent the characters in computer memory or on disk.

There are many different encodings that cover different subsets of characters. In the English-speaking world, the most common encodings are the following:

ASCII

Maps 128 characters (code points U+0000 to U+007F) to bit patterns of length 7.

Example:

  • a → 1100001 (0x61)

You can see all the mappings in this table.

ISO 8859-1 (aka Latin-1)

Maps 191 characters (code points U+0020 to U+007E and U+00A0 to U+00FF) to bit patterns of length 8.

Example:

  • a → 01100001 (0x61)
  • á → 11100001 (0xE1)

You can see all the mappings in this table.

UTF-8

Maps 1,112,064 characters (all existing Unicode code points) to bit patterns of either length 8, 16, 24, or 32 bits (that is, 1, 2, 3, or 4 bytes).

Example:

  • a → 01100001 (0x61)
  • á → 11000011 10100001 (0xC3 0xA1)
  • ≠ → 11100010 10001001 10100000 (0xE2 0x89 0xA0)
  • 😂 → 11110000 10011111 10011000 10000010 (0xF0 0x9F 0x98 0x82)

The way UTF-8 encodes characters to bit strings is very well described here.

Unicode and Encodings

Looking at the above examples, it becomes clear how Unicode is useful.

For example, if I’m Latin-1 and I want to explain my encoding of á, I don’t need to say:

“I encode that a with an aigu (or however you call that rising bar) as 11100001”

But I can just say:

“I encode U+00E1 as 11100001”

And if I’m UTF-8, I can say:

“Me, in turn, I encode U+00E1 as 11000011 10100001”

And it’s unambiguously clear to everybody which character we mean.

Now to the often arising confusion

It’s true that sometimes the bit pattern of an encoding, if you interpret it as a binary number, is the same as the Unicode code point of this character.

For example:

  • ASCII encodes a as 1100001, which you can interpret as the hexadecimal number 0x61, and the Unicode code point of a is U+0061.
  • Latin-1 encodes á as 11100001, which you can interpret as the hexadecimal number 0xE1, and the Unicode code point of á is U+00E1.

Of course, this has been arranged like this on purpose for convenience. But you should look at it as a pure coincidence. The bit pattern used to represent a character in memory is not tied in any way to the Unicode code point of this character.

Nobody even says that you have to interpret a bit string like 11100001 as a binary number. Just look at it as the sequence of bits that Latin-1 uses to encode the character á.

Back to your question

The encoding used by your Python interpreter is UTF-8.

Here’s what’s going on in your examples:

Example 1

The following encodes the character á in UTF-8. This results in the bit string 11000011 10100001, which is saved in the variable a.

>>> a = 'á'

When you look at the value of a, its content 11000011 10100001 is formatted as the hex number 0xC3 0xA1 and output as '\xc3\xa1':

>>> a
'\xc3\xa1'

Example 2

The following saves the Unicode code point of á, which is U+00E1, in the variable ua (we don’t know which data format Python uses internally to represent the code point U+00E1 in memory, and it’s unimportant to us):

>>> ua = u'á'

When you look at the value of ua, Python tells you that it contains the code point U+00E1:

>>> ua
u'\xe1'

Example 3

The following encodes Unicode code point U+00E1 (representing character á) with UTF-8, which results in the bit pattern 11000011 10100001. Again, for output this bit pattern is represented as the hex number 0xC3 0xA1:

>>> ua.encode('utf-8')
'\xc3\xa1'

Example 4

The following encodes Unicode code point U+00E1 (representing character á) with Latin-1, which results in the bit pattern 11100001. For output, this bit pattern is represented as the hex number 0xE1, which by coincidence is the same as the initial code point U+00E1:

>>> ua.encode('latin1')
'\xe1'

There’s no relation between the Unicode object ua and the Latin-1 encoding. That the code point of á is U+00E1 and the Latin-1 encoding of á is 0xE1 (if you interpret the bit pattern of the encoding as a binary number) is a pure coincidence.


回答 2

您的终端碰巧已配置为UTF-8。

印刷的事实a是一个巧合;您正在将原始UTF-8字节写入终端。a是长度的值2,含有两个字节,十六进制值C3和A1,而ua是长度的Unicode值一个,包含一个编码点U + 00E1。

这种长度上的差异是使用Unicode值的主要原因之一。您无法轻松地测量字节字符串中文字符的数量;该len()字节串的告诉你有多少字节被使用,许多字符没有编码方式。

你可以看到差异,当你编码的Unicode值到不同的输出编码:

>>> a = 'á'
>>> ua = u'á'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> a
'\xc3\xa1'

请注意,Unicode标准的前256个代码点与Latin 1标准匹配,因此U + 00E1代码点被编码为Latin 1作为具有十六进制值E1的字节。

此外,Python同样在unicode和字节字符串的表示形式中使用转义码,并且使用\x..转义值表示不可打印ASCII的低代码点。这就是为什么一个Unicode字符串,128米255的外观之间的代码点只是喜欢拉丁1个编码。如果您的unicode字符串的代码点超过U + 00FF,\u....则使用不同的转义序列,并使用四位十六进制值。

看来您尚未完全了解Unicode和编码之间的区别。在继续之前,请务必阅读以下文章:

Your terminal happens to be configured to UTF-8.

The fact that printing a works is a coincidence; you are writing raw UTF-8 bytes to the terminal. a is a value of length two, containing two bytes, hex values C3 and A1, while ua is a unicode value of length one, containing a codepoint U+00E1.

This difference in length is one major reason to use Unicode values; you cannot easily measure the number of text characters in a byte string; the len() of a byte string tells you how many bytes were used, not how many characters were encoded.

You can see the difference when you encode the unicode value to different output encodings:

>>> a = 'á'
>>> ua = u'á'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> a
'\xc3\xa1'

Note that the first 256 codepoints of the Unicode standard match the Latin 1 standard, so the U+00E1 codepoint is encoded to Latin 1 as a byte with hex value E1.

Furthermore, Python uses escape codes in representations of unicode and byte strings alike, and low code points that are not printable ASCII are represented using \x.. escape values as well. This is why a Unicode string with a code point between 128 and 255 looks just like the Latin 1 encoding. If you have a unicode string with codepoints beyond U+00FF a different escape sequence, \u.... is used instead, with a four-digit hex value.

It looks like you don’t yet fully understand what the difference is between Unicode and an encoding. Please do read the following articles before you continue:


回答 3

当您将a定义为unicode时,字符a和á相等。否则,á被视为两个字符。尝试len(a)和len(au)。除此之外,在其他环境中工作时可能需要编码。例如,如果您使用md5,则a和ua的值将不同

When you define a as unicode, the chars a and á are equal. Otherwise á counts as two chars. Try len(a) and len(au). In addition to that, you may need to have the encoding when you work with other environments. For example if you use md5, you get different values for a and ua


在Python中使用换行符分隔字符串

问题:在Python中使用换行符分隔字符串

我需要定界包含新行的字符串。我将如何实现?请参考下面的代码。

输入:

data = """a,b,c
d,e,f
g,h,i
j,k,l"""

所需的输出:

['a,b,c', 'd,e,f', 'g,h,i', 'j,k,l']

我尝试了以下方法:

1. output = data.split('\n')
2. output = data.split('/n')
3. output = data.rstrip().split('\n')

I need to delimit the string which has new line in it. How would I achieve it? Please refer below code.

Input:

data = """a,b,c
d,e,f
g,h,i
j,k,l"""

Output desired:

['a,b,c', 'd,e,f', 'g,h,i', 'j,k,l']

I have tried the below approaches:

1. output = data.split('\n')
2. output = data.split('/n')
3. output = data.rstrip().split('\n')

回答 0

str.splitlines 方法应该为您提供确切的信息。

>>> data = """a,b,c
... d,e,f
... g,h,i
... j,k,l"""
>>> data.splitlines()
['a,b,c', 'd,e,f', 'g,h,i', 'j,k,l']

str.splitlines method should give you exactly that.

>>> data = """a,b,c
... d,e,f
... g,h,i
... j,k,l"""
>>> data.splitlines()
['a,b,c', 'd,e,f', 'g,h,i', 'j,k,l']

回答 1

data = """a,b,c
d,e,f
g,h,i
j,k,l"""

print(data.split())       # ['a,b,c', 'd,e,f', 'g,h,i', 'j,k,l']

str.split,默认情况下,按所有空白字符分割。如果实际的字符串还有其他空格字符,则可能要使用

print(data.split("\n"))   # ['a,b,c', 'd,e,f', 'g,h,i', 'j,k,l']

或者按照评论中的@Ashwini Chaudhary的建议,可以使用

print(data.splitlines())
data = """a,b,c
d,e,f
g,h,i
j,k,l"""

print(data.split())       # ['a,b,c', 'd,e,f', 'g,h,i', 'j,k,l']

str.split, by default, splits by all the whitespace characters. If the actual string has any other whitespace characters, you might want to use

print(data.split("\n"))   # ['a,b,c', 'd,e,f', 'g,h,i', 'j,k,l']

Or as @Ashwini Chaudhary suggested in the comments, you can use

print(data.splitlines())

回答 2

如果只想用换行符分割,最好使用splitlines()

例:

>>> data = """a,b,c
... d,e,f
... g,h,i
... j,k,l"""
>>> data
'a,b,c\nd,e,f\ng,h,i\nj,k,l'
>>> data.splitlines()
['a,b,c', 'd,e,f', 'g,h,i', 'j,k,l']

使用split()它也可以:

>>> data = """a,b,c
... d,e,f
... g,h,i
... j,k,l"""
>>> data
'a,b,c\nd,e,f\ng,h,i\nj,k,l'
>>> data.split()
['a,b,c', 'd,e,f', 'g,h,i', 'j,k,l']

然而:

>>> data = """
... a, eqw, qwe
... v, ewr, err
... """
>>> data
'\na, eqw, qwe\nv, ewr, err\n'
>>> data.split()
['a,', 'eqw,', 'qwe', 'v,', 'ewr,', 'err']

If you want to split only by newlines, you can use str.splitlines():

Example:

>>> data = """a,b,c
... d,e,f
... g,h,i
... j,k,l"""
>>> data
'a,b,c\nd,e,f\ng,h,i\nj,k,l'
>>> data.splitlines()
['a,b,c', 'd,e,f', 'g,h,i', 'j,k,l']

With str.split() your case also works:

>>> data = """a,b,c
... d,e,f
... g,h,i
... j,k,l"""
>>> data
'a,b,c\nd,e,f\ng,h,i\nj,k,l'
>>> data.split()
['a,b,c', 'd,e,f', 'g,h,i', 'j,k,l']

However if you have spaces (or tabs) it will fail:

>>> data = """
... a, eqw, qwe
... v, ewr, err
... """
>>> data
'\na, eqw, qwe\nv, ewr, err\n'
>>> data.split()
['a,', 'eqw,', 'qwe', 'v,', 'ewr,', 'err']

回答 3

有专门用于此目的的方法:

data.splitlines()
['a,b,c', 'd,e,f', 'g,h,i', 'j,k,l']

There is a method specifically for this purpose:

data.splitlines()
['a,b,c', 'd,e,f', 'g,h,i', 'j,k,l']

回答 4

干得好:

>>> data = """a,b,c
d,e,f
g,h,i
j,k,l"""
>>> data.split()  # split automatically splits through \n and space
['a,b,c', 'd,e,f', 'g,h,i', 'j,k,l']
>>> 

Here you go:

>>> data = """a,b,c
d,e,f
g,h,i
j,k,l"""
>>> data.split()  # split automatically splits through \n and space
['a,b,c', 'd,e,f', 'g,h,i', 'j,k,l']
>>>