Tag archive: string

str.startswith with a list of strings to test for

Question: str.startswith with a list of strings to test for

I’m trying to avoid using so many if statements and comparisons and simply use a list, but not sure how to use it with str.startswith:

if link.lower().startswith("js/") or link.lower().startswith("catalog/") or link.lower().startswith("script/") or link.lower().startswith("scripts/") or link.lower().startswith("katalog/"):
    # then "do something"

What I would like it to be is:

if link.lower().startswith() in ["js","catalog","script","scripts","katalog"]:
    # then "do something"

Any help would be appreciated.


Answer 0

str.startswith allows you to supply a tuple of strings to test for:

if link.lower().startswith(("js", "catalog", "script", "katalog")):

From the docs:

str.startswith(prefix[, start[, end]])

Return True if string starts with the prefix, otherwise return False. prefix can also be a tuple of prefixes to look for.

Below is a demonstration:

>>> "abcde".startswith(("xyz", "abc"))
True
>>> prefixes = ["xyz", "abc"]
>>> "abcde".startswith(tuple(prefixes)) # You must use a tuple though
True
>>>
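
As a minimal sketch of how this applies to the question's list (the helper name has_known_prefix and the sample links are assumptions made up for illustration, not part of the original answer), the prefixes can be stored as a tuple once and reused:

```python
# Prefixes taken from the question; kept as a tuple because str.startswith
# accepts a str or a tuple of str, but not a list.
PREFIXES = ("js/", "catalog/", "script/", "scripts/", "katalog/")


def has_known_prefix(link):
    """Return True if the lower-cased link starts with any known prefix."""
    return link.lower().startswith(PREFIXES)


print(has_known_prefix("JS/app.js"))      # True
print(has_known_prefix("images/a.png"))   # False

# Passing a list raises TypeError, so convert with tuple() first.
try:
    "abcde".startswith(["xyz", "abc"])
except TypeError as exc:
    print("TypeError:", exc)
```

If the prefixes arrive as a list from elsewhere, tuple(prefixes) at the call site does the same job.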

Answer 1

You can also use any() and map() like so (here l is the string to test and x is the list of prefixes):

if any(map(l.startswith, x)):
    pass # Do something

Or alternatively, using a generator expression:

if any(l.startswith(s) for s in x):
    pass # Do something

What is the difference between encode/decode?

Question: What is the difference between encode/decode?

I’ve never been sure that I understand the difference between str/unicode decode and encode.

I know that str().decode() is for when you have a string of bytes that you know has a certain character encoding, given that encoding name it will return a unicode string.

I know that unicode().encode() converts unicode chars into a string of bytes according to a given encoding name.

But I don’t understand what str().encode() and unicode().decode() are for. Can anyone explain, and possibly also correct anything else I’ve gotten wrong above?

EDIT:

Several answers give info on what .encode does on a string, but no-one seems to know what .decode does for unicode.


Answer 0

The decode method of unicode strings really doesn't have any applications at all (unless you have some non-text data in a unicode string for some reason; see below). It is mainly there for historical reasons, I think. In Python 3 it is completely gone.

unicode().decode() will perform an implicit encoding of s using the default (ascii) codec. Verify this like so:

>>> s = u'ö'
>>> s.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 0:
ordinal not in range(128)

>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 0:
ordinal not in range(128)

The error messages are exactly the same.

For str().encode() it’s the other way around — it attempts an implicit decoding of s with the default encoding:

>>> s = 'ö'
>>> s.decode('utf-8')
u'\xf6'
>>> s.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)

Used like this, str().encode() is also superfluous.

But there is another application of the latter method that is useful: there are encodings that have nothing to do with character sets, and thus can be applied to 8-bit strings in a meaningful way:

>>> s.encode('zip')
'x\x9c;\xbc\r\x00\x02>\x01z'

You are right, though: the ambiguous usage of "encoding" for both these applications is... awkward. Again, with separate byte and string types in Python 3, this is no longer an issue.
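
For comparison, here is a small sketch of the Python 3 behaviour mentioned above (the variable names are illustrative). In Python 3, str only has encode and bytes only has decode, so the implicit conversions shown earlier cannot happen:

```python
text = "\u00f6"                 # str: the single character o-umlaut
data = text.encode("utf-8")     # bytes: the UTF-8 encoding of that character

print(data)                     # b'\xc3\xb6'
print(data.decode("utf-8") == text)   # True

# In Python 3 the "wrong direction" methods simply do not exist.
print(hasattr(text, "decode"))  # False
print(hasattr(data, "encode"))  # False
```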


Answer 1

To represent a unicode string as a string of bytes is known as encoding. Use u'...'.encode(encoding).

Example:

    >>> u'æøå'.encode('utf8')
    '\xc3\x83\xc2\xa6\xc3\x83\xc2\xb8\xc3\x83\xc2\xa5'
    >>> u'æøå'.encode('latin1')
    '\xc3\xa6\xc3\xb8\xc3\xa5'
    >>> u'æøå'.encode('ascii')
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: 
    ordinal not in range(128)

You typically encode a unicode string whenever you need to use it for IO, for instance transfer it over the network, or save it to a disk file.

To convert a string of bytes to a unicode string is known as decoding. Use unicode('...', encoding) or '...'.decode(encoding).

Example:

   >>> u'æøå'
   u'\xc3\xa6\xc3\xb8\xc3\xa5' # the interpreter prints the unicode object like so
   >>> unicode('\xc3\xa6\xc3\xb8\xc3\xa5', 'latin1')
   u'\xc3\xa6\xc3\xb8\xc3\xa5'
   >>> '\xc3\xa6\xc3\xb8\xc3\xa5'.decode('latin1')
   u'\xc3\xa6\xc3\xb8\xc3\xa5'

You typically decode a string of bytes whenever you receive string data from the network or from a disk file.

I believe there are some changes in unicode handling in python 3, so the above is probably not correct for python 3.


Answer 2

anUnicode.encode('encoding') results in a string object and can be called on a unicode object.

aString.decode('encoding') results in a unicode object and can be called on a string encoded in the given encoding.


Some more explanation:

You can create a unicode object, which doesn't have any encoding set. The way Python stores it in memory is none of your concern. You can search it, split it and call any string-manipulating function you like.

But there comes a time when you'd like to print your unicode object to the console or write it to a text file. So you have to encode it (for example, in UTF-8): you call encode('utf-8') and you get a byte string (displayed with escapes such as '\x<someNumber>'), which is perfectly printable.

Then again, you may want to do the opposite: read a string encoded in UTF-8 and treat it as Unicode, so that \u360 would be one character, not 5. Then you decode the string (with the selected encoding) and get a brand new object of the unicode type.

As a side note, you can select some unusual encoding, like 'zip', 'base64' or 'rot13', and some of them will convert from string to string, but I believe the most common case is one that involves UTF-8/UTF-16 and strings.


Answer 3

mybytestring.encode(somecodec) is meaningful for these values of somecodec:

  • base64
  • bz2
  • zlib
  • hex
  • quopri
  • rot13
  • string_escape
  • uu

I am not sure what decoding an already decoded unicode text is good for. Trying that with any encoding seems to always try to encode with the system’s default encoding first.


Answer 4

There are a few encodings that can be used to de-/encode from str to str or from unicode to unicode. For example base64, hex or even rot13. They are listed in the codecs module.

Edit:

The decode method on a unicode string can undo the corresponding encode operation:

In [1]: u'0a'.decode('hex')
Out[1]: '\n'

The returned type is str instead of unicode which is unfortunate in my opinion. But when you are not doing a proper en-/decode between str and unicode this looks like a mess anyway.
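
The session above is Python 2. As a hedged sketch of the Python 3 counterpart: there, bytes.decode and str.encode accept only text encodings, and the remaining codecs are reached through the functions in the codecs module instead. The sample values below are illustrative.

```python
import codecs

# Bytes-to-bytes codecs: input and output are both bytes.
print(codecs.decode(b"0a", "hex"))       # b'\n'
print(codecs.encode(b"\n", "hex"))       # b'0a'
print(codecs.encode(b"hi", "base64"))    # b'aGk=\n'

# rot13 is a text-to-text codec: str in, str out.
print(codecs.encode("Hello", "rot13"))   # Uryyb
```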


Answer 5

The simple answer is that they are the exact opposite of each other.

The computer uses the very basic unit of byte to store and process information; it is meaningless for human eyes.

For example, '\xe4\xb8\xad\xe6\x96\x87' is the representation of two Chinese characters, but the computer only knows (meaning it can print or store) that they are Chinese characters when it is given a dictionary in which to look up that Chinese word. In this case it is the "utf-8" dictionary, and the intended Chinese word would not be shown correctly if you looked in a different or wrong dictionary (that is, used a different decoding method).

In the above case, the process by which the computer looks up the Chinese word is decode().

And the process by which the computer writes the Chinese into computer memory is encode().

So the encoded information is the raw bytes, and the decoded information is the raw bytes and the name of the dictionary to reference (but not the dictionary itself).
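
The "dictionary" analogy can be sketched in Python 3 as follows. The bytes are the ones from the example above; latin-1 is merely an assumed stand-in for "a wrong dictionary", and the comparisons avoid printing non-ASCII text so the snippet runs on any console:

```python
raw = b"\xe4\xb8\xad\xe6\x96\x87"     # the six bytes from the example

right = raw.decode("utf-8")           # the intended "dictionary"
wrong = raw.decode("latin-1")         # a different "dictionary"

print(right == "\u4e2d\u6587")        # True: the two Chinese characters
print(len(right), len(wrong))         # 2 6

# Encoding is the reverse step: text plus a codec name gives the bytes back.
print(right.encode("utf-8") == raw)   # True
```

The wrong codec does not fail here; it silently produces six unrelated characters, which is why the codec name has to travel with the bytes.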


Why can't Python's raw string literals end with a single backslash?

Question: Why can't Python's raw string literals end with a single backslash?

Technically, any odd number of backslashes, as described in the documentation.

>>> r'\'
  File "<stdin>", line 1
    r'\'
       ^
SyntaxError: EOL while scanning string literal
>>> r'\\'
'\\\\'
>>> r'\\\'
  File "<stdin>", line 1
    r'\\\'
         ^
SyntaxError: EOL while scanning string literal

It seems like the parser could just treat backslashes in raw strings as regular characters (isn’t that what raw strings are all about?), but I’m probably missing something obvious.


Answer 0

The reason is explained in the part of that section which I highlighted in bold:

String quotes can be escaped with a backslash, but the backslash remains in the string; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote; r"\" is not a valid string literal (even a raw string cannot end in an odd number of backslashes). Specifically, a raw string cannot end in a single backslash (since the backslash would escape the following quote character). Note also that a single backslash followed by a newline is interpreted as those two characters as part of the string, not as a line continuation.

So raw strings are not 100% raw, there is still some rudimentary backslash-processing.
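
A short sketch that checks the quoted rule directly (variable names are illustrative): the backslash before the quote is kept in the string, yet it still stops the quote from ending the literal.

```python
s = r"\""            # backslash + double quote: the backslash stays
print(len(s))        # 2
print(s == '\\"')    # True

t = r"\\"            # an even number of trailing backslashes is fine
print(len(t))        # 2
print(t == "\\\\")   # True
```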


Answer 1

The common misconception about Python's raw strings is that a backslash (within a raw string) is just a regular character like any other. It is NOT. The key to understanding this is the following sentence from Python's tutorial:

When an 'r' or 'R' prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string

So any character following a backslash is part of the raw string. Once the parser enters a raw string (a non-Unicode one) and encounters a backslash, it knows there are 2 characters (the backslash and the character following it).

This way:

r'abc\d' comprises a, b, c, \, d

r'abc\'d' comprises a, b, c, \, ', d

r'abc\'' comprises a, b, c, \, '

and:

r'abc\' comprises a, b, c, \, ' but there is no terminating quote now.

The last case shows that, according to the documentation, the parser cannot find the closing quote, because the last quote you see above is part of the string; that is, a backslash cannot come last here, as it would 'devour' the string's closing character.


Answer 2

That's the way it is! I see it as one of those small defects in Python!

I don't think there's a good reason for it, but it's definitely not a parsing limitation; it's really easy to parse raw strings with \ as the last character.

The catch is, if you allow \ to be the last character in a raw string, then you won't be able to put " inside a raw string. It seems Python went with allowing " instead of allowing \ as the last character.

However, this shouldn't cause any trouble.

If you're worried about not being able to easily write Windows folder paths such as c:\mypath\, then worry not: you can represent them as r"C:\mypath", and if you need to append a subdirectory name, don't do it with string concatenation, for it's not the right way to do it anyway! Use os.path.join:

>>> import os
>>> os.path.join(r"C:\mypath", "subfolder")
'C:\\mypath\\subfolder'
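
The output above is what os.path.join produces on Windows; on other platforms it joins with /. As a sketch that reproduces the result anywhere, the example below uses ntpath, the Windows implementation of os.path that can be imported on any platform (the variable names are illustrative):

```python
import ntpath  # Windows flavour of os.path, importable on every platform

base = r"C:\mypath"                       # no trailing backslash needed
full = ntpath.join(base, "subfolder")
print(repr(full))                         # 'C:\\mypath\\subfolder'

# If the trailing separator itself is wanted, joining with "" adds it.
with_sep = ntpath.join(base, "")
print(repr(with_sep))                     # 'C:\\mypath\\'
```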

Answer 3

To end a raw string with a backslash, I suggest this trick, which relies on adjacent string literals being concatenated:

>>> print r"c:\test"'\\'
c:\test\

Answer 4

Another trick is to use chr(92), as it evaluates to "\".

I recently had to clean a string of backslashes and the following did the trick:

CleanString = DirtyString.replace(chr(92),'')

I realize that this does not take care of the “why” but the thread attracts many people looking for a solution to an immediate problem.


Answer 5

Since \" is allowed inside a raw string, it cannot be used to identify the end of the string literal.

Why not stop parsing the string literal when you encounter the first "?

If that were the case, then \" wouldn't be allowed inside the string literal. But it is.


Answer 6

The reason why r'\' is syntactically incorrect is that, although the string expression is raw, the quotes used (single or double) always have to be escaped, since they would otherwise mark the end of the string. So if you want to express a single quote inside a single-quoted string, there is no other way than using \'. The same applies to double quotes.

But you could use:

'\\'

Answer 7

Another user who has since deleted their answer (not sure if they’d like to be credited) suggested that the Python language designers may be able to simplify the parser design by using the same parsing rules and expanding escaped characters to raw form as an afterthought (if the literal was marked as raw).

I thought it was an interesting idea and am including it as community wiki for posterity.


Answer 8

Despite its role, even a raw string cannot end in a single backslash, because the backslash escapes the following quote character; you still must escape the surrounding quote character to embed it in the string. That is, r"...\" is not a valid string literal: a raw string cannot end in an odd number of backslashes.
If you need to end a raw string with a single backslash, you can use two and slice off the second.
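
A minimal sketch of the "use two and slice off the second" suggestion; the path is an illustrative value:

```python
# Write two trailing backslashes in the raw literal, then drop the last one.
trailing = r"C:\mypath\\"[:-1]

print(trailing)                  # C:\mypath\
print(len(trailing))             # 10
print(trailing.endswith("\\"))   # True
```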


Answer 9

Coming from C, it is pretty clear to me that a single \ works as an escape character, allowing you to put special characters such as newlines, tabs and quotes into strings.

That does indeed disallow \ as the last character, since it will escape the " and make the parser choke. But as pointed out earlier, \\ is legal.


Answer 10

Some tips:

1) If you need to manipulate backslashes for a path, then the standard Python module os.path is your friend. For example:

os.path.normpath('c:/folder1/')

2) If you want to build strings with backslashes in them BUT without a backslash at the END of the string, then a raw string is your friend (use the 'r' prefix before your string literal). For example:

r'\one \two \three'

3) If you need to prefix a string in a variable X with a backslash, then you can do this:

X='dummy'
bs=r'\ ' # don't forget the space after backslash or you will get EOL error
X2=bs[0]+X  # X2 now contains \dummy

4) If you need to create a string with a backslash at the end, then combine tips 2 and 3:

voice_name='upper'
lilypond_display=r'\DisplayLilyMusic \ ' # don't forget the space at the end
lilypond_statement=lilypond_display[:-1]+voice_name

Now lilypond_statement contains "\DisplayLilyMusic \upper"

Long live Python! :)

n3on


Answer 11

I encountered this problem and found a partial solution which is good for some cases. Although a Python string literal cannot end with a single unescaped backslash, the string can still be serialized and saved in a text file with a single backslash at the end. Therefore, if what you need is to save text with a single trailing backslash on your computer, it is possible:

x = 'a string\\' 
x
'a string\\' 

# Now save it in a text file and it will appear with a single backslash:

with open("my_file.txt", 'w') as h:
    h.write(x)

By the way, this does not work with JSON if you dump the string using Python's json library.

Finally, I work with Spyder, and I noticed that if I open the variable in Spyder's text editor by double-clicking its name in the variable explorer, it is presented with a single backslash and can be copied to the clipboard that way (not very helpful for most needs, but maybe for some).


How do I remove leading whitespace in Python?

Question: How do I remove leading whitespace in Python?

I have a text string that starts with a number of spaces, varying between 2 & 4.

What is the simplest way to remove the leading whitespace? (ie. remove everything before a certain character?)

"  Example"   -> "Example"
"  Example  " -> "Example  "
"    Example" -> "Example"

Answer 0

The lstrip() method removes leading whitespace, including newline and tab characters, from the beginning of a string:

>>> '     hello world!'.lstrip()
'hello world!'

Edit

As balpha pointed out in the comments, in order to remove only spaces from the beginning of the string, lstrip(' ') should be used:

>>> '   hello world with 2 spaces and a tab!'.lstrip(' ')
'\thello world with 2 spaces and a tab!'
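
A small sketch applying both forms to the question's examples; the last sample, with a tab and a newline, is an added illustrative value:

```python
samples = ["  Example", "  Example  ", "    Example", "\t\n  Example"]

for s in samples:
    # lstrip() with no argument removes all leading whitespace characters;
    # lstrip(" ") removes leading space characters only.
    print(repr(s), "->", repr(s.lstrip()), "|", repr(s.lstrip(" ")))
```

Trailing whitespace is left alone in both cases, which matches the second example in the question.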


Answer 1

The function strip will remove whitespace from the beginning and end of a string.

my_str = "   text "
my_str = my_str.strip()

will set my_str to "text".


Answer 2

If you want to remove the whitespace before and after the text but keep the whitespace in the middle, you can use:

word = '  Hello World  '
stripped = word.strip()
print(stripped)

Answer 3

To remove everything before a certain character, use a regular expression:

import re
re.sub(r'^[^a]*', '', s)

This removes everything in the string s up to the first 'a'. [^a] can be replaced with any character class you like, such as word characters.
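
A runnable sketch of this approach; the sample strings are made up for illustration. Note that re.sub takes the input string as its third argument:

```python
import re

text = "  xyz: a banana"

# ^[^a]* matches the longest run of non-'a' characters at the start.
print(re.sub(r"^[^a]*", "", text))                    # a banana

# Using \W instead drops leading non-word characters (spaces, dashes, ...)
# and leaves trailing whitespace untouched.
print(repr(re.sub(r"^\W*", "", "   -- Example  ")))   # 'Example  '
```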


Answer 4

The question doesn't address multiline strings, but here is how you would strip leading whitespace from a multiline string using the textwrap module from Python's standard library. If we had a string like:

s = """
    line 1 has 4 leading spaces
    line 2 has 4 leading spaces
    line 3 has 4 leading spaces
"""

If we print(s), we get output like:

>>> print(s)
    line 1 has 4 leading spaces
    line 2 has 4 leading spaces
    line 3 has 4 leading spaces

and if we use textwrap.dedent:

>>> import textwrap
>>> print(textwrap.dedent(s))
line 1 has 4 leading spaces
line 2 has 4 leading spaces
line 3 has 4 leading spaces
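
One detail worth sketching: textwrap.dedent removes only the whitespace common to all lines, so relative indentation is preserved. The string below is an illustrative variation in which one line is indented further; a per-line lstrip is shown as an alternative when every line should be flushed left.

```python
import textwrap

s = """
    line 1 has 4 leading spaces
      line 2 has 6 leading spaces
    line 3 has 4 leading spaces
"""

# dedent() strips the common 4-space margin; strip() drops the outer newlines.
dedented = textwrap.dedent(s).strip()
print(dedented)

# Alternative: remove all leading whitespace from every line.
flushed = "\n".join(line.lstrip() for line in s.strip().splitlines())
print(flushed)
```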

How to convert a 'binary string' to a normal string in Python3?

Question: How to convert a 'binary string' to a normal string in Python3?

For example, I have a string like this (the return value of subprocess.check_output):

>>> b'a string'
b'a string'

Whatever I did to it, it is always printed with the annoying b' before the string:

>>> print(b'a string')
b'a string'
>>> print(str(b'a string'))
b'a string'

Does anyone have any ideas about how to use it as a normal string or convert it into a normal string?


Answer 0

Decode it.

>>> b'a string'.decode('ascii')
'a string'

To get bytes from string, encode it.

>>> 'a string'.encode('ascii')
b'a string'
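
Since the bytes in the question come from subprocess.check_output, here is a sketch of decoding that output. It runs a child Python process so that it does not depend on any particular external command; the UTF-8 codec and the text=True argument (available since Python 3.7) are assumptions about the environment.

```python
import subprocess
import sys

cmd = [sys.executable, "-c", "print('a string')"]

raw = subprocess.check_output(cmd)
print(type(raw).__name__)            # bytes

text = raw.decode("utf-8").strip()   # decode, then drop the trailing newline
print(text)                          # a string

# Alternatively, ask subprocess to decode the output for you.
text2 = subprocess.check_output(cmd, text=True).strip()
print(text2 == text)                 # True
```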

Answer 1

If the answer from falsetru didn’t work you could also try:

>>> b'a string'.decode('utf-8')
'a string'

Answer 2

Please see the official encode() and decode() documentation in the codecs library. utf-8 is the default encoding for these functions, but there are several standard encodings in Python 3, such as latin_1 or utf_32.


How do I put a variable inside a string?

Question: How do I put a variable inside a string?

I would like to put an int into a string. This is what I am doing at the moment:

num = 40
plot.savefig('hanning40.pdf') #problem line

I have to run the program for several different numbers, so I’d like to do a loop. But inserting the variable like this doesn’t work:

plot.savefig('hanning', num, '.pdf')

How do I insert a variable into a Python string?


Answer 0

plot.savefig('hanning(%d).pdf' % num)

The % operator, when following a string, allows you to insert values into that string via format codes (the %d in this case). For more details, see the Python documentation:

https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting


Answer 1

Oh, the many, many ways…

String concatenation:

plot.savefig('hanning' + str(num) + '.pdf')

Conversion Specifier:

plot.savefig('hanning%s.pdf' % num)

Using local variable names:

plot.savefig('hanning%(num)s.pdf' % locals()) # Neat trick

Using str.format():

plot.savefig('hanning{0}.pdf'.format(num)) # Note: This is the new preferred way

Using f-strings:

plot.savefig(f'hanning{num}.pdf') # added in Python 3.6

Using string.Template:

import string
plot.savefig(string.Template('hanning${num}.pdf').substitute(locals()))
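
The question mentions running the program in a loop over several numbers. A minimal sketch of that loop using an f-string follows; the numbers are illustrative, and the plot.savefig call is left as a comment because plot is not defined here:

```python
filenames = []
for num in (40, 80, 160):            # illustrative values only
    filenames.append(f"hanning{num}.pdf")
    # plot.savefig(filenames[-1])    # the question's call would go here

print(filenames)   # ['hanning40.pdf', 'hanning80.pdf', 'hanning160.pdf']

# Format specifications work inside the braces, e.g. zero padding:
padded = f"hanning{7:03d}.pdf"
print(padded)                        # hanning007.pdf
```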

Answer 2

With the introduction of formatted string literals (“f-strings” for short) in Python 3.6, it is now possible to write this with a briefer syntax:

>>> name = "Fred"
>>> f"He said his name is {name}."
'He said his name is Fred.'

With the example given in the question, it would look like this

plot.savefig(f'hanning{num}.pdf')

Answer 3

Not sure exactly what all the code you posted does, but to answer the question posed in the title, you can use + for ordinary string concatenation, together with str().

"hello " + str(10) + " world"  # gives "hello 10 world"

Hope that helps!


Answer 4

In general, you can create strings using:

stringExample = "someString " + str(someNumber)
print(stringExample)
plot.savefig(stringExample)

Answer 5

If you want to put multiple values into the string, you can make use of format:

nums = [1,2,3]
plot.savefig('hanning{0}{1}{2}.pdf'.format(*nums))

This results in the string hanning123.pdf. It can be done with any list or other sequence of the right length.


Answer 6

I had a need for an extended version of this: instead of embedding a single number in a string, I needed to generate a series of file names of the form ‘file1.pdf’, ‘file2.pdf’ etc. This is how it worked:

['file' + str(i) + '.pdf' for i in range(1,4)]

Answer 7

You just have to convert the num variable into a string using

str(num)

Remove all special characters, punctuation and spaces from string

Question: Remove all special characters, punctuation and spaces from string

I need to remove all special characters, punctuation and spaces from a string so that I only have letters and numbers.


Answer 0

This can be done without regex:

>>> string = "Special $#! characters   spaces 888323"
>>> ''.join(e for e in string if e.isalnum())
'Specialcharactersspaces888323'

You can use str.isalnum:

S.isalnum() -> bool

Return True if all characters in S are alphanumeric
and there is at least one character in S, False otherwise.

If you insist on using regex, other solutions will do fine. However note that if it can be done without using a regular expression, that’s the best way to go about it.


回答 1

这是一个正则表达式,用于匹配不是字母或数字的字符串:

[^A-Za-z0-9]+

这是执行正则表达式替换的Python命令:

re.sub('[^A-Za-z0-9]+', '', mystring)

Here is a regex to match a run of characters that are not letters or numbers:

[^A-Za-z0-9]+

Here is the Python command to do a regex substitution:

re.sub('[^A-Za-z0-9]+', '', mystring)

回答 2

较短的方法:

import re
cleanString = re.sub('\W+','', string )

如果要在单词和数字之间留空格,请用”代替’

Shorter way :

import re
cleanString = re.sub(r'\W+', '', string)

If you want spaces between words and numbers, substitute '' with ' '.


回答 3

看到这一点之后,我有兴趣通过找出执行时间最短的方法来扩展所提供的答案,因此我仔细检查了一些建议的答案,并timeit对照了两个示例字符串:

  • string1 = 'Special $#! characters spaces 888323'
  • string2 = 'how much for the maple syrup? $20.99? That s ricidulous!!!'

例子1

''.join(e for e in string if e.isalnum())

  • string1 -结果:10.7061979771
  • string2 -结果:7.77832597694

例子2

import re
re.sub('[^A-Za-z0-9]+', '', string)

  • string1 -结果:7.10785102844
  • string2 -结果:4.12814903259

例子3

import re
re.sub(r'\W+', '', string)

  • string1 -结果:3.11899876595
  • string2 -结果:2.78014397621

以上结果是以下平均值的最低返回结果的乘积: repeat(3, 2000000)

示例3的速度可以比示例13倍。

After seeing this, I was interested in expanding on the provided answers by finding out which executes in the least amount of time, so I went through and checked some of the proposed answers with timeit against two of the example strings:

  • string1 = 'Special $#! characters spaces 888323'
  • string2 = 'how much for the maple syrup? $20.99? That s ricidulous!!!'

Example 1

''.join(e for e in string if e.isalnum())

  • string1 – Result: 10.7061979771
  • string2 – Result: 7.78372597694

Example 2

import re
re.sub('[^A-Za-z0-9]+', '', string)

  • string1 – Result: 7.10785102844
  • string2 – Result: 4.12814903259

Example 3

import re
re.sub(r'\W+', '', string)

  • string1 – Result: 3.11899876595
  • string2 – Result: 2.78014397621

The results above are the lowest of the times returned by timeit's repeat(3, 2000000), i.e. the best of 3 runs of 2,000,000 executions each.

Example 3 can be 3x faster than Example 1.
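For anyone who wants to reproduce this kind of comparison, here is a rough sketch of how it could be set up with timeit. The wrapper function names are hypothetical, and the iteration count is deliberately much smaller than the 2,000,000 used above so that it finishes quickly; absolute numbers will differ by machine and Python version:

```python
import re
import timeit

string1 = 'Special $#! characters spaces 888323'

def with_isalnum(s):
    return ''.join(e for e in s if e.isalnum())

def with_char_class(s):
    return re.sub('[^A-Za-z0-9]+', '', s)

def with_non_word(s):
    return re.sub(r'\W+', '', s)

for func in (with_isalnum, with_char_class, with_non_word):
    # 3 runs of 20000 calls each; keep the fastest run, as in the answer.
    times = timeit.repeat(lambda: func(string1), repeat=3, number=20000)
    print(func.__name__, min(times))
```

Note that the lambda adds a small constant overhead to every measurement, so only the relative ordering is meaningful.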


回答 4

Python 2. *

我认为filter(str.isalnum, string)效果很好

In [20]: filter(str.isalnum, 'string with special chars like !,#$% etcs.')
Out[20]: 'stringwithspecialcharslikeetcs'

Python 3. *

在Python3中,filter( )函数将返回一个可迭代的对象(而不是上面的字符串)。必须重新加入以从itertable中获取字符串:

''.join(filter(str.isalnum, string)) 

或通过list加入使用(不确定,但可以很快

''.join([*filter(str.isalnum, string)])

注意:[*args]Python> = 3.5中解压缩有效

Python 2.*

I think just filter(str.isalnum, string) works

In [20]: filter(str.isalnum, 'string with special chars like !,#$% etcs.')
Out[20]: 'stringwithspecialcharslikeetcs'

Python 3.*

In Python 3, filter() returns an iterator (not a string, unlike in Python 2 above), so the result has to be joined back into a string:

''.join(filter(str.isalnum, string))

or pass a list to join (not sure, but it may be a bit faster):

''.join([*filter(str.isalnum, string)])

Note: unpacking with [*args] is valid from Python 3.5 onward.


回答 5

#!/usr/bin/python
import re

strs = "how much for the maple syrup? $20.99? That's ricidulous!!!"
print strs
nstr = re.sub(r'[?|$|.|!]',r'',strs)
print nstr
nestr = re.sub(r'[^a-zA-Z0-9 ]',r'',nstr)
print nestr

您可以添加更多特殊字符,然后将其替换为“”,则表示没有任何意义,即它们将被删除。

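#!/usr/bin/python
import re

strs = "how much for the maple syrup? $20.99? That's ricidulous!!!"
print(strs)
# Inside [...] the characters ? $ . ! are literal; no "|" separators are needed
nstr = re.sub(r'[?$.!]', '', strs)
print(nstr)
# Then drop everything that is not a letter, a digit or a space
nestr = re.sub(r'[^a-zA-Z0-9 ]', '', nstr)
print(nestr)

You can add more special characters to the first character class; each match is replaced with '' (the empty string), i.e. removed.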


回答 6

与使用正则表达式的其他所有人不同,我将尝试排除想要的每个字符,而不是明确枚举不需要的字符。

例如,如果我只需要’a到z’字符(大写和小写)和数字,我将排除所有其他内容:

import re
s = re.sub(r"[^a-zA-Z0-9]","",s)

这意味着“用空字符串替换每个不是数字的字符,或者用’a到z’或’A到Z’范围内的字符代替”。

实际上,如果^在正则表达式的第一位插入特殊字符,则会得到否定。

额外提示:如果您还需要将结果小写,则可以使正则表达式更快,更轻松,只要您现在找不到大写即可。

import re
s = re.sub(r"[^a-z0-9]","",s.lower())

Unlike the other regex answers, I would exclude every character that is not what I want, instead of explicitly enumerating what I don’t want.

For example, if I want only characters from ‘a to z’ (upper and lower case) and numbers, I would exclude everything else:

import re
s = re.sub(r"[^a-zA-Z0-9]","",s)

This means “substitute every character that is not a number, or a character in the range ‘a to z’ or ‘A to Z’ with an empty string”.

In fact, if you put the special character ^ first inside a character class [...], the class is negated: it matches every character that is not listed.

Extra tip: if you also need to lowercase the result, you can simplify the regex, because after lowercasing there are no uppercase letters left to match.

import re
s = re.sub(r"[^a-z0-9]","",s.lower())

回答 7

假设您要使用正则表达式,并且想要/需要支持2to3的Unicode识别2.x代码:

>>> import re
>>> rx = re.compile(u'[\W_]+', re.UNICODE)
>>> data = u''.join(unichr(i) for i in range(256))
>>> rx.sub(u'', data)
u'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb2 [snip] \xfe\xff'
>>>

Assuming you want to use a regex and you want/need Unicode-cognisant 2.x code that is 2to3-ready:

>>> import re
>>> rx = re.compile(u'[\W_]+', re.UNICODE)
>>> data = u''.join(unichr(i) for i in range(256))
>>> rx.sub(u'', data)
u'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb2 [snip] \xfe\xff'
>>>
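For comparison, here is what the same idea might look like on Python 3 only. This is a sketch, not part of the answer above: in Python 3, str patterns are already Unicode-aware, so the u prefixes, unichr and the re.UNICODE flag are not needed.

```python
import re

# \W leaves the underscore alone, hence the explicit "_" in the class.
rx = re.compile(r'[\W_]+')

data = ''.join(chr(i) for i in range(256))
cleaned = rx.sub('', data)

# The ASCII part of the result: digits, then uppercase, then lowercase letters.
print(cleaned[:62])
```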

回答 8

s = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", s)

回答 9

最通用的方法是使用unicodedata表的“类别”,该表对每个单个字符进行分类。例如,以下代码根据其类别仅过滤可打印字符:

import unicodedata
# strip of crap characters (based on the Unicode database
# categorization:
# http://www.sql-und-xml.de/unicode-database/#kategorien

PRINTABLE = set(('Lu', 'Ll', 'Nd', 'Zs'))

def filter_non_printable(s):
    result = []
    ws_last = False
    for c in s:
        c = unicodedata.category(c) in PRINTABLE and c or u'#'
        result.append(c)
    return u''.join(result).replace(u'#', u' ')

查看上面所有相关类别的给定URL。当然,您也可以按标点符号类别进行过滤。

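The most generic approach is to use the Unicode ‘categories’ from the unicodedata module, which classifies every single character. E.g. the following code keeps only uppercase letters (Lu), lowercase letters (Ll), decimal digits (Nd) and space separators (Zs), and replaces every other character with a space: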

import unicodedata
# strip of crap characters (based on the Unicode database
# categorization:
# http://www.sql-und-xml.de/unicode-database/#kategorien

PRINTABLE = set(('Lu', 'Ll', 'Nd', 'Zs'))

def filter_non_printable(s):
    result = []
    ws_last = False
    for c in s:
        c = unicodedata.category(c) in PRINTABLE and c or u'#'
        result.append(c)
    return u''.join(result).replace(u'#', u' ')

Look at the given URL above for all related categories. You also can of course filter by the punctuation categories.


回答 10

string。标点符号包含以下字符:

‘!“#$%&\’()* +,-。/ :; <=>?@ [\] ^ _`{|}〜’

您可以使用translate和maketrans函数将标点符号映射到空值(替换)

import string

'This, is. A test!'.translate(str.maketrans('', '', string.punctuation))

输出:

'This is A test'

string.punctuation contains the following characters:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

You can use the translate and maketrans functions to map punctuation characters to nothing, i.e. delete them:

import string

'This, is. A test!'.translate(str.maketrans('', '', string.punctuation))

Output:

'This is A test'

回答 11

使用翻译:

import string

def clean(instr):
    return instr.translate(None, string.punctuation + ' ')

警告:仅适用于ASCII字符串。

Use translate:

import string

def clean(instr):
    return instr.translate(None, string.punctuation + ' ')

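Caveat: Only works on ASCII strings. This two-argument form of translate is Python 2 only and works on byte strings (str), not on unicode strings.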
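On Python 3 the two-argument translate call above raises a TypeError. A rough equivalent, sketched here with str.maketrans (the table name DELETE_TABLE is just an illustrative choice):

```python
import string

# Map every ASCII punctuation character and the space to None, i.e. delete them.
DELETE_TABLE = str.maketrans('', '', string.punctuation + ' ')

def clean(instr):
    return instr.translate(DELETE_TABLE)

print(clean('This, is. A test!'))  # ThisisAtest
```

Non-ASCII punctuation is not listed in string.punctuation, so it passes through unchanged.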


回答 12

import re
my_string = """Strings are amongst the most popular data types in Python. We can create the strings by enclosing characters in quotes. Python treats single quotes the same as double quotes."""

# if we need to count the word python that ends with or without ',' or '.' at end

# Assumption: the original never defined `text`; here it is the lowercased word list
text = my_string.lower().split()
for count, word in enumerate(text):
    if word.endswith((".", ",")):
        # Strip a single trailing '.' or ',' from an all-letter word
        text[count] = re.sub(r"^([a-z]+)[.,]?$", r"\1", word)
print("The count of Python : ", text.count("python"))

回答 13

import re
abc = "askhnl#$%askdjalsdk"
ddd = abc.replace("#$%","")
print (ddd)

您将看到的结果为

‘askhnlaskdjalsdk

abc = "askhnl#$%askdjalsdk"
ddd = abc.replace("#$%", "")
print(ddd)

and you will see the result:

askhnlaskdjalsdk


回答 14

删除标点,数字和特殊字符

例子:-

combi['tidy_tweet'] = combi['tidy_tweet'].str.replace("[^a-zA-Z#]", " ") 

谢谢 :)

Removing Punctuations, Numbers, and Special Characters

Example :-

Code

combi['tidy_tweet'] = combi['tidy_tweet'].str.replace("[^a-zA-Z#]", " ", regex=True)

Thanks :)


Create Pandas DataFrame from a string

Question: Create Pandas DataFrame from a string

为了测试某些功能,我想DataFrame从字符串创建一个。假设我的测试数据如下:

TESTDATA="""col1;col2;col3
1;4.4;99
2;4.5;200
3;4.7;65
4;3.2;140
"""

将数据读入熊猫的最简单方法是什么DataFrame

In order to test some functionality I would like to create a DataFrame from a string. Let’s say my test data looks like:

TESTDATA="""col1;col2;col3
1;4.4;99
2;4.5;200
3;4.7;65
4;3.2;140
"""

What is the simplest way to read that data into a Pandas DataFrame?


回答 0

一种简单的方法是使用StringIO.StringIO(python2)io.StringIO(python3)并将其传递给pandas.read_csv函数。例如:

import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd

TESTDATA = StringIO("""col1;col2;col3
    1;4.4;99
    2;4.5;200
    3;4.7;65
    4;3.2;140
    """)

df = pd.read_csv(TESTDATA, sep=";")

A simple way to do this is to use StringIO.StringIO (python2) or io.StringIO (python3) and pass that to the pandas.read_csv function. E.g:

import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd

TESTDATA = StringIO("""col1;col2;col3
    1;4.4;99
    2;4.5;200
    3;4.7;65
    4;3.2;140
    """)

df = pd.read_csv(TESTDATA, sep=";")

回答 1

分割法

data = input_string
df = pd.DataFrame([x.split(';') for x in data.split('\n')])
print(df)

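Split Method

import pandas as pd

data = input_string  # the semicolon-separated text, e.g. TESTDATA from the question
rows = [x.split(';') for x in data.strip().split('\n')]
# Use the first row as the header; note that every value stays a string
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)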

回答 2

交互式工作的快速简便解决方案是通过从剪贴板加载数据来复制和粘贴文本。

用鼠标选择字符串的内容:

在Python Shell中使用 read_clipboard()

>>> pd.read_clipboard()
  col1;col2;col3
0       1;4.4;99
1      2;4.5;200
2       3;4.7;65
3      4;3.2;140

使用适当的分隔符:

>>> pd.read_clipboard(sep=';')
   col1  col2  col3
0     1   4.4    99
1     2   4.5   200
2     3   4.7    65
3     4   3.2   140

>>> df = pd.read_clipboard(sep=';') # save to dataframe

A quick and easy solution for interactive work is to copy-and-paste the text by loading the data from the clipboard.

Select the content of the string with your mouse:

In the Python shell use read_clipboard()

>>> pd.read_clipboard()
  col1;col2;col3
0       1;4.4;99
1      2;4.5;200
2       3;4.7;65
3      4;3.2;140

Use the appropriate separator:

>>> pd.read_clipboard(sep=';')
   col1  col2  col3
0     1   4.4    99
1     2   4.5   200
2     3   4.7    65
3     4   3.2   140

>>> df = pd.read_clipboard(sep=';') # save to dataframe

回答 3

传统的可变宽度CSV无法将数据存储为字符串变量。尤其是在.py文件内部使用时,请考虑使用定宽管道分隔数据。各种IDE和编辑器可能都有一个插件,用于将管道分隔的文本格式化为整齐的表。

使用 read_csv

将以下内容存储在实用程序模块中,例如util/pandas.py。函数的文档字符串中包含一个示例。

import io
import re

import pandas as pd


def read_psv(str_input: str, **kwargs) -> pd.DataFrame:
    """Read a Pandas object from a pipe-separated table contained within a string.

    Input example:
        | int_score | ext_score | eligible |
        |           | 701       | True     |
        | 221.3     | 0         | False    |
        |           | 576       | True     |
        | 300       | 600       | True     |

    The leading and trailing pipes are optional, but if one is present,
    so must be the other.

    `kwargs` are passed to `read_csv`. They must not include `sep`.

    In PyCharm, the "Pipe Table Formatter" plugin has a "Format" feature that can 
    be used to neatly format a table.

    Ref: https://stackoverflow.com/a/46471952/
    """

    substitutions = [
        ('^ *', ''),  # Remove leading spaces
        (' *$', ''),  # Remove trailing spaces
        (r' *\| *', '|'),  # Remove spaces between columns
    ]
    if all(line.lstrip().startswith('|') and line.rstrip().endswith('|') for line in str_input.strip().split('\n')):
        substitutions.extend([
            (r'^\|', ''),  # Remove redundant leading delimiter
            (r'\|$', ''),  # Remove redundant trailing delimiter
        ])
    for pattern, replacement in substitutions:
        str_input = re.sub(pattern, replacement, str_input, flags=re.MULTILINE)
    return pd.read_csv(io.StringIO(str_input), sep='|', **kwargs)

非工作选择

以下代码无法正常运行,因为它在左侧和右侧都添加了一个空列。

df = pd.read_csv(io.StringIO(df_str), sep=r'\s*\|\s*', engine='python')

至于read_fwf,它实际上并没有使用太多read_csv接受和使用的可选kwarg 。因此,它根本不应该用于管道分隔的数据。

This answer applies when a string is manually entered, not when it’s read from somewhere.

A traditional variable-width CSV is unreadable for storing data as a string variable. Especially for use inside a .py file, consider fixed-width pipe-separated data instead. Various IDEs and editors may have a plugin to format pipe-separated text into a neat table.

Using read_csv

Store the following in a utility module, e.g. util/pandas.py. An example is included in the function’s docstring.

import io
import re

import pandas as pd


def read_psv(str_input: str, **kwargs) -> pd.DataFrame:
    """Read a Pandas object from a pipe-separated table contained within a string.

    Input example:
        | int_score | ext_score | eligible |
        |           | 701       | True     |
        | 221.3     | 0         | False    |
        |           | 576       | True     |
        | 300       | 600       | True     |

    The leading and trailing pipes are optional, but if one is present,
    so must be the other.

    `kwargs` are passed to `read_csv`. They must not include `sep`.

    In PyCharm, the "Pipe Table Formatter" plugin has a "Format" feature that can 
    be used to neatly format a table.

    Ref: https://stackoverflow.com/a/46471952/
    """

    substitutions = [
        ('^ *', ''),  # Remove leading spaces
        (' *$', ''),  # Remove trailing spaces
        (r' *\| *', '|'),  # Remove spaces between columns
    ]
    if all(line.lstrip().startswith('|') and line.rstrip().endswith('|') for line in str_input.strip().split('\n')):
        substitutions.extend([
            (r'^\|', ''),  # Remove redundant leading delimiter
            (r'\|$', ''),  # Remove redundant trailing delimiter
        ])
    for pattern, replacement in substitutions:
        str_input = re.sub(pattern, replacement, str_input, flags=re.MULTILINE)
    return pd.read_csv(io.StringIO(str_input), sep='|', **kwargs)

Non-working alternatives

The code below doesn’t work properly because it adds an empty column on both the left and right sides.

df = pd.read_csv(io.StringIO(df_str), sep=r'\s*\|\s*', engine='python')

As for read_fwf, it ignores many of the optional kwargs that read_csv accepts and uses. As such, it shouldn’t be used at all for pipe-separated data.


回答 4

最简单的方法是将其保存到临时文件,然后读取它:

import pandas as pd

CSV_FILE_NAME = 'temp_file.csv'  # Consider creating temp file, look URL below
with open(CSV_FILE_NAME, 'w') as outfile:
    outfile.write(TESTDATA)
df = pd.read_csv(CSV_FILE_NAME, sep=';')

创建临时文件的正确方法:如何在Python中创建tmp文件?

The simplest way is to save it to a temporary file and then read it:

import pandas as pd

CSV_FILE_NAME = 'temp_file.csv'  # Consider creating temp file, look URL below
with open(CSV_FILE_NAME, 'w') as outfile:
    outfile.write(TESTDATA)
df = pd.read_csv(CSV_FILE_NAME, sep=';')

Right way of creating temp file: How can I create a tmp file in Python?


Format timedelta to string

Question: Format timedelta to string

我在格式化datetime.timedelta对象时遇到问题。

这是我要执行的操作:我有一个对象列表,并且该对象类的成员之一是timedelta对象,该对象显示事件的持续时间。我想以小时:分钟的格式显示该持续时间。

我尝试了多种方法来执行此操作,但遇到了困难。我当前的方法是为返回小时和分钟的对象添加方法。我可以将timedelta.seconds除以3600并四舍五入来获得小时数。我在获取剩余秒数并将其转换为分钟时遇到麻烦。

顺便说一句,我将Google AppEngine与Django模板结合使用进行演示。

I’m having trouble formatting a datetime.timedelta object.

Here’s what I’m trying to do: I have a list of objects and one of the members of the class of the object is a timedelta object that shows the duration of an event. I would like to display that duration in the format of hours:minutes.

I have tried a variety of methods for doing this and I’m having difficulty. My current approach is to add methods to the class for my objects that return hours and minutes. I can get the hours by dividing the timedelta.seconds by 3600 and rounding it. I’m having trouble with getting the remainder seconds and converting that to minutes.

By the way, I’m using Google AppEngine with Django Templates for presentation.


回答 0

您可以使用str()将timedelta转换为字符串。这是一个例子:

import datetime
start = datetime.datetime(2009,2,10,14,00)
end   = datetime.datetime(2009,2,10,16,00)
delta = end-start
print(str(delta))
# prints 2:00:00

You can just convert the timedelta to a string with str(). Here’s an example:

import datetime
start = datetime.datetime(2009,2,10,14,00)
end   = datetime.datetime(2009,2,10,16,00)
delta = end-start
print(str(delta))
# prints 2:00:00

回答 1

如您所知,您可以通过访问.seconds属性从timedelta对象获取total_seconds 。

Python提供了内置函数divmod(),该函数允许:

s = 13420
hours, remainder = divmod(s, 3600)
minutes, seconds = divmod(remainder, 60)
print('{:02}:{:02}:{:02}'.format(int(hours), int(minutes), int(seconds)))
# result: 03:43:40

或者您可以将模和减相结合,转换为小时和余数:

# arbitrary number of seconds
s = 13420
# hours
hours = s // 3600 
# remaining seconds
s = s - (hours * 3600)
# minutes
minutes = s // 60
# remaining seconds
seconds = s - (minutes * 60)
# total time
print('{:02}:{:02}:{:02}'.format(int(hours), int(minutes), int(seconds)))
# result: 03:43:40

As you know, you can get the number of seconds from a timedelta object by accessing the .seconds attribute. Note that .seconds leaves out whole days; total_seconds() returns the full duration.

Python provides the builtin function divmod() which allows for:

s = 13420
hours, remainder = divmod(s, 3600)
minutes, seconds = divmod(remainder, 60)
print('{:02}:{:02}:{:02}'.format(int(hours), int(minutes), int(seconds)))
# result: 03:43:40

or you can convert to hours and remainder by using a combination of floor division and subtraction:

# arbitrary number of seconds
s = 13420
# hours
hours = s // 3600 
# remaining seconds
s = s - (hours * 3600)
# minutes
minutes = s // 60
# remaining seconds
seconds = s - (minutes * 60)
# total time
print('{:02}:{:02}:{:02}'.format(int(hours), int(minutes), int(seconds)))
# result: 03:43:40
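To connect this back to the question, the same divmod steps can be applied to an actual timedelta to get the hours:minutes format that was asked for. A minimal sketch, assuming a non-negative duration; the function name hours_minutes is hypothetical:

```python
from datetime import timedelta

def hours_minutes(td):
    # total_seconds() includes whole days, unlike the .seconds attribute.
    total = int(td.total_seconds())
    hours, remainder = divmod(total, 3600)
    minutes, _seconds = divmod(remainder, 60)
    return '{}:{:02}'.format(hours, minutes)

print(hours_minutes(timedelta(seconds=13420)))      # 3:43
print(hours_minutes(timedelta(days=1, minutes=5)))  # 24:05
```

Hours are deliberately not capped at 23, so durations longer than a day show up as 24:05 and so on.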

回答 2

>>> str(datetime.timedelta(hours=10.56))
10:33:36

>>> td = datetime.timedelta(hours=10.505) # any timedelta object
>>> ':'.join(str(td).split(':')[:2])
10:30

如果我们简单地输入,将timedelta对象传递给str()函数将调用相同的格式代码print td。由于您不需要秒数,因此我们可以用冒号(3个部分)分割字符串,然后仅将前2个部分放回去。

>>> import datetime
>>> str(datetime.timedelta(hours=10.56))
'10:33:36'

>>> td = datetime.timedelta(hours=10.505) # any timedelta object
>>> ':'.join(str(td).split(':')[:2])
'10:30'

Passing the timedelta object to the str() function calls the same formatting code that print(td) uses. Since you don’t want the seconds, we can split the string by colons (3 parts) and put it back together with only the first 2 parts.


回答 3

def td_format(td_object):
    seconds = int(td_object.total_seconds())
    periods = [
        ('year',        60*60*24*365),
        ('month',       60*60*24*30),
        ('day',         60*60*24),
        ('hour',        60*60),
        ('minute',      60),
        ('second',      1)
    ]

    strings=[]
    for period_name, period_seconds in periods:
        # >= (not >) so that exact multiples such as 60 seconds become "1 minute"
        if seconds >= period_seconds:
            period_value , seconds = divmod(seconds, period_seconds)
            has_s = 's' if period_value > 1 else ''
            strings.append("%s %s%s" % (period_value, period_name, has_s))

    return ", ".join(strings)

回答 4

我个人使用该humanize库:

>>> import datetime
>>> humanize.naturalday(datetime.datetime.now())
'today'
>>> humanize.naturalday(datetime.datetime.now() - datetime.timedelta(days=1))
'yesterday'
>>> humanize.naturalday(datetime.date(2007, 6, 5))
'Jun 05'
>>> humanize.naturaldate(datetime.date(2007, 6, 5))
'Jun 05 2007'
>>> humanize.naturaltime(datetime.datetime.now() - datetime.timedelta(seconds=1))
'a second ago'
>>> humanize.naturaltime(datetime.datetime.now() - datetime.timedelta(seconds=3600))
'an hour ago'

当然,它并不能完全给您为您的答案(的确是,str(timeA - timeB)但是我发现,一旦您超过几个小时,显示就会迅速变得不可读。humanize它支持更大的值易于阅读,并且位置很好。

contrib.humanize显然,它是受Django 模块启发的,因此,由于您使用的是Django,您应该使用它。

I personally use the humanize library for this:

>>> import datetime
>>> import humanize
>>> humanize.naturalday(datetime.datetime.now())
'today'
>>> humanize.naturalday(datetime.datetime.now() - datetime.timedelta(days=1))
'yesterday'
>>> humanize.naturalday(datetime.date(2007, 6, 5))
'Jun 05'
>>> humanize.naturaldate(datetime.date(2007, 6, 5))
'Jun 05 2007'
>>> humanize.naturaltime(datetime.datetime.now() - datetime.timedelta(seconds=1))
'a second ago'
>>> humanize.naturaltime(datetime.datetime.now() - datetime.timedelta(seconds=3600))
'an hour ago'

Of course, it doesn’t give you exactly the answer you were looking for (which is, indeed, str(timeA - timeB)), but I have found that once you go beyond a few hours, the display quickly becomes unreadable. humanize has support for much larger values that are human-readable, and is also well localized.

It’s inspired by Django’s contrib.humanize module, apparently, so since you are using Django, you should probably use that.


回答 5

他已经有一个timedelta对象,所以为什么不使用其内置方法total_seconds()将其转换为秒,然后使用divmod()获得小时和分钟呢?

hours, remainder = divmod(myTimeDelta.total_seconds(), 3600)
minutes, seconds = divmod(remainder, 60)

# Formatted only for hours and minutes as requested
print '%s:%s' % (hours, minutes)

无论时间增量是偶数天还是数年,这都有效。

He already has a timedelta object so why not use its built-in method total_seconds() to convert it to seconds, then use divmod() to get hours and minutes?

hours, remainder = divmod(myTimeDelta.total_seconds(), 3600)
minutes, seconds = divmod(remainder, 60)

# Formatted only for hours and minutes as requested
# total_seconds() returns a float, so format the values as integers
print('%d:%02d' % (hours, minutes))

This works even if the time delta spans days or years.


回答 6

这是一个通用函数,用于将timedelta对象或常规数字(以秒或分钟等形式)转换为格式正确的字符串。我对一个重复的问题做了mpounsett的出色回答,使其更加灵活,可读性更高,并增加了文档。

您会发现,到目前为止,这是最灵活的答案,因为它可以使您:

  1. 即时自定义字符串格式,而不是对其进行硬编码。
  2. 留出一定的时间间隔没有问题(请参见下面的示例)。

功能:

from string import Formatter
from datetime import timedelta

def strfdelta(tdelta, fmt='{D:02}d {H:02}h {M:02}m {S:02}s', inputtype='timedelta'):
    """Convert a datetime.timedelta object or a regular number to a custom-
    formatted string, just like the strftime() method does for datetime.datetime
    objects.

    The fmt argument allows custom formatting to be specified.  Fields can 
    include seconds, minutes, hours, days, and weeks.  Each field is optional.

    Some examples:
        '{D:02}d {H:02}h {M:02}m {S:02}s' --> '05d 08h 04m 02s' (default)
        '{W}w {D}d {H}:{M:02}:{S:02}'     --> '4w 5d 8:04:02'
        '{D:2}d {H:2}:{M:02}:{S:02}'      --> ' 5d  8:04:02'
        '{H}h {S}s'                       --> '72h 800s'

    The inputtype argument allows tdelta to be a regular number instead of the  
    default, which is a datetime.timedelta object.  Valid inputtype strings: 
        's', 'seconds', 
        'm', 'minutes', 
        'h', 'hours', 
        'd', 'days', 
        'w', 'weeks'
    """

    # Convert tdelta to integer seconds.
    if inputtype == 'timedelta':
        remainder = int(tdelta.total_seconds())
    elif inputtype in ['s', 'seconds']:
        remainder = int(tdelta)
    elif inputtype in ['m', 'minutes']:
        remainder = int(tdelta)*60
    elif inputtype in ['h', 'hours']:
        remainder = int(tdelta)*3600
    elif inputtype in ['d', 'days']:
        remainder = int(tdelta)*86400
    elif inputtype in ['w', 'weeks']:
        remainder = int(tdelta)*604800

    f = Formatter()
    desired_fields = [field_tuple[1] for field_tuple in f.parse(fmt)]
    possible_fields = ('W', 'D', 'H', 'M', 'S')
    constants = {'W': 604800, 'D': 86400, 'H': 3600, 'M': 60, 'S': 1}
    values = {}
    for field in possible_fields:
        if field in desired_fields and field in constants:
            values[field], remainder = divmod(remainder, constants[field])
    return f.format(fmt, **values)

演示:

>>> td = timedelta(days=2, hours=3, minutes=5, seconds=8, microseconds=340)

>>> print strfdelta(td)
02d 03h 05m 08s

>>> print strfdelta(td, '{D}d {H}:{M:02}:{S:02}')
2d 3:05:08

>>> print strfdelta(td, '{D:2}d {H:2}:{M:02}:{S:02}')
 2d  3:05:08

>>> print strfdelta(td, '{H}h {S}s')
51h 308s

>>> print strfdelta(12304, inputtype='s')
00d 03h 25m 04s

>>> print strfdelta(620, '{H}:{M:02}', 'm')
10:20

>>> print strfdelta(49, '{D}d {H}h', 'h')
2d 1h

Here is a general purpose function for converting either a timedelta object or a regular number (in the form of seconds or minutes, etc.) to a nicely formatted string. I took mpounsett’s fantastic answer on a duplicate question, made it a bit more flexible, improved readability, and added documentation.

You will find that it is the most flexible answer here so far since it allows you to:

  1. Customize the string format on the fly instead of it being hard-coded.
  2. Leave out certain time intervals without a problem (see examples below).

Function:

from string import Formatter
from datetime import timedelta

def strfdelta(tdelta, fmt='{D:02}d {H:02}h {M:02}m {S:02}s', inputtype='timedelta'):
    """Convert a datetime.timedelta object or a regular number to a custom-
    formatted string, just like the strftime() method does for datetime.datetime
    objects.

    The fmt argument allows custom formatting to be specified.  Fields can 
    include seconds, minutes, hours, days, and weeks.  Each field is optional.

    Some examples:
        '{D:02}d {H:02}h {M:02}m {S:02}s' --> '05d 08h 04m 02s' (default)
        '{W}w {D}d {H}:{M:02}:{S:02}'     --> '4w 5d 8:04:02'
        '{D:2}d {H:2}:{M:02}:{S:02}'      --> ' 5d  8:04:02'
        '{H}h {S}s'                       --> '72h 800s'

    The inputtype argument allows tdelta to be a regular number instead of the  
    default, which is a datetime.timedelta object.  Valid inputtype strings: 
        's', 'seconds', 
        'm', 'minutes', 
        'h', 'hours', 
        'd', 'days', 
        'w', 'weeks'
    """

    # Convert tdelta to integer seconds.
    if inputtype == 'timedelta':
        remainder = int(tdelta.total_seconds())
    elif inputtype in ['s', 'seconds']:
        remainder = int(tdelta)
    elif inputtype in ['m', 'minutes']:
        remainder = int(tdelta)*60
    elif inputtype in ['h', 'hours']:
        remainder = int(tdelta)*3600
    elif inputtype in ['d', 'days']:
        remainder = int(tdelta)*86400
    elif inputtype in ['w', 'weeks']:
        remainder = int(tdelta)*604800

    f = Formatter()
    desired_fields = [field_tuple[1] for field_tuple in f.parse(fmt)]
    possible_fields = ('W', 'D', 'H', 'M', 'S')
    constants = {'W': 604800, 'D': 86400, 'H': 3600, 'M': 60, 'S': 1}
    values = {}
    for field in possible_fields:
        if field in desired_fields and field in constants:
            values[field], remainder = divmod(remainder, constants[field])
    return f.format(fmt, **values)

Demo:

>>> td = timedelta(days=2, hours=3, minutes=5, seconds=8, microseconds=340)

>>> print(strfdelta(td))
02d 03h 05m 08s

>>> print(strfdelta(td, '{D}d {H}:{M:02}:{S:02}'))
2d 3:05:08

>>> print(strfdelta(td, '{D:2}d {H:2}:{M:02}:{S:02}'))
 2d  3:05:08

>>> print(strfdelta(td, '{H}h {S}s'))
51h 308s

>>> print(strfdelta(12304, inputtype='s'))
00d 03h 25m 04s

>>> print(strfdelta(620, '{H}:{M:02}', 'm'))
10:20

>>> print(strfdelta(49, '{D}d {H}h', 'h'))
2d 1h

回答 7

我知道这是一个古老的已回答问题,但我datetime.utcfromtimestamp()为此使用了。它需要秒数并返回datetime可以像其他格式一样设置的datetime

duration = datetime.utcfromtimestamp(end - begin)
print duration.strftime('%H:%M')

只要您停留在合法的时间范围内,它就应该起作用,即,当小时数小于等于23时,它不会返回1234:35。

I know that this is an old answered question, but I use datetime.utcfromtimestamp() for this. It takes the number of seconds and returns a datetime that can be formatted like any other datetime.

duration = datetime.utcfromtimestamp(end - begin)
print(duration.strftime('%H:%M'))

This works as long as the duration is under 24 hours: the hour field of a datetime never exceeds 23, so the result wraps around rather than returning something like 1234:35.
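The 24-hour limit is easy to see with a small experiment. The sketch below uses a slightly different technique from the answer (it adds the timedelta to an arbitrary midnight instead of calling utcfromtimestamp), but it relies on the same idea of borrowing strftime from datetime; the function name as_clock and the choice of date are arbitrary:

```python
from datetime import datetime, timedelta

def as_clock(delta):
    # Adding the delta to an arbitrary midnight gives a datetime to format.
    return (datetime(2000, 1, 1) + delta).strftime('%H:%M')

print(as_clock(timedelta(hours=2, minutes=5)))  # 02:05
print(as_clock(timedelta(hours=25)))            # 01:00 (the extra day is dropped)
```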


回答 8

发问者想要比典型的更好的格式:

  >>> import datetime
  >>> datetime.timedelta(seconds=41000)
  datetime.timedelta(0, 41000)
  >>> str(datetime.timedelta(seconds=41000))
  '11:23:20'
  >>> str(datetime.timedelta(seconds=4102.33))
  '1:08:22.330000'
  >>> str(datetime.timedelta(seconds=413302.33))
  '4 days, 18:48:22.330000'

因此,实际上有两种格式,一种格式的天数为0,而被忽略,另一种格式的文本为“ n days,h:m:s”。但是,秒可能会有分数,并且打印输出中没有前导零,所以列很乱。

如果您喜欢,这是我的日常工作:

def printNiceTimeDelta(stime, etime):
    delay = datetime.timedelta(seconds=(etime - stime))
    if (delay.days > 0):
        out = str(delay).replace(" days, ", ":")
    else:
        out = "0:" + str(delay)
    outAr = out.split(':')
    outAr = ["%02d" % (int(float(x))) for x in outAr]
    out   = ":".join(outAr)
    return out

这将以dd:hh:mm:ss格式返回输出:

00:00:00:15
00:00:00:19
02:01:31:40
02:01:32:22

我确实考虑过要增加几年,但这留给读者练习,因为输出在1年以上是安全的:

>>> str(datetime.timedelta(seconds=99999999))
'1157 days, 9:46:39'

Questioner wants a nicer format than the typical:

  >>> import datetime
  >>> datetime.timedelta(seconds=41000)
  datetime.timedelta(0, 41000)
  >>> str(datetime.timedelta(seconds=41000))
  '11:23:20'
  >>> str(datetime.timedelta(seconds=4102.33))
  '1:08:22.330000'
  >>> str(datetime.timedelta(seconds=413302.33))
  '4 days, 18:48:22.330000'

So there are really two formats: one where days are 0 and the day part is left out, and another with the text “n days, h:m:s”. In addition, the seconds may have fractions and there are no leading zeroes in the printouts, so columns are messy.

Here’s my routine, if you like it:

import datetime

def printNiceTimeDelta(stime, etime):
    delay = datetime.timedelta(seconds=(etime - stime))
    if (delay.days > 0):
        # str() writes "1 day, ..." for exactly one day and "N days, ..." otherwise
        out = str(delay).replace(" days, ", ":").replace(" day, ", ":")
    else:
        out = "0:" + str(delay)
    outAr = out.split(':')
    outAr = ["%02d" % (int(float(x))) for x in outAr]
    out   = ":".join(outAr)
    return out

this returns output as dd:hh:mm:ss format:

00:00:00:15
00:00:00:19
02:01:31:40
02:01:32:22

I did think about adding years to this, but this is left as an exercise for the reader, since the output is safe at over 1 year:

>>> str(datetime.timedelta(seconds=99999999))
'1157 days, 9:46:39'

回答 9

我会在这里认真考虑Occam的Razor方法:

td = str(timedelta).split('.')[0]

这将返回不带微秒的字符串

如果要重新生成datetime.timedelta对象,请执行以下操作:

h,m,s = re.split(':', td)
new_delta = datetime.timedelta(hours=int(h),minutes=int(m),seconds=int(s))

2年过去了,我喜欢这种语言!

I would seriously consider the Occam’s Razor approach here:

td = str(timedelta).split('.')[0]

This returns a string without the microseconds

If you want to regenerate the datetime.timedelta object, just do this:

h, m, s = td.split(':')  # assumes the delta is under one day (no "N days, " prefix)
new_delta = datetime.timedelta(hours=int(h),minutes=int(m),seconds=int(s))

2 years in, I love this language!


Answer 10

My datetime.timedelta objects went greater than a day, so here is a further problem. All the discussion above assumes less than a day. A timedelta actually stores days, seconds and microseconds as separate fields. The above discussion should use td.seconds as joe did, but if you have days they are NOT included in the seconds value.

I am getting a span of time between 2 datetimes and printing days and hours.

span = currentdt - previousdt
print('%d,%d\n' % (span.days, span.seconds // 3600))
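
To make the difference concrete, here is a small sketch (the variable names are illustrative) comparing td.seconds with td.total_seconds() for a span longer than a day:

```python
import datetime

span = datetime.timedelta(days=2, hours=3)

# .seconds only covers the part of the span beyond whole days.
print(span.days, span.seconds // 3600)      # 2 3

# total_seconds() covers the whole span, days included.
total_hours = int(span.total_seconds() // 3600)
print(total_hours)                          # 51
```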

Answer 11

I used the humanfriendly Python library to do this; it works very well.

import humanfriendly
from datetime import timedelta
delta = timedelta(seconds = 321)
humanfriendly.format_timespan(delta)

'5 minutes and 21 seconds'

Available at https://pypi.org/project/humanfriendly/


Answer 12

Following Joe’s example value above, I’d use the modulus arithmetic operator, thusly:

import datetime

td = datetime.timedelta(hours=10.56)
td_str = "%d:%02d" % (td.seconds // 3600, td.seconds % 3600 // 60)

Note that the // operator performs integer (floor) division and rounds down; in Python 3 a plain / returns a float. If you want to be more explicit, use math.floor() or math.ceil() as appropriate.
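
divmod() performs the division and the modulus in one step. A short sketch along the same lines, assuming a span of less than one day (td.seconds does not include whole days):

```python
import datetime

td = datetime.timedelta(hours=10.56)

# Split the seconds into hours and leftover seconds, then into minutes.
hours, rem = divmod(td.seconds, 3600)
minutes, _ = divmod(rem, 60)
td_str = "%d:%02d" % (hours, minutes)
print(td_str)  # 10:33
```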


Answer 13

def seconds_to_time_left_string(total_seconds):
    s = int(total_seconds)
    years = s // 31104000
    if years > 1:
        return '%d years' % years
    s = s - (years * 31104000)
    months = s // 2592000
    if years == 1:
        r = 'one year'
        if months > 0:
            r += ' and %d months' % months
        return r
    if months > 1:
        return '%d months' % months
    s = s - (months * 2592000)
    days = s // 86400
    if months == 1:
        r = 'one month'
        if days > 0:
            r += ' and %d days' % days
        return r
    if days > 1:
        return '%d days' % days
    s = s - (days * 86400)
    hours = s // 3600
    if days == 1:
        r = 'one day'
        if hours > 0:
            r += ' and %d hours' % hours
        return r 
    s = s - (hours * 3600)
    minutes = s // 60
    seconds = s - (minutes * 60)
    if hours >= 6:
        return '%d hours' % hours
    if hours >= 1:
        r = '%d hours' % hours
        if hours == 1:
            r = 'one hour'
        if minutes > 0:
            r += ' and %d minutes' % minutes
        return r
    if minutes == 1:
        r = 'one minute'
        if seconds > 0:
            r += ' and %d seconds' % seconds
        return r
    if minutes == 0:
        return '%d seconds' % seconds
    if seconds == 0:
        return '%d minutes' % minutes
    return '%d minutes and %d seconds' % (minutes, seconds)

for i in range(10):
    print(pow(8, i), seconds_to_time_left_string(pow(8, i)))


Output:
1 1 seconds
8 8 seconds
64 one minute and 4 seconds
512 8 minutes and 32 seconds
4096 one hour and 8 minutes
32768 9 hours
262144 3 days
2097152 24 days
16777216 6 months
134217728 4 years

Answer 14

I had a similar problem with the output of an overtime calculation at work. The value should always show up in HH:MM, even when it is greater than one day, and the value can be negative. I combined some of the solutions shown, and maybe someone else will find this useful. I realized that if the timedelta value is negative, most of the solutions shown that use divmod don’t work out of the box:

def td2HHMMstr(td):
  '''Convert timedelta objects to a HH:MM string with (+/-) sign'''
  if td < datetime.timedelta(seconds=0):
    sign='-'
    td = -td
  else:
    sign = ''
  tdhours, rem = divmod(td.total_seconds(), 3600)
  tdminutes, rem = divmod(rem, 60)
  tdstr = '{}{:}:{:02d}'.format(sign, int(tdhours), int(tdminutes))
  return tdstr

timedelta to HH:MM string:

td2HHMMstr(datetime.timedelta(hours=1, minutes=45))
'1:45'

td2HHMMstr(datetime.timedelta(days=2, hours=3, minutes=2))
'51:02'

td2HHMMstr(datetime.timedelta(hours=-3, minutes=-2))
'-3:02'

td2HHMMstr(datetime.timedelta(days=-35, hours=-3, minutes=-2))
'-843:02'
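
The reason negative values need special handling is that timedelta normalizes itself so that only the days field is negative. A quick illustration using only the standard library:

```python
import datetime

td = datetime.timedelta(hours=-3, minutes=-2)

# Normalized form: -1 day plus a positive number of seconds.
print(td.days, td.seconds)   # -1 75480
print(str(td))               # -1 day, 20:58:00

# Negating first gives a plain positive duration that divmod handles well.
print(divmod((-td).total_seconds(), 3600))  # (3.0, 120.0)
```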

Answer 15

import datetime
hours = datetime.timedelta(hours=16, minutes=30)
print((datetime.datetime(1,1,1) + hours).strftime('%H:%M'))
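
One caveat worth knowing: because the timedelta is added to a date, the hour field wraps around after 24 hours. A small sketch showing the effect (the variable names are illustrative):

```python
import datetime

base = datetime.datetime(1, 1, 1)

short = datetime.timedelta(hours=16, minutes=30)
long_span = datetime.timedelta(hours=30, minutes=5)

print((base + short).strftime('%H:%M'))      # 16:30
print((base + long_span).strftime('%H:%M'))  # 06:05, the extra day is lost
```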

Answer 16

A straightforward template filter for this problem. The built-in function int() truncates and never rounds up. f-strings (i.e. f'...') require Python 3.6 or later.

@app_template_filter()
def diffTime(end, start):
    diff = (end - start).total_seconds()
    d = int(diff / 86400)
    h = int((diff - (d * 86400)) / 3600)
    m = int((diff - (d * 86400 + h * 3600)) / 60)
    s = int((diff - (d * 86400 + h * 3600 + m *60)))
    if d > 0:
        fdiff = f'{d}d {h}h {m}m {s}s'
    elif h > 0:
        fdiff = f'{h}h {m}m {s}s'
    elif m > 0:
        fdiff = f'{m}m {s}s'
    else:
        fdiff = f'{s}s'
    return fdiff
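
Outside of a template engine the same breakdown can be written with divmod(). This is a sketch only; the function name format_diff is an assumption made for the example, and it expects a non-negative number of seconds:

```python
def format_diff(total_seconds):
    """Format a non-negative number of seconds as e.g. '1d 2h 3m 4s'."""
    seconds = int(total_seconds)          # drop any fractional part
    d, rem = divmod(seconds, 86400)
    h, rem = divmod(rem, 3600)
    m, s = divmod(rem, 60)
    if d > 0:
        return f'{d}d {h}h {m}m {s}s'
    if h > 0:
        return f'{h}h {m}m {s}s'
    if m > 0:
        return f'{m}m {s}s'
    return f'{s}s'

print(format_diff(93784))  # 1d 2h 3m 4s
```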

Answer 17

from datetime import timedelta
from django.utils.translation import ngettext

def localize_timedelta(delta):
    ret = []
    num_years = int(delta.days / 365)
    if num_years > 0:
        delta -= timedelta(days=num_years * 365)
        ret.append(ngettext('%d year', '%d years', num_years) % num_years)

    if delta.days > 0:
        ret.append(ngettext('%d day', '%d days', delta.days) % delta.days)

    num_hours = int(delta.seconds / 3600)
    if num_hours > 0:
        delta -= timedelta(hours=num_hours)
        ret.append(ngettext('%d hour', '%d hours', num_hours) % num_hours)

    num_minutes = int(delta.seconds / 60)
    if num_minutes > 0:
        ret.append(ngettext('%d minute', '%d minutes', num_minutes) % num_minutes)

    return ' '.join(ret)

This will produce:

>>> from datetime import timedelta
>>> localize_timedelta(timedelta(days=3660, minutes=500))
'10 years 10 days 8 hours 20 minutes'

Answer 18

Please check this function; it converts a timedelta object into an 'HH:MM:SS' string:

def format_timedelta(td):
    hours, remainder = divmod(td.total_seconds(), 3600)
    minutes, seconds = divmod(remainder, 60)
    hours, minutes, seconds = int(hours), int(minutes), int(seconds)
    if hours < 10:
        hours = '0%s' % int(hours)
    if minutes < 10:
        minutes = '0%s' % minutes
    if seconds < 10:
        seconds = '0%s' % seconds
    return '%s:%s:%s' % (hours, minutes, seconds)

Answer 19

A one-liner. Since timedeltas do not offer datetime’s strftime, bring the timedelta back to a datetime and use strftime.

This not only achieves the OP’s requested Hours:Minutes format, it also lets you leverage the full formatting power of datetime’s strftime should your requirements change to another representation.

import datetime
td = datetime.timedelta(hours=2, minutes=10, seconds=5)
print(td)
print(datetime.datetime.strftime(datetime.datetime.strptime(str(td), "%H:%M:%S"), "%H:%M"))

Output:
2:10:05
02:10

This also solves the annoyance that timedeltas are formatted into strings as H:MM:SS rather than HH:MM:SS, which led me to this problem and to the solution I’ve shared.
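
Note that this round trip depends on str(td) matching the %H:%M:%S pattern, which only holds for spans shorter than 24 hours with no fractional seconds. A sketch of a guarded variant, where the helper name td_to_hhmm is invented for this example:

```python
import datetime

def td_to_hhmm(td):
    """Format a timedelta as HH:MM via strptime/strftime, or return None
    when str(td) does not fit the %H:%M:%S pattern."""
    try:
        parsed = datetime.datetime.strptime(str(td), "%H:%M:%S")
    except ValueError:
        # e.g. "1 day, 1:00:00" or "0:00:01.500000"
        return None
    return parsed.strftime("%H:%M")

print(td_to_hhmm(datetime.timedelta(hours=2, minutes=10, seconds=5)))  # 02:10
print(td_to_hhmm(datetime.timedelta(days=1, hours=1)))                 # None
```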


Answer 20

If you already have a timedelta object, just convert it into a string, remove the last 3 characters and print the result. This truncates the seconds part and prints the rest in the format Hours:Minutes.

t = str(timedeltaobj)

print(t[:-3])
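
The slice relies on the string ending in ':SS'. When the timedelta carries microseconds, str() appends a fractional part and the last three characters are no longer the seconds field. A short sketch of one way around that, assuming it is acceptable to discard the microseconds first:

```python
import datetime

td = datetime.timedelta(hours=1, minutes=5, seconds=9, microseconds=250000)
print(str(td))          # 1:05:09.250000
print(str(td)[:-3])     # 1:05:09.250, not hours:minutes

# Drop the microseconds so the string ends in ':SS' again.
whole = td - datetime.timedelta(microseconds=td.microseconds)
print(str(whole)[:-3])  # 1:05
```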

Answer 21

If you happen to have IPython in your packages (you should), it has (up to now, anyway) a very nice formatter for durations (in float seconds). That is used in various places, for example by the %%time cell magic. I like the format it produces for short durations:

>>> from IPython.core.magics.execution import _format_time
>>> 
>>> for v in range(-9, 10, 2):
...     dt = 1.25 * 10**v
...     print(_format_time(dt))

1.25 ns
125 ns
12.5 µs
1.25 ms
125 ms
12.5 s
20min 50s
1d 10h 43min 20s
144d 16h 13min 20s
14467d 14h 13min 20s

Answer 22

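import datetime

# The function name is only illustrative; the original snippet showed just its body.
def elapsed(StartTime, EndTime):
    t1 = datetime.datetime.strptime(StartTime, "%H:%M:%S %d-%m-%y")
    t2 = datetime.datetime.strptime(EndTime, "%H:%M:%S %d-%m-%y")
    return str(t2 - t1)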

So for:

StartTime = '15:28:53 21-07-13'
EndTime = '15:32:40 21-07-13'

returns:

'0:03:47'

Answer 23

Thanks everyone for your help. I took many of your ideas and put them together, let me know what you think.

I added two methods to the class like this:

def hours(self):
    retval = ""
    if self.totalTime:
        # Floor division keeps whole hours; round() could round up past the real value
        retval = self.totalTime.seconds // 3600
    return retval

def minutes(self):
    retval = ""
    if self.totalTime:
        # Whole minutes left over after removing the whole hours
        retval = self.totalTime.seconds // 60 - self.hours() * 60
    return retval

In my Django template I used this (sum is the object and it is in a dictionary):

<td>{{ sum.0 }}</td>    
<td>{{ sum.1.hours|stringformat:"d" }}:{{ sum.1.minutes|stringformat:"#02.0d" }}</td>

TypeError: 'str' does not support the buffer interface

Question: TypeError: 'str' does not support the buffer interface

plaintext = input("Please enter the text you want to compress")
filename = input("Please enter the desired filename")
with gzip.open(filename + ".gz", "wb") as outfile:
    outfile.write(plaintext) 

The above python code is giving me following error:

Traceback (most recent call last):
  File "C:/Users/Ankur Gupta/Desktop/Python_works/gzip_work1.py", line 33, in <module>
    compress_string()
  File "C:/Users/Ankur Gupta/Desktop/Python_works/gzip_work1.py", line 15, in compress_string
    outfile.write(plaintext)
  File "C:\Python32\lib\gzip.py", line 312, in write
    self.crc = zlib.crc32(data, self.crc) & 0xffffffff
TypeError: 'str' does not support the buffer interface

Answer 0

If you use Python 3.x, then str is not the same type as in Python 2.x; you must convert it to bytes (encode it).

plaintext = input("Please enter the text you want to compress")
filename = input("Please enter the desired filename")
with gzip.open(filename + ".gz", "wb") as outfile:
    outfile.write(bytes(plaintext, 'UTF-8'))

Also, do not use variable names like string or file, since those are names of a module or function.

EDIT @Tom

Yes, non-ASCII text is also compressed/decompressed. I use Polish letters with UTF-8 encoding:

plaintext = 'Polish text: ąćęłńóśźżĄĆĘŁŃÓŚŹŻ'
filename = 'foo.gz'
with gzip.open(filename, 'wb') as outfile:
    outfile.write(bytes(plaintext, 'UTF-8'))
with gzip.open(filename, 'r') as infile:
    outfile_content = infile.read().decode('UTF-8')
print(outfile_content)

Answer 1

There is an easier solution to this problem.

You just need to add a t to the mode so it becomes wt. This causes Python to open the file as a text file and not binary, so the str is encoded for you. Then everything will just work. (Text mode for gzip.open requires Python 3.3 or later.)

The complete program becomes this:

plaintext = input("Please enter the text you want to compress")
filename = input("Please enter the desired filename")
with gzip.open(filename + ".gz", "wt") as outfile:
    outfile.write(plaintext)
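
A round trip in text mode can be sketched as follows. The file location and sample text are placeholders, and the encoding argument is passed explicitly so the result does not depend on the platform default:

```python
import gzip
import os
import tempfile

plaintext = "Polish text: ąćęłńóśźż"
path = os.path.join(tempfile.mkdtemp(), "example.gz")  # placeholder location

# "wt" = write text: gzip encodes the str to bytes before compressing.
with gzip.open(path, "wt", encoding="utf-8") as outfile:
    outfile.write(plaintext)

# "rt" = read text: the bytes are decompressed and decoded back to str.
with gzip.open(path, "rt", encoding="utf-8") as infile:
    restored = infile.read()

print(restored == plaintext)  # True
```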

Answer 2

You cannot serialize a Python 3 ‘string’ to bytes without explicit conversion to some encoding.

outfile.write(plaintext.encode('utf-8'))

is possibly what you want. This also works for both Python 2.x and 3.x.


Answer 3

For Python 3.x you can convert your text to raw bytes through:

bytes("my data", "encoding")

For example:

bytes("attack at dawn", "utf-8")

The object returned will work with outfile.write.
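
In Python 3, bytes(text, encoding) gives the same result as text.encode(encoding). A quick check, with illustrative variable names:

```python
text = "attack at dawn"

via_constructor = bytes(text, "utf-8")
via_method = text.encode("utf-8")

print(via_constructor)                # b'attack at dawn'
print(via_constructor == via_method)  # True
print(type(via_constructor))          # <class 'bytes'>
```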


Answer 4

This problem commonly occurs when switching from py2 to py3. In py2, plaintext is both a string and a byte array type. In py3, plaintext is only a string, and the method outfile.write() actually takes a byte array when outfile is opened in binary mode, so an exception is raised. Change the input to plaintext.encode('utf-8') to fix the problem. Read on if this bothers you.

In py2, the declaration for file.write made it seem like you passed in a string: file.write(str). Actually you were passing in a byte array, so you should have been reading the declaration like this: file.write(bytes). Read that way, the problem is simple: file.write(bytes) needs a bytes type, and in py3 you get bytes out of a str by converting it:

py3>> outfile.write(plaintext.encode('utf-8'))

Why did the py2 docs declare file.write took a string? Well in py2 the declaration distinction didn’t matter because:

py2>> str==bytes         #str and bytes aliased a single hybrid class in py2
True

The str-bytes class of py2 has methods/constructors that make it behave like a string class in some ways and a byte array class in others. Convenient for file.write isn’t it?:

py2>> plaintext='my string literal'
py2>> type(plaintext)
str                              #is it a string or is it a byte array? it's both!

py2>> outfile.write(plaintext)   #can use plaintext as a byte array

Why did py3 break this nice system? Well because in py2 basic string functions didn’t work for the rest of the world. Measure the length of a word with a non-ASCII character?

py2>> len('¡no')        #length of string=3, length of UTF-8 byte array=4, since with variable-length encoding each non-ASCII char takes 2-4 bytes
4                       #always gives bytes.len not str.len

All the time you thought you were asking for the len of a string in py2, you were actually getting the length of the byte array from the encoding. That ambiguity is the fundamental problem with double-duty classes. Which version of any method call do you implement?

The good news then is that py3 fixes this problem. It disentangles the str and bytes classes. The str class has string-like methods, the separate bytes class has byte array methods:

py3>> len('¡ok')       #string
3
py3>> len('¡ok'.encode('utf-8'))     #bytes
4

Hopefully knowing this helps de-mystify the issue, and makes the migration pain a little easier to bear.
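
The py3 behaviour described above can be reproduced without touching the disk. In this sketch io.BytesIO stands in for a file opened in binary mode; that substitution is an assumption made to keep the example self-contained:

```python
import io

word = '¡ok'
encoded = word.encode('utf-8')

print(len(word), len(encoded))   # 3 4

binary_file = io.BytesIO()       # stands in for a file opened in 'wb' mode
binary_file.write(encoded)       # bytes are accepted
try:
    binary_file.write(word)      # str is rejected
except TypeError as exc:
    print('TypeError:', exc)
```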


Answer 5

>>> s = bytes("s","utf-8")
>>> print(s)
b's'
>>> s = s.decode("utf-8")
>>> print(s)
s

Hopefully this is useful if you want to get rid of the annoying ‘b’ prefix. If anyone has a better idea, please suggest it or feel free to edit this answer. I’m just a newbie.


Answer 6

For unit tests based on Django’s django.test.TestCase, I changed my Python 2 syntax:

def test_view(self):
    response = self.client.get(reverse('myview'))
    self.assertIn(str(self.obj.id), response.content)
    ...

to use the Python 3 .decode('utf8') syntax:

def test_view(self):
    response = self.client.get(reverse('myview'))
    self.assertIn(str(self.obj.id), response.content.decode('utf8'))
    ...