Tag archive: parsing

Splitting on the last delimiter in a Python string?

Question: Splitting on the last delimiter in a Python string?

What’s the recommended Python idiom for splitting a string on the last occurrence of the delimiter in the string? Example:

# instead of regular split
>>> s = "a,b,c,d"
>>> s.split(",")
['a', 'b', 'c', 'd']

# ..split only on last occurrence of ',' in string (mysplit is hypothetical):
>>> s.mysplit(",", -1)
['a,b,c', 'd']

mysplit takes a second argument that is the occurrence of the delimiter to split on. As in regular list indexing, -1 means the last one, counting from the end. How can this be done?


Answer 0

Use .rsplit() or .rpartition() instead:

s.rsplit(',', 1)
s.rpartition(',')

str.rsplit() lets you specify how many times to split, while str.rpartition() only splits once but always returns a fixed number of elements (prefix, delimiter & postfix) and is faster for the single split case.

Demo:

>>> s = "a,b,c,d"
>>> s.rsplit(',', 1)
['a,b,c', 'd']
>>> s.rsplit(',', 2)
['a,b', 'c', 'd']
>>> s.rpartition(',')
('a,b,c', ',', 'd')

Both methods start splitting from the right-hand-side of the string; by giving str.rsplit() a maximum as the second argument, you get to split just the right-hand-most occurrences.
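
One difference the demo above does not show is what each method returns when the delimiter is missing. The sketch below wraps str.rpartition() in a small helper; the name split_last is hypothetical and not part of Python or of the answer.

```python
def split_last(text, sep=","):
    """Split text on the last occurrence of sep (hypothetical helper)."""
    head, found, tail = text.rpartition(sep)
    if not found:
        # No delimiter present: rpartition returned ('', '', text)
        return [text]
    return [head, tail]


print(split_last("a,b,c,d"))
print(split_last("abcd"))
# For comparison, the built-in methods on a string without the delimiter:
print("abcd".rsplit(",", 1))
print("abcd".rpartition(","))
```

With rsplit() a missing delimiter yields a one-element list, while rpartition() always yields a 3-tuple, so code that unpacks the result needs to check which case it got.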


Answer 1

You can use rsplit:

string.rsplit('delimiter', 1)[1]

This returns the part of the string after the last delimiter.


Answer 2

I just did this for fun:

    >>> s = 'a,b,c,d'
    >>> [item[::-1] for item in s[::-1].split(',', 1)][::-1]
    ['a,b,c', 'd']

Caution: see the first comment below for a case where this answer can go wrong.


Python: Removing \xa0 from a string?

Question: Python: Removing \xa0 from a string?

I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but it seems like I’m being left with a lot of \xa0 Unicode representing spaces. Is there an efficient way to remove all of them in Python 2.7, and change them into spaces? I guess the more generalized question would be, is there a way to remove Unicode formatting?

I tried using: line = line.replace(u'\xa0',' '), as suggested by another thread, but that changed the \xa0’s to u’s, so now I have “u”s everywhere instead. ):

EDIT: The problem seems to be resolved by str.replace(u'\xa0', ' ').encode('utf-8'), but just doing .encode('utf-8') without replace() seems to cause it to spit out even weirder characters, \xc2 for instance. Can anyone explain this?


Answer 0

\xa0 is actually the non-breaking space in Latin-1 (ISO 8859-1), also chr(160). You should replace it with a space.

string = string.replace(u'\xa0', u' ')

Calling .encode('utf-8') encodes the Unicode string to UTF-8, in which every code point is represented by 1 to 4 bytes. In this case, \xa0 is represented by the 2 bytes \xc2\xa0.

Read up on http://docs.python.org/howto/unicode.html.

Please note: this answer is from 2012 and Python has moved on; you should be able to use unicodedata.normalize now.
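
As a rough illustration of the two points above (replacing versus encoding), here is a minimal sketch. It is written for Python 3, which is an assumption since the question targets Python 2.7, and the sample text is made up.

```python
# Sample text containing a non-breaking space (U+00A0); invented for illustration.
text = "Dear Parent,\xa0kindly ignore it."

# Replacing swaps the character for an ordinary space.
cleaned = text.replace("\xa0", " ")

# Encoding does not remove anything: it only changes the byte representation.
encoded = "\xa0".encode("utf-8")

print(repr(cleaned))
print(encoded)  # b'\xc2\xa0'
```

This is why calling .encode('utf-8') without replace() makes \xc2 appear: it is the first of the two UTF-8 bytes for the same character.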


Answer 1

There are many useful things in Python’s unicodedata library. One of them is the .normalize() function.

Try:

new_str = unicodedata.normalize("NFKD", unicode_str)

Replace NFKD with any of the other methods listed in the link above if you don’t get the results you’re after.
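
To make the choice of normalization form more concrete, the following sketch (Python 3, with an invented sample string) compares the forms. The compatibility forms NFKD and NFKC map \xa0 to a plain space; NFKD additionally splits accented letters into a base letter plus a combining mark, which may or may not be what you want.

```python
import unicodedata

# Invented sample: an accented letter plus a non-breaking space.
text = "caf\u00e9\xa0au lait"

nfkd = unicodedata.normalize("NFKD", text)  # space fixed, accent decomposed
nfkc = unicodedata.normalize("NFKC", text)  # space fixed, accent kept composed
nfc = unicodedata.normalize("NFC", text)    # canonical form: \xa0 is left alone

print(repr(nfkd))
print(repr(nfkc))
print(repr(nfc))
```

If only the non-breaking space should change, a plain replace() is the narrower tool, since normalization also rewrites other compatibility characters.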


Answer 2

Try using .strip() at the end of your line. line.strip() worked well for me.


Answer 3

After trying several methods, to summarize it, this is how I did it. Following are two ways of avoiding/removing \xa0 characters from parsed HTML string.

Assume we have our raw html as following:

raw_html = '<p>Dear Parent, </p><p><span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span></p><p><span style="font-size: 1rem;">Thanks</span></p>'

So let’s try to clean this HTML string:

from bs4 import BeautifulSoup
raw_html = '<p>Dear Parent, </p><p><span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
text_string = BeautifulSoup(raw_html, "lxml").text
print text_string
#u'Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks'

The above code produces these \xa0 characters in the string. To remove them properly, we can use two ways.

Method # 1 (Recommended): The first one is BeautifulSoup’s get_text method with the strip argument set to True. So our code becomes:

clean_text = BeautifulSoup(raw_html, "lxml").get_text(strip=True)
print clean_text
# Dear Parent,This is a test message,kindly ignore it.Thanks

Method # 2: The other option is to use Python’s unicodedata library:

import unicodedata
text_string = BeautifulSoup(raw_html, "lxml").text
clean_text = unicodedata.normalize("NFKD",text_string)
print clean_text
# u'Dear Parent,This is a test message,kindly ignore it.Thanks'

I have also detailed these methods on this blog, which you may want to refer to.


Answer 4

Try this:

# replaces the literal four characters \xa0, not the U+00A0 character itself
string.replace('\\xa0', ' ')

Answer 5

I ran into this same problem pulling some data from a sqlite3 database with Python. The above answers didn’t work for me (not sure why), but this did: line = line.decode('ascii', 'ignore'). However, my goal was deleting the \xa0s, rather than replacing them with spaces.

I got this from this super-helpful unicode tutorial by Ned Batchelder.


Answer 6

I ended up here while googling for a problem with a non-printable character. I use MySQL UTF-8 general_ci and deal with the Polish language. For problematic strings I have to proceed as follows:

text=text.replace('\xc2\xa0', ' ')

It is just a fast workaround, and you probably should try something with the right encoding setup.


Answer 7

Try this code:

import re
re.sub(r'[^\x00-\x7F]+','','paste your string here').decode('utf-8','ignore').strip()

Answer 8

0xA0 (Unicode) is 0xC2A0 in UTF-8. .encode('utf8') will just take your Unicode 0xA0 and replace it with UTF-8’s 0xC2A0. Hence the appearance of the 0xC2 bytes. Encoding is not replacing, as you’ve probably realized by now.


Answer 9

It’s the equivalent of a space character, so strip it:

print(string.strip()) # no more xa0

Answer 10

In Beautiful Soup, you can pass get_text() the strip parameter, which strips white space from the beginning and end of the text. This will remove \xa0 or any other white space if it occurs at the start or end of the string. Beautiful Soup replaced \xa0 with an empty string, and this solved the problem for me.

mytext = soup.get_text(strip=True)

Answer 11

Generic version with a regular expression (it removes every literal \xNN escape sequence that appears in the text, such as in a repr; it does not match actual control characters):

import re
def remove_control_chars(s):
    # matches a backslash, an 'x', and the two characters that follow
    return re.sub(r'\\x..', '', s)

Answer 12

Python recognizes it as a whitespace character, so you can split without arguments and join with a normal space:

line = ' '.join(line.split())
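
A short sketch of what the split/join idiom does, assuming Python 3 str objects (where \xa0 counts as whitespace); the sample line is invented.

```python
# Invented sample with leading, doubled and ordinary whitespace.
line = "\xa0Dear Parent,\xa0\xa0kindly   ignore it.\n"

# split() with no arguments splits on any run of whitespace,
# so the join collapses every run into a single ordinary space.
collapsed = " ".join(line.split())

print(repr(collapsed))
```

Note that this is more aggressive than replace(): tabs and newlines are collapsed too, and leading or trailing whitespace disappears.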

How to parse JSON in Python?

Question: How to parse JSON in Python?

My project is currently receiving a JSON message in python which I need to get bits of information out of. For the purposes of this, let’s set it to some simple JSON in a string:

jsonStr = '{"one" : "1", "two" : "2", "three" : "3"}'

So far I’ve been generating JSON requests using a list and then json.dumps, but to do the opposite of this I think I need to use json.loads. However I haven’t had much luck with it. Could anyone provide me a snippet that would return "2" with the input of "two" in the above example?


Answer 0

Very simple:

import json
data = json.loads('{"one" : "1", "two" : "2", "three" : "3"}')
print data['two']

Answer 1

Sometimes your json is not a string. For example if you are getting a json from a url like this:

j = urllib2.urlopen('http://site.com/data.json')

you will need to use json.load, not json.loads:

j_obj = json.load(j)

(it is easy to forget: the ‘s’ is for ‘string’)
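
The load/loads distinction can be sketched without a network call by using io.StringIO as a stand-in for any file-like object (an open file or an HTTP response would behave the same way for this purpose). This is Python 3 and the variable names are illustrative.

```python
import io
import json

json_str = '{"one": "1", "two": "2", "three": "3"}'

# json.loads takes the JSON text itself (a string).
from_string = json.loads(json_str)

# json.load takes an object with a .read() method, such as an open file.
file_like = io.StringIO(json_str)
from_file = json.load(file_like)

print(from_string["two"])
print(from_file == from_string)
```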


Answer 2

For URL or file, use json.load(). For string with .json content, use json.loads().

#! /usr/bin/python

import json
# from pprint import pprint

json_file = 'my_cube.json'
cube = '1'

with open(json_file) as json_data:
    data = json.load(json_data)

# pprint(data)

print "Dimension: ", data['cubes'][cube]['dim']
print "Measures:  ", data['cubes'][cube]['meas']

Answer 3

The following is a simple example that may help you:

json_string = """
{
    "pk": 1, 
    "fa": "cc.ee", 
    "fb": {
        "fc": "", 
        "fd_id": "12345"
    }
}"""

import json
data = json.loads(json_string)
if data["fa"] == "cc.ee":
    data["fb"]["new_key"] = "cc.ee was present!"

print json.dumps(data)

The output for the above code will be:

{"pk": 1, "fb": {"new_key": "cc.ee was present!", "fd_id": "12345", 
 "fc": ""}, "fa": "cc.ee"}

Note that you can set the indent argument of dumps to print it like so (for example, when using print json.dumps(data, indent=4)):

{
    "pk": 1, 
    "fb": {
        "new_key": "cc.ee was present!", 
        "fd_id": "12345", 
        "fc": ""
    }, 
    "fa": "cc.ee"
}

Answer 4

You can use either the json or the ast Python module:

Using json :
=============

import json
jsonStr = '{"one" : "1", "two" : "2", "three" : "3"}'
json_data = json.loads(jsonStr)
print(f"json_data: {json_data}")
print(f"json_data['two']: {json_data['two']}")

Output:
json_data: {'one': '1', 'two': '2', 'three': '3'}
json_data['two']: 2




Using ast:
==========

import ast
jsonStr = '{"one" : "1", "two" : "2", "three" : "3"}'
json_dict = ast.literal_eval(jsonStr)
print(f"json_dict: {json_dict}")
print(f"json_dict['two']: {json_dict['two']}")

Output:
json_dict: {'one': '1', 'two': '2', 'three': '3'}
json_dict['two']: 2
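
The two approaches agree on the sample above only because that string happens to be valid as both JSON and a Python literal. A small sketch of where they diverge, using an invented payload with JSON's true and null:

```python
import ast
import json

# Invented payload: valid JSON, but not a valid Python literal.
json_str = '{"active": true, "parent": null, "count": 3}'

data = json.loads(json_str)

try:
    ast.literal_eval(json_str)
    literal_eval_ok = True
except ValueError:
    # 'true' and 'null' are plain names to Python, not literals
    literal_eval_ok = False

print(data)
print(literal_eval_ok)
```

For input that is actually JSON, json.loads is therefore the safer default; ast.literal_eval is meant for Python literal syntax.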

Pandas read_csv low_memory and dtype options

Question: Pandas read_csv low_memory and dtype options

When calling

df = pd.read_csv('somefile.csv')

I get:

/Users/josh/anaconda/envs/py27/lib/python2.7/site-packages/pandas/io/parsers.py:1130: DtypeWarning: Columns (4,5,7,16) have mixed types. Specify dtype option on import or set low_memory=False.

Why is the dtype option related to low_memory, and why would making it False help with this problem?


Answer 0

The deprecated low_memory option

The low_memory option is not properly deprecated, but it should be, since it does not actually do anything differently[source]

The reason you get this low_memory warning is because guessing dtypes for each column is very memory demanding. Pandas tries to determine what dtype to set by analyzing the data in each column.

Dtype Guessing (very bad)

Pandas can only determine what dtype a column should have once the whole file is read. This means nothing can really be parsed before the whole file is read unless you risk having to change the dtype of that column when you read the last value.

Consider the example of one file which has a column called user_id. It contains 10 million rows where the user_id is always numbers. Since pandas cannot know it is only numbers, it will probably keep it as the original strings until it has read the whole file.

Specifying dtypes (should always be done)

Adding

dtype={'user_id': int}

to the pd.read_csv() call tells pandas, as soon as it starts reading the file, that this column contains only integers.

Also worth noting is that if the last line in the file would have "foobar" written in the user_id column, the loading would crash if the above dtype was specified.

Example of broken data that breaks when dtypes are defined

import pandas as pd
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO


csvdata = """user_id,username
1,Alice
3,Bob
foobar,Caesar"""
sio = StringIO(csvdata)
pd.read_csv(sio, dtype={"user_id": int, "username": "string"})

ValueError: invalid literal for long() with base 10: 'foobar'

dtypes are typically a numpy thing, read more about them here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html

What dtypes exist?

We have access to numpy dtypes: float, int, bool, timedelta64[ns] and datetime64[ns]. Note that the numpy date/time dtypes are not time zone aware.

Pandas extends this set of dtypes with its own:

‘datetime64[ns, <tz>]’, which is a time zone aware timestamp.

‘category’, which is essentially an enum (strings represented by integer keys to save memory).

‘period[<freq>]’. Not to be confused with a timedelta, these objects are actually anchored to specific time periods.

‘Sparse’, ‘Sparse[int]’, ‘Sparse[float]’ are for sparse data, or ‘data that has a lot of holes in it’. Instead of saving the NaN or None in the dataframe, it omits the objects, saving space.

‘Interval’ is a topic of its own, but its main use is for indexing.

‘Int8’, ‘Int16’, ‘Int32’, ‘Int64’, ‘UInt8’, ‘UInt16’, ‘UInt32’, ‘UInt64’ are all pandas-specific integers that are nullable, unlike the numpy variant.

‘string’ is a specific dtype for working with string data and gives access to the .str attribute on the series.

‘boolean’ is like the numpy ‘bool’ but it also supports missing data.

Read the complete reference here:

Pandas dtype reference

Gotchas, caveats, notes

Setting dtype=object will silence the above warning, but will not make it more memory efficient, only process efficient if anything.

Setting dtype=unicode will not do anything, since to numpy, a unicode is represented as object.

Usage of converters

@sparrow correctly points out the usage of converters to avoid pandas blowing up when encountering 'foobar' in a column specified as int. I would like to add that converters are really heavy and inefficient to use in pandas and should be used as a last resort. This is because the read_csv process is a single process.

CSV files can be processed line by line and thus can be processed by multiple converters in parallel more efficiently by simply cutting the file into segments and running multiple processes, something that pandas does not support. But this is a different story.


Answer 1

Try:

dashboard_df = pd.read_csv(p_file, sep=',', error_bad_lines=False, index_col=False, dtype='unicode')

According to the pandas documentation:

dtype : Type name or dict of column -> type

As for low_memory, it’s True by default and isn’t yet documented. I don’t think it’s relevant though. The error message is generic, so you shouldn’t need to mess with low_memory anyway. Hope this helps, and let me know if you have further problems.


Answer 2

df = pd.read_csv('somefile.csv', low_memory=False)

This should solve the issue. I got exactly the same error when reading 1.8M rows from a CSV.


Answer 3

As mentioned earlier by firelynx, if dtype is explicitly specified and there is mixed data that is not compatible with that dtype, then loading will crash. I used a converter like this as a workaround to change the values with an incompatible data type so that the data could still be loaded.

import numpy as np
import pandas as pd

def conv(val):
    # empty fields become 0
    if not val:
        return 0
    try:
        return np.float64(val)
    except ValueError:
        # values that are not valid numbers also become 0
        return np.float64(0)

df = pd.read_csv(csv_file, converters={'COL_A': conv, 'COL_B': conv})

Answer 4

I had a similar issue with a ~400MB file. Setting low_memory=False did the trick for me. Do the simple things first: I would check that your dataframe isn’t bigger than your system memory, then reboot and clear the RAM before proceeding. If you’re still running into errors, it’s worth making sure your .csv file is OK; take a quick look in Excel and make sure there’s no obvious corruption. Broken original data can wreak havoc…


Answer 5

I was facing a similar issue when processing a huge csv file (6 million rows). I had three issues:

1. The file contained strange characters (fixed using encoding).
2. The datatype was not specified (fixed using the dtype property).
3. Using the above, I still faced an issue related to the file_format, which could not be defined based on the filename (fixed using try .. except ..).

import pandas as pd
from pathlib import Path

df = pd.read_csv(csv_file,sep=';', encoding = 'ISO-8859-1',
                 names=['permission','owner_name','group_name','size','ctime','mtime','atime','filename','full_filename'],
                 dtype={'permission':str,'owner_name':str,'group_name':str,'size':str,'ctime':object,'mtime':object,'atime':object,'filename':str,'full_filename':str,'first_date':object,'last_date':object})

try:
    # take the extension without the leading dot
    df['file_format'] = [Path(f).suffix[1:] for f in df.filename.tolist()]
except Exception:
    df['file_format'] = ''

Answer 6

It worked for me with low_memory = False while importing a DataFrame. That is the only change I needed:

df = pd.read_csv('export4_16.csv',low_memory=False)

What’s the best practice for using a settings file in Python? [closed]

Question: What’s the best practice for using a settings file in Python? [closed]

I have a command line script that I run with a lot of arguments. I have now come to a point where I have too many arguments, and I want to have some arguments in dictionary form too.

So in order to simplify things I would like to run the script with a settings file instead. I don’t really know what libraries to use for the parsing of the file. What’s the best practice for doing this? I could of course hammer something out myself, but if there is some library for this, I’m all ears.

A few ‘demands’:

  • Rather than using pickle I would like it to be a straightforward text file that can easily be read and edited.
  • I want to be able to add dictionary-like data in it, i.e., some form of nesting should be supported.

A simplified pseudo example file:

truck:
    color: blue
    brand: ford
city: new york
cabriolet:
    color: black
    engine:
        cylinders: 8
        placement: mid
    doors: 2

Answer 0

You can have a regular Python module, say config.py, like this:

truck = dict(
    color = 'blue',
    brand = 'ford',
)
city = 'new york'
cabriolet = dict(
    color = 'black',
    engine = dict(
        cylinders = 8,
        placement = 'mid',
    ),
    doors = 2,
)

and use it like this:

import config
print config.truck['color']  

Answer 1

The sample config you provided is actually valid YAML. In fact, YAML meets all of your demands, is implemented in a large number of languages, and is extremely human friendly. I would highly recommend you use it. The PyYAML project provides a nice Python module that implements YAML.

Using the yaml module is extremely simple:

import yaml
# the with block closes the file once the config is loaded
with open("path/to/config.yml") as f:
    config = yaml.safe_load(f)

Answer 2

I found this the most useful and easy to use: https://wiki.python.org/moin/ConfigParserExamples

You just create a “myfile.ini” like:

[SectionOne]
Status: Single
Name: Derek
Value: Yes
Age: 30
Single: True

[SectionTwo]
FavoriteColor=Green
[SectionThree]
FamilyName: Johnson

[Others]
Route: 66

And retrieve the data like:

>>> import ConfigParser
>>> Config = ConfigParser.ConfigParser()
>>> Config
<ConfigParser.ConfigParser instance at 0x00BA9B20>
>>> Config.read("myfile.ini")
['c:\\tomorrow.ini']
>>> Config.sections()
['Others', 'SectionThree', 'SectionOne', 'SectionTwo']
>>> Config.options('SectionOne')
['Status', 'Name', 'Value', 'Age', 'Single']
>>> Config.get('SectionOne', 'Status')
'Single'
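
The session above uses the Python 2 module name ConfigParser. A rough Python 3 equivalent is sketched below, assuming the renamed configparser module and reading the INI text from a string (read_string) so the example needs no file on disk; the trimmed-down INI content is adapted from the one above.

```python
import configparser

# Shortened version of the INI content shown above.
INI_TEXT = """
[SectionOne]
Status: Single
Name: Derek
Age: 30
Single: True

[Others]
Route: 66
"""

config = configparser.ConfigParser()
config.read_string(INI_TEXT)

status = config.get("SectionOne", "Status")         # plain string
age = config.getint("SectionOne", "Age")            # converted to int
single = config.getboolean("SectionOne", "Single")  # converted to bool
route = config["Others"].getint("Route")            # mapping-style access

print(config.sections())
print(status, age, single, route)
```

INI files are flat (sections containing key/value pairs), so the nested engine settings from the question would need some workaround in this format.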

Answer 3

YAML and JSON are the simplest and most commonly used file formats to store settings/config. PyYAML can be used to parse YAML. JSON has been part of the Python standard library since 2.6. YAML is a superset of JSON. JSON will cover most use cases, except multi-line strings, where escaping is required. YAML takes care of these cases too.

>>> import json
>>> config = {'handler' : 'adminhandler.py', 'timeoutsec' : 5 }
>>> json.dump(config, open('/tmp/config.json', 'w'))
>>> json.load(open('/tmp/config.json'))   
{u'handler': u'adminhandler.py', u'timeoutsec': 5}
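
Since the question asks for nested, dictionary-like settings, here is a minimal sketch of the same JSON round trip using the truck/cabriolet data from the question. It assumes Python 3 and writes to a temporary directory; the file name config.json is arbitrary.

```python
import json
import os
import tempfile

# The nested settings from the question, as a plain dict.
settings = {
    "truck": {"color": "blue", "brand": "ford"},
    "city": "new york",
    "cabriolet": {
        "color": "black",
        "engine": {"cylinders": 8, "placement": "mid"},
        "doors": 2,
    },
}

path = os.path.join(tempfile.mkdtemp(), "config.json")

# indent=4 keeps the file readable and easy to edit by hand.
with open(path, "w") as f:
    json.dump(settings, f, indent=4)

with open(path) as f:
    loaded = json.load(f)

cylinders = loaded["cabriolet"]["engine"]["cylinders"]
print(cylinders)
```

One trade-off to keep in mind: JSON has no comment syntax, which some people miss in hand-edited settings files.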

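Pydantic - Data parsing and validation using Python type hints

Data validation and settings management using Python type hints.

Fast and extensible, pydantic plays nicely with your linters/IDE/brain. Define how data should be in pure, canonical Python 3.6+; validate it with pydantic.

Help

See the documentation for more details.

Installation

Install using pip install -U pydantic or conda install pydantic -c conda-forge. For more installation options to make pydantic even faster, see the Install section in the documentation.

A Simple Example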

from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel

class User(BaseModel):
    id: int
    name = 'John Doe'
    signup_ts: Optional[datetime] = None
    friends: List[int] = []

external_data = {'id': '123', 'signup_ts': '2017-06-01 12:22', 'friends': [1, '2', b'3']}
user = User(**external_data)
print(user)
#> User id=123 name='John Doe' signup_ts=datetime.datetime(2017, 6, 1, 12, 22) friends=[1, 2, 3]
print(user.id)
#> 123

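Contributing

For guidance on setting up a development environment and how to make a contribution to pydantic, see Contributing to Pydantic.

Reporting a Security Vulnerability

See our security policy.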

How do I parse a string to a float or int?

Question: How do I parse a string to a float or int?

In Python, how can I parse a numeric string like "545.2222" to its corresponding float value, 545.2222? Or parse the string "31" to an integer, 31?

I just want to know how to parse a float str to a float, and (separately) an int str to an int.


Answer 0

>>> a = "545.2222"
>>> float(a)
545.22220000000004
>>> int(float(a))
545

Answer 1

def num(s):
    try:
        return int(s)
    except ValueError:
        return float(s)
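
A few calls may help show how this int-first, float-as-fallback function behaves. The function is repeated so the snippet runs on its own; the inputs are invented.

```python
def num(s):
    try:
        return int(s)
    except ValueError:
        return float(s)


print(num("31"))        # int() succeeds, so an int comes back
print(num("545.2222"))  # int() fails, float() succeeds
print(num(" 7 "))       # surrounding whitespace is accepted by int()
print(num("1e3"))       # exponent notation is only understood by float()

try:
    num("abc")
except ValueError as exc:
    # Neither conversion works, so the ValueError from float() propagates
    print("not a number:", exc)
```

The caller still has to handle ValueError for input that is not numeric at all.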

Answer 2

Python method to check if a string is a float:

def is_float(value):
  try:
    float(value)
    return True
  except:
    return False

此功能的更长更准确的名称可能是: is_convertible_to_float(value)

什么是Python中的浮点数,哪些不是浮点数,可能会让您感到惊讶:

val                   is_float(val) Note
--------------------  ----------   --------------------------------
""                    False        Blank string
"127"                 True         Passed string
True                  True         Pure sweet Truth
"True"                False        Vile contemptible lie
False                 True         So false it becomes true
"123.456"             True         Decimal
"      -127    "      True         Spaces trimmed
"\t\n12\r\n"          True         whitespace ignored
"NaN"                 True         Not a number
"NaNanananaBATMAN"    False        I am Batman
"-iNF"                True         Negative infinity
"123.E4"              True         Exponential notation
".1"                  True         mantissa only
"1,234"               False        Commas gtfo
u'\x30'               True         Unicode is fine.
"NULL"                False        Null is not special
0x3fade               True         Hexadecimal
"6e7777777777777"     True         Shrunk to infinity
"1.797693e+308"       True         This is max value
"infinity"            True         Same as inf
"infinityandBEYOND"   False        Extra characters wreck it
"12.34.56"            False        Only one dot allowed
u'四'                 False        Japanese '4' is not a float.
"#56"                 False        Pound sign
"56%"                 False        Percent of what?
"0E0"                 True         Exponential, move dot 0 places
0**0                  True         0___0  Exponentiation
"-5e-5"               True         Raise to a negative number
"+1e1"                True         Plus is OK with exponent
"+1e1^5"              False        Fancy exponent not interpreted
"+1e1.3"              False        No decimals in exponent
"-+1"                 False        Make up your mind
"(1)"                 False        Parenthesis is bad

您以为知道什么数字?你不像你想的那样好!并不奇怪。

不要在对生命至关重要的软件上使用此代码!

用这种方式捕获广泛的异常,杀死金丝雀和吞噬异常会产生很小的机会,即有效的float字符串将返回false。该float(...)行代码可以失败的任何什么都没有做的字符串的内容一千个理由。但是,如果您使用Python这样的鸭子式原型语言来编写至关重要的软件,那么您将遇到更大的问题。

Python method to check if a string is a float:

def is_float(value):
  try:
    float(value)
    return True
  except:
    return False

A longer and more accurate name for this function could be: is_convertible_to_float(value)

What is, and is not a float in Python may surprise you:

val                   is_float(val) Note
--------------------  ----------   --------------------------------
""                    False        Blank string
"127"                 True         Passed string
True                  True         Pure sweet Truth
"True"                False        Vile contemptible lie
False                 True         So false it becomes true
"123.456"             True         Decimal
"      -127    "      True         Spaces trimmed
"\t\n12\r\n"          True         whitespace ignored
"NaN"                 True         Not a number
"NaNanananaBATMAN"    False        I am Batman
"-iNF"                True         Negative infinity
"123.E4"              True         Exponential notation
".1"                  True         mantissa only
"1,234"               False        Commas gtfo
u'\x30'               True         Unicode is fine.
"NULL"                False        Null is not special
0x3fade               True         Hexadecimal
"6e7777777777777"     True         Shrunk to infinity
"1.797693e+308"       True         This is max value
"infinity"            True         Same as inf
"infinityandBEYOND"   False        Extra characters wreck it
"12.34.56"            False        Only one dot allowed
u'四'                 False        Japanese '4' is not a float.
"#56"                 False        Pound sign
"56%"                 False        Percent of what?
"0E0"                 True         Exponential, move dot 0 places
0**0                  True         0___0  Exponentiation
"-5e-5"               True         Raise to a negative number
"+1e1"                True         Plus is OK with exponent
"+1e1^5"              False        Fancy exponent not interpreted
"+1e1.3"              False        No decimals in exponent
"-+1"                 False        Make up your mind
"(1)"                 False        Parenthesis is bad

You think you know what numbers are? You are not as good as you think! No big surprise.

Don’t use this code on life-critical software!

Catching broad exceptions this way, killing canaries and gobbling the exception, creates a tiny chance that a valid float string will return False. The float(...) line of code can fail for any of a thousand reasons that have nothing to do with the contents of the string. But if you're writing life-critical software in a duck-typed prototyping language like Python, then you've got much larger problems.
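
One way to reduce the risk described above is to catch only the exceptions that float() is expected to raise for bad input. The following is a sketch rather than a drop-in replacement; the choice of exception types is an assumption about which failures should count as "not a float".

```python
def is_convertible_to_float(value):
    """Return True if float(value) succeeds, False for the expected failures."""
    try:
        float(value)
    except (TypeError, ValueError, OverflowError):
        # ValueError: malformed text such as "1,234"
        # TypeError: unsupported objects such as None or a list
        # OverflowError: an int too large to be represented as a float
        return False
    return True
```

Anything else, such as KeyboardInterrupt or MemoryError, now propagates to the caller instead of being reported as False.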


Answer 3

This is another method which deserves to be mentioned here, ast.literal_eval:

This can be used for safely evaluating strings containing Python expressions from untrusted sources without the need to parse the values oneself.

That is, a safe ‘eval’

>>> import ast
>>> ast.literal_eval("545.2222")
545.2222
>>> ast.literal_eval("31")
31
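
Because ast.literal_eval also accepts strings, lists, dicts and other literals, a caller who only wants numbers may want to check the result type. The helper below is a hypothetical wrapper (the name parse_number_literal is not from the original answer) that sketches this check.

```python
import ast


def parse_number_literal(text):
    """Parse text as a Python int or float literal; raise ValueError otherwise."""
    try:
        # strip() avoids indentation errors from leading whitespace.
        value = ast.literal_eval(text.strip())
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"not a numeric literal: {text!r}") from exc
    # bool is a subclass of int, so it is excluded explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"not a numeric literal: {text!r}")
    return value
```

Since the input is treated as Python source, prefixed literals such as "0x1f" are accepted as well, which may or may not be what you want.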

Answer 4

float(x) if '.' in x else int(x)

Answer 5

Localization and commas

You should consider the possibility of commas in the string representation of a number, for cases like float("545,545.2222") which throws an exception. Instead, use methods in locale to convert the strings to numbers and interpret commas correctly. The locale.atof method converts to a float in one step once the locale has been set for the desired number convention.

Example 1 — United States number conventions

In the United States and the UK, commas can be used as a thousands separator. In this example with American locale, the comma is handled properly as a separator:

>>> import locale
>>> a = u'545,545.2222'
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.atof(a)
545545.2222
>>> int(locale.atof(a))
545545
>>>

Example 2 — European number conventions

In the majority of countries of the world, commas are used for decimal marks instead of periods. In this example with French locale, the comma is correctly handled as a decimal mark:

>>> import locale
>>> b = u'545,2222'
>>> locale.setlocale(locale.LC_ALL, 'fr_FR')
'fr_FR'
>>> locale.atof(b)
545.2222

The method locale.atoi is also available, but its argument should be a string that represents an integer.
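
Note that locale.setlocale changes a process-wide setting. A hedged sketch of a helper that switches the numeric locale only for the duration of one conversion is shown below; the function name parse_localized_float is hypothetical, and the locale names you can pass depend on what is installed on your system.

```python
import locale


def parse_localized_float(text, locale_name):
    """Parse text with locale.atof under locale_name, then restore the old locale."""
    previous = locale.setlocale(locale.LC_NUMERIC)  # query the current setting
    try:
        # Raises locale.Error if locale_name is not available on this system.
        locale.setlocale(locale.LC_NUMERIC, locale_name)
        return locale.atof(text)
    finally:
        locale.setlocale(locale.LC_NUMERIC, previous)
```

For example, parse_localized_float("545,545.2222", "en_US.UTF-8") should give 545545.2222 where that locale is installed. The helper is still not thread-safe, because the locale is global to the process.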


Answer 6

If you aren’t averse to third-party modules, you could check out the fastnumbers module. It provides a function called fast_real that does exactly what this question is asking for and does it faster than a pure-Python implementation:

>>> from fastnumbers import fast_real
>>> fast_real("545.2222")
545.2222
>>> type(fast_real("545.2222"))
float
>>> fast_real("31")
31
>>> type(fast_real("31"))
int

Answer 7

Users codelogic and harley are correct, but keep in mind that if you know the string is an integer (for example, 545) you can call int("545") without first casting to float.

If your strings are in a list, you could use the map function as well.

>>> x = ["545.0", "545.6", "999.2"]
>>> list(map(float, x))  # on Python 3, map returns an iterator, hence list()
[545.0, 545.6, 999.2]
>>>

This only works well if all the strings represent the same type.
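
When a list mixes integer and float strings, one option is to decide per element instead of mapping a single type over the whole list. This is a small sketch; the helper name to_number is an assumption and not part of the original answer.

```python
def to_number(text):
    """Return an int when text is an integer string, otherwise a float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


values = ["545", "545.6", "999.2"]
numbers = [to_number(v) for v in values]
print(numbers)  # [545, 545.6, 999.2]
```

A string that is neither an int nor a float still raises ValueError from the float() call.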


Answer 8

In Python, how can I parse a numeric string like "545.2222" to its corresponding float value, 545.2222? Or parse the string "31" to an integer, 31? I just want to know how to parse a float string to a float, and (separately) an int string to an int.

It’s good that you ask to do these separately. If you’re mixing them, you may be setting yourself up for problems later. The simple answer is:

"545.2222" to float:

>>> float("545.2222")
545.2222

"31" to an integer:

>>> int("31")
31

Other conversions: ints to and from strings and literals

Conversions from various bases require that you know the base in advance (10 is the default). Note that you can prefix the strings with what Python expects for its literals (see below) or leave the prefix off:

>>> int("0b11111", 2)
31
>>> int("11111", 2)
31
>>> int('0o37', 8)
31
>>> int('37', 8)
31
>>> int('0x1f', 16)
31
>>> int('1f', 16)
31

If you don’t know the base in advance, but you do know they will have the correct prefix, Python can infer this for you if you pass 0 as the base:

>>> int("0b11111", 0)
31
>>> int('0o37', 0)
31
>>> int('0x1f', 0)
31
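
One caveat with base 0 is that decimal strings with leading zeros, such as "007", are rejected, just like the corresponding literals. The sketch below falls back to base 10 in that case; the helper name parse_int_auto_base and the fallback policy are assumptions for illustration.

```python
def parse_int_auto_base(text):
    """Parse text using its 0b/0o/0x prefix; fall back to plain decimal.

    int(text, 0) rejects decimal strings with leading zeros such as "007",
    so those are retried in base 10 (an assumption about what the caller wants).
    """
    try:
        return int(text, 0)
    except ValueError:
        return int(text, 10)
```

If the text is not an integer in any of these forms, the ValueError from the base 10 attempt propagates to the caller.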

Non-decimal integer literals from other bases

If your motivation is to have your own code clearly represent hard-coded specific values, however, you may not need to convert from the bases at all: you can let Python do it for you automatically with the correct syntax.

You can use the appropriate prefixes to get automatic conversion to integers with the following literals. These are valid in Python 2.6+ and Python 3:

Binary, prefix 0b

>>> 0b11111
31

Octal, prefix 0o

>>> 0o37
31

Hexadecimal, prefix 0x

>>> 0x1f
31

This can be useful when describing binary flags, file permissions in code, or hex values for colors. For example (note that there are no quotes):

>>> 0b10101 # binary flags
21
>>> 0o755 # read, write, execute perms for owner, read & ex for group & others
493
>>> 0xffffff # the color, white, max values for red, green, and blue
16777215

Making ambiguous Python 2 octals compatible with Python 3

If you see an integer literal that starts with a 0 in Python 2, it uses the (deprecated) octal syntax.

>>> 037
31

This is bad because the value looks like it should be 37. So in Python 3, it raises a SyntaxError (the exact message varies between versions):

>>> 037
  File "<stdin>", line 1
    037
      ^
SyntaxError: invalid token

Convert your Python 2 octals to octals that work in both 2 and 3 with the 0o prefix:

>>> 0o37
31

Answer 9

The question seems a little bit old, but let me suggest a function, parseStr, which does something similar: it returns an integer or a float, and if a given ASCII string cannot be converted to either of them, it returns the string untouched. The code can of course be adjusted to do only what you want:

   >>> import string
   >>> parseStr = lambda x: x.isalpha() and x or x.isdigit() and \
   ...                      int(x) or x.isalnum() and x or \
   ...                      len(set(string.punctuation).intersection(x)) == 1 and \
   ...                      x.count('.') == 1 and float(x) or x
   >>> parseStr('123')
   123
   >>> parseStr('123.3')
   123.3
   >>> parseStr('3HC1')
   '3HC1'
   >>> parseStr('12.e5')
   1200000.0
   >>> parseStr('12$5')
   '12$5'
   >>> parseStr('12.2.2')
   '12.2.2'
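
The chained and/or expression above has a subtle edge case: parseStr('0') returns the string '0', because int('0') is falsy and the chain moves on. A try/except version avoids this. The sketch below is an alternative, not the author's code, and it is deliberately looser: strings such as '1e5', '-1.5', 'inf' or 'nan' become floats here, whereas the original leaves them as strings.

```python
def parse_str(text):
    """Return an int or float if text parses as one, otherwise text unchanged."""
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            continue  # try the next converter
    return text
```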

Answer 10

float("545.2222") and int(float("545.2222"))


Answer 11

I use this function for that

import ast

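def parse_str(s):
   try:
      # literal_eval only accepts Python literals, so no arbitrary code is run
      return ast.literal_eval(str(s))
   except (ValueError, SyntaxError):
      # not a valid literal: signal failure with None
      return None

It will convert the string to the type it represents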

value = parse_str('1')  # Returns Integer
value = parse_str('1.5')  # Returns Float

Answer 12

The YAML parser can help you figure out what datatype your string represents. Use yaml.safe_load() (from the third-party PyYAML package; plain yaml.load() requires an explicit Loader in current releases), and then use type(result) to test for the type:

>>> import yaml

>>> a = "545.2222"
>>> result = yaml.safe_load(a)
>>> result
545.2222
>>> type(result)
<class 'float'>

>>> b = "31"
>>> result = yaml.safe_load(b)
>>> result
31
>>> type(result)
<class 'int'>

>>> c = "HI"
>>> result = yaml.safe_load(c)
>>> result
'HI'
>>> type(result)
<class 'str'>

Answer 13

def get_int_or_float(v):
    number_as_float = float(v)
    number_as_int = int(number_as_float)
    return number_as_int if number_as_float == number_as_int else number_as_float
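
With the function above, int(number_as_float) raises OverflowError for "inf" and ValueError for "nan". A variant that sidesteps this is sketched below using float.is_integer(); it is an alternative under the assumption that infinities and NaN should be returned as floats. Like the original, it goes through float, so integer strings with more digits than a float can hold lose precision.

```python
def get_int_or_float(text):
    """Return an int when the parsed value is integral, otherwise a float."""
    number = float(text)
    # is_integer() is False for inf and nan, so int() is never called on them.
    return int(number) if number.is_integer() else number
```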

Answer 14

def num(s):
    """num(s)
    num(3),num(3.7)-->3
    num('3')-->3, num('3.7')-->3.7
    num('3,700')-->ValueError
    num('3a'),num('a3'),-->ValueError
    num('3e4') --> 30000.0
    """
    try:
        return int(s)
    except ValueError:
        try:
            return float(s)
        except ValueError:
            raise ValueError('argument is not a string of number')

Answer 15

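You need to take rounding into account to do this properly.

That is, int(5.1) => 5 and int(5.6) => 5, which is wrong because it should be 6, so we do int(5.6 + 0.5) => 6.

def convert(n):
    try:
        return int(n)
    except ValueError:
        # n is a string such as "5.6": parse it as a float first, then round half up
        # (this trick is only correct for non-negative values)
        return int(float(n) + 0.5)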
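
The int(x + 0.5) trick misbehaves for negative values because int() truncates toward zero, and the built-in round() rounds ties to the nearest even number. If "round half away from zero" is what you need, the decimal module can express it directly. This is a sketch; the helper name round_half_up is an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(text):
    """Round a numeric string to the nearest int, with ties going away from zero."""
    return int(Decimal(text).quantize(Decimal("1"), rounding=ROUND_HALF_UP))
```

Parsing the string with Decimal also avoids binary floating-point representation issues for inputs such as "2.5".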

Answer 16

I am surprised nobody mentioned regex, because sometimes a string must be prepared and normalized before it is cast to a number:

import re
def parseNumber(value, as_int=False):
    try:
        # strip everything except digits, the decimal point and the minus sign
        number = float(re.sub(r'[^.\-\d]', '', value))
        if as_int:
            return int(number + 0.5)
        else:
            return number
    except ValueError:
        return float('nan')  # or None if you wish

usage:

parseNumber('13,345')
> 13345.0

parseNumber('- 123 000')
> -123000.0

parseNumber('99999\n')
> 99999.0

And by the way, here is something to verify that you have a number:

import numbers
def is_number(value):
    return isinstance(value, numbers.Number)
    # will work with int, float, long, Decimal

Answer 17

To typecast in Python, use the constructor function of the target type, passing the string (or whatever value you are trying to cast) as a parameter.

For example:

>>>float("23.333")
   23.333

Behind the scenes, when the argument is not a string or a number, Python calls the object's __float__ method, which should return a float representation of the object. This is especially powerful, as you can define your own types (using classes) with a __float__ method so that they can be cast to a float using float(myobject).
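
As an illustration of the __float__ hook, here is a minimal sketch with a hypothetical Temperature class (the class and its attribute are invented for this example):

```python
class Temperature:
    """Hypothetical example type that can present itself as a number."""

    def __init__(self, celsius):
        self.celsius = celsius

    def __float__(self):
        # float(obj) calls this hook for objects that are not strings or numbers.
        return float(self.celsius)

    def __int__(self):
        # int(obj) uses this hook in the same way.
        return int(self.celsius)


reading = Temperature("21.5")
print(float(reading))
```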


Answer 18

This is a corrected version of https://stackoverflow.com/a/33017514/5973334

This will try to parse a string and return either an int or a float, depending on what the string represents. It might raise parsing exceptions or have some unexpected behaviour.

  def get_int_or_float(v):
        number_as_float = float(v)
        number_as_int = int(number_as_float)
        return number_as_int if number_as_float == number_as_int else number_as_float

Answer 19

Pass your string to this function:

def string_to_number(text):
  # "text" instead of "str" so the built-in str type is not shadowed
  if "." in text:
    try:
      res = float(text)
    except ValueError:
      res = text
  elif text.isdigit():
    res = int(text)
  else:
    res = text
  return res

It will return int, float or string depending on what was passed.

string that is an int

print(type(string_to_number("124")))
<class 'int'>

string that is a float

print(type(string_to_number("12.4")))
<class 'float'>

string that is a string

print(type(string_to_number("hello")))
<class 'str'>

string that looks like a float

print(type(string_to_number("hel.lo")))
<class 'str'>

Answer 20

Use:

def num(s):
    for each in s:
        # the try block sits inside the loop so that one float does not end the generator
        try:
            yield int(each)
        except ValueError:
            yield float(each)
a = num(["123.55","345","44"])
print(next(a))
print(next(a))

This is the most Pythonic way I could come up with.


Answer 21

Handles hex, octal, binary, decimal, and float

This solution will handle all of the string conventions for numbers (all that I know about).

def to_number(n):
    ''' Convert any number representation to a number 
    This covers: float, decimal, hex, and octal numbers.
    '''

    try:
        return int(str(n), 0)
    except:
        try:
            # python 3 doesn't accept "010" as a valid octal.  You must use the
            # '0o' prefix
            return int('0o' + n, 0)
        except:
            return float(n)

This test case output illustrates what I’m talking about.

======================== CAPTURED OUTPUT =========================
to_number(3735928559)   = 3735928559 == 3735928559
to_number("0xFEEDFACE") = 4277009102 == 4277009102
to_number("0x0")        =          0 ==          0
to_number(100)          =        100 ==        100
to_number("42")         =         42 ==         42
to_number(8)            =          8 ==          8
to_number("0o20")       =         16 ==         16
to_number("020")        =         16 ==         16
to_number(3.14)         =       3.14 ==       3.14
to_number("2.72")       =       2.72 ==       2.72
to_number("1e3")        =     1000.0 ==       1000
to_number(0.001)        =      0.001 ==      0.001
to_number("0xA")        =         10 ==         10
to_number("012")        =         10 ==         10
to_number("0o12")       =         10 ==         10
to_number("0b01010")    =         10 ==         10
to_number("10")         =         10 ==         10
to_number("10.0")       =       10.0 ==         10
to_number("1e1")        =       10.0 ==         10

Here is the test:

import unittest

class test_to_number(unittest.TestCase):

    def test_hex(self):
        # All of the following should be converted to an integer
        #
        values = [

                 #          HEX
                 # ----------------------
                 # Input     |   Expected
                 # ----------------------
                (0xDEADBEEF  , 3735928559), # Hex
                ("0xFEEDFACE", 4277009102), # Hex
                ("0x0"       ,          0), # Hex

                 #        Decimals
                 # ----------------------
                 # Input     |   Expected
                 # ----------------------
                (100         ,        100), # Decimal
                ("42"        ,         42), # Decimal
            ]



        values += [
                 #        Octals
                 # ----------------------
                 # Input     |   Expected
                 # ----------------------
                (0o10        ,          8), # Octal
                ("0o20"      ,         16), # Octal
                ("020"       ,         16), # Octal
            ]


        values += [
                 #        Floats
                 # ----------------------
                 # Input     |   Expected
                 # ----------------------
                (3.14        ,       3.14), # Float
                ("2.72"      ,       2.72), # Float
                ("1e3"       ,       1000), # Float
                (1e-3        ,      0.001), # Float
            ]

        values += [
                 #        All ints
                 # ----------------------
                 # Input     |   Expected
                 # ----------------------
                ("0xA"       ,         10), 
                ("012"       ,         10), 
                ("0o12"      ,         10), 
                ("0b01010"   ,         10), 
                ("10"        ,         10), 
                ("10.0"      ,         10), 
                ("1e1"       ,         10), 
            ]

        for _input, expected in values:
            value = to_number(_input)

            if isinstance(_input, str):
                cmd = 'to_number("{}")'.format(_input)
            else:
                cmd = 'to_number({})'.format(_input)

            print("{:23} = {:10} == {:10}".format(cmd, value, expected))
            self.assertEqual(value, expected)

Answer 22

Use:

>>> str_float = "545.2222"
>>> float(str_float)
545.2222
>>> type(_) # Check its type
<type 'float'>

>>> str_int = "31"
>>> int(str_int)
31
>>> type(_) # Check its type
<type 'int'>

Answer 23

This is a function which will convert any object (not just str) to int or float, based on whether the actual string supplied looks like an int or a float. Further, if it's an object which has both __float__ and __int__ methods, it defaults to using __float__.

def conv_to_num(x, num_type='asis'):
    '''Converts an object to a number if possible.
    num_type: int, float, 'asis'
    Defaults to floating point in case of ambiguity.
    '''
    import numbers

    is_num, is_str, is_other = [False]*3

    if isinstance(x, numbers.Number):
        is_num = True
    elif isinstance(x, str):
        is_str = True

    is_other = not any([is_num, is_str])

    if is_num:
        res = x
    elif is_str:
        is_float, is_int, is_char = [False]*3
        try:
            res = float(x)
            if '.' in x:
                is_float = True
            else:
                is_int = True
        except ValueError:
            res = x
            is_char = True

    else:
        if num_type == 'asis':
            funcs = [int, float]
        else:
            funcs = [num_type]

        for func in funcs:
            try:
                res = func(x)
                break
            except TypeError:
                continue
        else:
            res = x

    # the listing otherwise ends without returning the converted value
    return res

Answer 24

By using the int and float functions, we can convert a string to an integer or a float.

s="45.8"
print(float(s))

y='67'
print(int(y))

Answer 25

eval() can solve this: it does not need to check whether the number is an int or a float, it simply returns the corresponding value. Because eval() executes whatever expression it is given, use it only on trusted input. If another method is required, try:

if '.' in string:
    print(float(string))
else:
    print(int(string))

try-except can also be used as an alternative. Try converting the string to an int inside the try block. If the string holds a float value, int() raises a ValueError, which is caught in the except block, like this:

try:
    print(int(string))
except ValueError:
    print(float(string))
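
The try-except approach can be wrapped in a small helper. This is a minimal sketch, assuming the input is a plain numeric string; the name parse_number is hypothetical and not part of the original answer.

```python
def parse_number(text):
    """Return an int when the text parses as one, otherwise a float."""
    try:
        return int(text)
    except ValueError:
        # int() rejects strings such as "45.8" or "1e3"; float() accepts them.
        return float(text)


print(parse_number("67"))    # 67
print(parse_number("45.8"))  # 45.8
```

A string that is not numeric at all still raises ValueError from float(), so the caller can decide how to handle it.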

Answer 26

Here’s another interpretation of your question (hint: it’s vague). It’s possible you’re looking for something like this:

def parseIntOrFloat( aString ):
    return eval( aString )

It works like this…

>>> parseIntOrFloat("545.2222")
545.22220000000004
>>> parseIntOrFloat("545")
545

Theoretically, there’s an injection vulnerability. The string could, for example, be "__import__('os').abort()". Without any background on where the string comes from, however, the possibility is theoretical speculation. Since the question is vague, it’s not at all clear whether this vulnerability actually exists.
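
If the input might not be trusted, the standard library's ast.literal_eval is one way to get the same int-or-float behavior without evaluating arbitrary expressions. The sketch below is an alternative to the eval() version, not the answer's own method; the name parse_int_or_float and the type check are assumptions added for illustration.

```python
import ast


def parse_int_or_float(text):
    """Parse a numeric literal without executing arbitrary code."""
    value = ast.literal_eval(text.strip())
    # literal_eval also accepts lists, dicts, strings, etc.; reject those here.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"not an int or float literal: {text!r}")
    return value


print(parse_int_or_float("545.2222"))  # 545.2222
print(parse_int_or_float("545"))       # 545
```

A string such as "__import__('os').abort()" is rejected with a ValueError instead of being executed.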


Why can't Python parse this JSON data?

Question: Why can't Python parse this JSON data?

I have this JSON in a file:

{
    "maps": [
        {
            "id": "blabla",
            "iscategorical": "0"
        },
        {
            "id": "blabla",
            "iscategorical": "0"
        }
    ],
    "masks": [
        "id": "valore"
    ],
    "om_points": "value",
    "parameters": [
        "id": "valore"
    ]
}

I wrote this script to print all of the JSON data:

import json
from pprint import pprint

with open('data.json') as f:
    data = json.load(f)

pprint(data)

This program raises an exception, though:

Traceback (most recent call last):
  File "<pyshell#1>", line 5, in <module>
    data = json.load(f)
  File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.5/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 13 column 13 (char 213)

How can I parse the JSON and extract its values?


Answer 0

Your data is not valid JSON. You have [] where you should have {}:

  • [] are for JSON arrays, which are called list in Python
  • {} are for JSON objects, which are called dict in Python

Here’s how your JSON file should look:

{
    "maps": [
        {
            "id": "blabla",
            "iscategorical": "0"
        },
        {
            "id": "blabla",
            "iscategorical": "0"
        }
    ],
    "masks": {
        "id": "valore"
    },
    "om_points": "value",
    "parameters": {
        "id": "valore"
    }
}

Then you can use your code:

import json
from pprint import pprint

with open('data.json') as f:
    data = json.load(f)

pprint(data)

With data, you can now also find values like so:

data["maps"][0]["id"]
data["masks"]["id"]
data["om_points"]

Try those out and see if it starts to make sense.
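
To see where a file like the original one goes wrong, the exception itself can be inspected. This is a small sketch using json.JSONDecodeError (available in Python 3.5 and later); the helper name find_json_error and the shortened sample strings are assumptions made for illustration.

```python
import json


def find_json_error(text):
    """Return (message, line, column) for invalid JSON, or None if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return err.msg, err.lineno, err.colno
    return None


# Key/value pairs inside [] (as in the question) versus inside {}.
broken = '{"masks": ["id": "valore"]}'
fixed = '{"masks": {"id": "valore"}}'

print(find_json_error(broken))
print(find_json_error(fixed))   # None
```

The reported line and column point at the colon after "id", which is the first character that cannot appear inside an array.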


Answer 1

Your data.json should look like this:

{
 "maps":[
         {"id":"blabla","iscategorical":"0"},
         {"id":"blabla","iscategorical":"0"}
        ],
"masks":
         {"id":"valore"},
"om_points":"value",
"parameters":
         {"id":"valore"}
}

Your code should be:

import json
from pprint import pprint

with open('data.json') as data_file:
    data = json.load(data_file)
pprint(data)

Note that this only works in Python 2.6 and up, as it depends on the with statement. In Python 2.5, use from __future__ import with_statement; for Python <= 2.4, see Justin Peel’s answer, on which this answer is based.

You can now also access single values like this:

data["maps"][0]["id"]  # will return 'blabla'
data["masks"]["id"]    # will return 'valore'
data["om_points"]      # will return 'value'

Answer 2

Justin Peel’s answer is really helpful, but if you are using Python 3, reading JSON can be done like this:

import json

with open('data.json', encoding='utf-8') as data_file:
    data = json.loads(data_file.read())

Note: this uses json.loads instead of json.load. In Python 3, json.loads takes a string parameter, while json.load takes a file-like object parameter. data_file.read() returns a string object.

To be honest, I don’t think it’s a problem to load all JSON data into memory in most cases.
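
For comparison, here is a short sketch showing that json.load on the open file and json.loads on its contents give the same result in Python 3. The temporary file and its one-key content are assumptions made so the example is self-contained.

```python
import json
import os
import tempfile

# Write a small sample file so the sketch is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.json', encoding='utf-8',
                                 delete=False) as tmp:
    tmp.write('{"om_points": "value"}')
    path = tmp.name

with open(path, encoding='utf-8') as data_file:
    from_file = json.load(data_file)            # takes a file-like object

with open(path, encoding='utf-8') as data_file:
    from_text = json.loads(data_file.read())    # takes a str

os.remove(path)
print(from_file == from_text)  # True
```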


Answer 3

import codecs
import json

data = []
# Each line of the file is parsed as a separate JSON document.
with codecs.open(r'd:\output.txt', 'r', 'utf-8') as f:
    for line in f:
        data.append(json.loads(line))

Answer 4

“Ultra JSON”, or simply “ujson”, can handle having [] in your JSON file input. If you’re reading a JSON input file into your program as a list of JSON elements, such as [{[{}]}, {}, [], etc...], ujson can handle any arbitrary nesting of lists of dictionaries and dictionaries of lists.

You can find ujson in the Python Package Index, and its API is almost identical to that of Python’s built-in json library.

ujson is also much faster if you’re loading larger JSON files. Its Python Package Index page includes performance comparisons with other Python JSON libraries.


Answer 5

If you’re using Python 3, you can try changing your JSON (the connection.json file) to:

{
  "connection1": {
    "DSN": "con1",
    "UID": "abc",
    "PWD": "1234",
    "connection_string_python":"test1"
  }
  ,
  "connection2": {
    "DSN": "con2",
    "UID": "def",
    "PWD": "1234"
  }
}

Then use the following code:

import json

with open('connection.json', 'r') as connection_file:
    conn_string = json.load(connection_file)

print(conn_string['connection1']['connection_string_python'])
# Output: test1

Answer 6

Here is the modified data.json file:

{
    "maps": [
        {
            "id": "blabla",
            "iscategorical": "0"
        },
        {
            "id": "blabla",
            "iscategorical": "0"
        }
    ],
    "masks": [{
        "id": "valore"
    }],
    "om_points": "value",
    "parameters": [{
        "id": "valore"
    }]
}

You can load the data and print it to the console with the following lines:

import json
from pprint import pprint
with open('data.json') as data_file:
    data_item = json.load(data_file)
pprint(data_item)

Expected output for pprint(data_item):

{'maps': [{'id': 'blabla', 'iscategorical': '0'},
          {'id': 'blabla', 'iscategorical': '0'}],
 'masks': [{'id': 'valore'}],
 'om_points': 'value',
 'parameters': [{'id': 'valore'}]}

Expected output for print(data_item['parameters'][0]['id']):

valore

Answer 7

There are two kinds of parsing here:

  1. Parsing data from a file on a system path
  2. Parsing JSON from a remote URL

For a file, you can use the following:

import json

with open('/path/to/file.json') as f:
    data = json.load(f)
value = data['key']
print(value)

The article “Parsing JSON using Python” explains full parsing and value extraction for both scenarios.
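
The answer shows code only for the file case. For the remote URL case, a minimal sketch with the standard library's urllib.request might look like this; the function name load_json_from_url, the timeout value, and the example URL are assumptions, not something taken from the linked article.

```python
import json
from urllib.request import urlopen


def load_json_from_url(url, timeout=10):
    """Fetch a URL and parse the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        # json.load accepts the binary file-like response object directly.
        return json.load(response)


# Hypothetical usage; replace the URL with a real JSON endpoint:
# data = load_json_from_url('https://example.com/data.json')
# print(data['key'])
```

The response is read as bytes, and json.load detects the UTF-8, UTF-16 or UTF-32 encoding itself, so no explicit decode step is needed for standard JSON.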


Answer 8

As a Python 3 user, I find the difference between the load and loads functions important, especially when reading JSON data from a file.

As stated in the docs:

json.load:

Deserialize fp (a .read()-supporting text file or binary file containing a JSON document) to a Python object using this conversion table.

json.loads:

Deserialize s (a str, bytes or bytearray instance containing a JSON document) to a Python object using this conversion table.

The json.load function can read an opened JSON document directly, since it accepts a text or binary file object.

with open('./recipes.json') as data:
  all_recipes = json.load(data)

As a result, your JSON data is available as the Python types specified by this conversion table:

https://docs.python.org/3.7/library/json.html#json-to-py-table
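
As a quick illustration of that conversion table, the sketch below parses a small document and prints the Python type of each value. The recipe fields are made up for the example.

```python
import json

document = (
    '{"name": "soup", "servings": 4, "rating": 4.5, '
    '"tags": ["hot"], "vegan": true, "notes": null}'
)
recipe = json.loads(document)

# JSON object -> dict, array -> list, string -> str,
# number -> int or float, true/false -> bool, null -> None
for key, value in recipe.items():
    print(key, type(value).__name__)
```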