Tag archive: serialization

Common use-cases for pickle in Python

Question: Common use-cases for pickle in Python

I’ve looked at the pickle documentation, but I don’t understand where pickle is useful.

What are some common use-cases for pickle?


Answer 0

Some uses that I have come across:

1) saving a program’s state data to disk so that it can carry on where it left off when restarted (persistence)

2) sending python data over a TCP connection in a multi-core or distributed system (marshalling)

3) storing python objects in a database

4) converting an arbitrary python object to a string so that it can be used as a dictionary key (e.g. for caching & memoization).

There are some issues with the last one: two equal objects can pickle to different strings, and even the same object pickled twice can have different representations. This is because the pickle stream reflects details beyond the object’s value, such as which sub-objects are shared (memoized) and, in some implementations, reference count information.
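
A small sketch of that caveat, using only built-in types: each pair below compares equal, yet the pickles differ, because the stream records insertion order and which sub-objects are shared. The variable names are illustrative, and the byte-level behaviour is what I would expect from CPython 3.7+; other versions may differ.

```python
import pickle

# Equal dicts built in a different insertion order.
d1 = {"a": 1, "b": 2}
d2 = {"b": 2, "a": 1}

# Equal lists: one holds the same string object twice, the other holds
# two distinct but equal string objects.
s = "hello world"
t = "".join(["hello", " ", "world"])  # equal to s, but a separate object
shared = [s, s]
separate = [s, t]

same_value_dicts = d1 == d2
same_bytes_dicts = pickle.dumps(d1) == pickle.dumps(d2)
same_value_lists = shared == separate
same_bytes_lists = pickle.dumps(shared) == pickle.dumps(separate)

print(same_value_dicts, same_bytes_dicts)
print(same_value_lists, same_bytes_lists)
```

If a stable cache key is needed, a canonical form (for example a tuple of sorted items for a flat dict) is usually a safer choice than the raw pickle bytes.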

To emphasise @lunaryorn’s comment – you should never unpickle a string from an untrusted source, since a carefully crafted pickle could execute arbitrary code on your system. For example see https://blog.nelhage.com/2011/03/exploiting-pickle/
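
The standard library documents one mitigation: subclass pickle.Unpickler and override find_class so that only an allow-list of globals can be loaded. A minimal sketch along those lines follows. The allow-list contents and the name restricted_loads are illustrative choices, and this narrows the attack surface rather than making untrusted pickles safe.

```python
import builtins
import io
import pickle

# Illustrative allow-list: only these built-in names may be loaded.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream asks for a global object.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(
            "global '%s.%s' is forbidden" % (module, name))

def restricted_loads(data):
    """Like pickle.loads(), but refuses globals outside the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain data and allow-listed built-ins still load.
print(restricted_loads(pickle.dumps([1, 2, range(15)])))
```

A stream that tries to reach os.system, such as the classic `cos\nsystem\n...` payload, is rejected with UnpicklingError before anything is called.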


Answer 1

Minimal round-trip example:

>>> import pickle
>>> class Anon(object):
...     pass
...
>>> a = Anon()
>>> a.foo = 'bar'
>>> pickled = pickle.dumps(a)
>>> unpickled = pickle.loads(pickled)
>>> unpickled.foo
'bar'

Edit: but as for the question of real-world examples of pickling, perhaps the most advanced use of pickling (you’d have to dig quite deep into the source) is ZODB: http://svn.zope.org/

Otherwise, PyPI mentions several: http://pypi.python.org/pypi?:action=search&term=pickle&submit=search

I have personally seen several examples of pickled objects being sent over the network as an easy to use network transfer protocol.


Answer 2

Pickling is absolutely necessary for distributed and parallel computing.

Say you wanted to do a parallel map-reduce with multiprocessing (or across cluster nodes with pyina), then you need to make sure the function you want to have mapped across the parallel resources will pickle. If it doesn’t pickle, you can’t send it to the other resources on another process, computer, etc. Also see here for a good example.

To do this, I use dill, which can serialize almost anything in python. Dill also has some good tools for helping you understand what is causing your pickling to fail when your code fails.

And yes, people use pickling to save the state of a calculation, an IPython session, or whatever else.
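
A quick way to see the constraint: the standard pickler stores functions by reference (module plus name), so anything the receiving process cannot look up by name fails to pickle. The helper can_pickle below is a hypothetical name, not part of pickle or multiprocessing; it is only a sketch of how one might check an object before handing it to a pool.

```python
import pickle

def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

# A built-in function can be found again by name, so it pickles.
print(can_pickle(len))
# A lambda has no importable name, so the standard pickler rejects it.
print(can_pickle(lambda x: x + 1))
```

This is the gap that dill aims to close: it can serialize many objects, lambdas included, that the standard pickler refuses.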


Answer 3

I have used it in one of my projects. The app did a lengthy task and processed lots of data, so if it was terminated while working, I needed to save the whole data structure and reload it when the app was run again. I used cPickle for this (the C implementation in Python 2), as speed was crucial and the data was really big.


Answer 4

Pickle is like “Save As..” and “Open..” for your data structures and classes. Let’s say I want to save my data structures so that they persist between program runs.

Saving:

import pickle

with open("save.p", "wb") as f:
    pickle.dump(myStuff, f)

Loading:

from collections import defaultdict

try:
    with open("save.p", "rb") as f:
        myStuff = pickle.load(f)
except (OSError, EOFError, pickle.UnpicklingError):
    # no usable save file yet: start from an empty structure
    myStuff = defaultdict(dict)

Now I don’t have to build myStuff from scratch all over again, and I can just pick(le) up from where I left off.


Answer 5

For a beginner (as is the case with me) it’s really hard to understand from the official documentation why you would use pickle in the first place. Maybe that is because the docs assume you already know the purpose of serialization. Only after reading a general description of serialization did I understand the reason for this module and its common use cases. Broad, language-independent explanations of serialization may also help: https://stackoverflow.com/a/14482962/4383472, What is serialization?, https://stackoverflow.com/a/3984483/4383472


Answer 6

To add a real-world example: The Sphinx documentation tool for Python uses pickle to cache parsed documents and cross-references between documents, to speed up subsequent builds of the documentation.


Answer 7

I can tell you what I use it for and have seen it used for:

  • Game profile saves
  • Game data saves, such as lives and health
  • Records of previous input, say numbers entered into a program

Those are the ones I use it for, at least.


Answer 8

I used pickling while scraping a website. I wanted to store more than 8000k URLs and process them as fast as possible, so I used pickling because its output quality is very high.

You can easily get back to the URL where you stopped, and even look up URL details by job directory keyword very quickly, to resume the process.
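
The resume-where-you-stopped idea can be sketched with the standard library alone. The function names save_checkpoint and load_checkpoint, the file name and the shape of the state dict are all assumptions made for illustration; the write goes through a temporary file so that a crash mid-save does not corrupt the previous checkpoint.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    """Write state to path atomically: dump to a temp file, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f, protocol=pickle.HIGHEST_PROTOCOL)
        os.replace(tmp_path, path)  # atomic when on the same filesystem
    except BaseException:
        # Clean up the partial temp file, then re-raise the original error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default

# Hypothetical crawl state: URLs still to visit and URLs already done.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "crawl.pkl")
state = load_checkpoint(checkpoint_path, {"pending": [], "done": set()})
state["pending"].append("http://example.com/a")
save_checkpoint(checkpoint_path, state)
```

Only load checkpoints your own program wrote; the usual warning about unpickling untrusted data applies here too.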


In Python, how can you load YAML mappings as OrderedDicts?

Question: In Python, how can you load YAML mappings as OrderedDicts?

I’d like to get PyYAML’s loader to load mappings (and ordered mappings) into the Python 2.7+ OrderedDict type, instead of the vanilla dict and the list of pairs it currently uses.

What’s the best way to do that?


Answer 0

Update: In Python 3.6+ you probably don’t need OrderedDict at all, thanks to the new dict implementation (which had been in use in PyPy for some time), although in 3.6 this is still considered a CPython implementation detail.

Update: In python 3.7+, the insertion-order preservation nature of dict objects has been declared to be an official part of the Python language spec, see What’s New In Python 3.7.

I like @James’ solution for its simplicity. However, it changes the default global yaml.Loader class, which can lead to troublesome side effects. Especially, when writing library code this is a bad idea. Also, it doesn’t directly work with yaml.safe_load().

Fortunately, the solution can be improved without much effort:

import yaml
from collections import OrderedDict

def ordered_load(stream, Loader=yaml.Loader, object_pairs_hook=OrderedDict):
    class OrderedLoader(Loader):
        pass
    def construct_mapping(loader, node):
        loader.flatten_mapping(node)
        return object_pairs_hook(loader.construct_pairs(node))
    OrderedLoader.add_constructor(
        yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
        construct_mapping)
    return yaml.load(stream, OrderedLoader)

# usage example:
ordered_load(stream, yaml.SafeLoader)

For serialization, I don’t know an obvious generalization, but at least this shouldn’t have any side effects:

def ordered_dump(data, stream=None, Dumper=yaml.Dumper, **kwds):
    class OrderedDumper(Dumper):
        pass
    def _dict_representer(dumper, data):
        return dumper.represent_mapping(
            yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
            data.items())
    OrderedDumper.add_representer(OrderedDict, _dict_representer)
    return yaml.dump(data, stream, OrderedDumper, **kwds)

# usage:
ordered_dump(data, Dumper=yaml.SafeDumper)
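
On Python 3.7+ the plain loader may already be enough, since it builds ordinary dicts in document order. The sketch below assumes PyYAML 5.1 or later (for the sort_keys argument of the dump functions); the sample document is made up for illustration.

```python
import yaml  # third-party: PyYAML, assumed to be version 5.1 or later

document = "b: 1\na: 2\nc: 3\n"

# On Python 3.7+ the resulting dict keeps the order of the YAML mapping.
data = yaml.safe_load(document)
print(list(data))

# sort_keys=False stops the dumper from sorting keys alphabetically.
text = yaml.safe_dump(data, sort_keys=False)
print(text)
```

On older interpreters, or when the result must really be an OrderedDict instance, the loader subclass approach is still the way to go.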

Answer 1

The yaml module allows you to specify custom ‘representers’ to convert Python objects to text and ‘constructors’ to reverse the process.

import collections
import yaml

_mapping_tag = yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG

def dict_representer(dumper, data):
    # items() replaces the Python 2-only iteritems()
    return dumper.represent_dict(data.items())

def dict_constructor(loader, node):
    return collections.OrderedDict(loader.construct_pairs(node))

# note: these register on the global default Dumper and Loader classes
yaml.add_representer(collections.OrderedDict, dict_representer)
yaml.add_constructor(_mapping_tag, dict_constructor)

Answer 2

2018 option:

oyaml is a drop-in replacement for PyYAML which preserves dict ordering. Both Python 2 and Python 3 are supported. Just pip install oyaml, and import as shown below:

import oyaml as yaml

You’ll no longer be annoyed by screwed-up mappings when dumping/loading.

Note: I’m the author of oyaml.


Answer 3

2015 (and later) option:

ruamel.yaml is a drop-in replacement for PyYAML (disclaimer: I am the author of that package). Preserving the order of the mappings was one of the things added in the first version (0.1) back in 2015. Not only does it preserve the order of your dictionaries, it also preserves comments, anchor names and tags, and it supports the YAML 1.2 specification (released 2009).

The specification says that the ordering is not guaranteed, but of course there is ordering in the YAML file, and the appropriate parser can just hold on to that and transparently generate an object that keeps the ordering. You just need to choose the right parser, loader and dumper:

import sys
from ruamel.yaml import YAML

yaml_str = """\
3: abc
conf:
    10: def
    3: gij     # h is missing
more:
- what
- else
"""

yaml = YAML()
data = yaml.load(yaml_str)
data['conf'][10] = 'klm'
data['conf'][3] = 'jig'
yaml.dump(data, sys.stdout)

will give you:

3: abc
conf:
  10: klm
  3: jig       # h is missing
more:
- what
- else

data is of type CommentedMap which functions like a dict, but has extra information that is kept around until being dumped (including the preserved comment!)


Answer 4

Note: there is a library, based on the following answer, which implements also the CLoader and CDumpers: Phynix/yamlloader

I doubt very much that this is the best way to do it, but this is the way I came up with, and it does work. Also available as a gist.

import yaml
import yaml.constructor

try:
    # included in standard lib from Python 2.7
    from collections import OrderedDict
except ImportError:
    # try importing the backported drop-in replacement
    # it's available on PyPI
    from ordereddict import OrderedDict

class OrderedDictYAMLLoader(yaml.Loader):
    """
    A YAML loader that loads mappings into ordered dictionaries.
    """

    def __init__(self, *args, **kwargs):
        yaml.Loader.__init__(self, *args, **kwargs)

        self.add_constructor(u'tag:yaml.org,2002:map', type(self).construct_yaml_map)
        self.add_constructor(u'tag:yaml.org,2002:omap', type(self).construct_yaml_map)

    def construct_yaml_map(self, node):
        data = OrderedDict()
        yield data
        value = self.construct_mapping(node)
        data.update(value)

    def construct_mapping(self, node, deep=False):
        if isinstance(node, yaml.MappingNode):
            self.flatten_mapping(node)
        else:
            raise yaml.constructor.ConstructorError(None, None,
                'expected a mapping node, but found %s' % node.id, node.start_mark)

        mapping = OrderedDict()
        for key_node, value_node in node.value:
            key = self.construct_object(key_node, deep=deep)
            try:
                hash(key)
            except TypeError as exc:
                raise yaml.constructor.ConstructorError('while constructing a mapping',
                    node.start_mark, 'found unacceptable key (%s)' % exc, key_node.start_mark)
            value = self.construct_object(value_node, deep=deep)
            mapping[key] = value
        return mapping

Answer 5

Update: the library was deprecated in favor of the yamlloader (which is based on the yamlordereddictloader)

I’ve just found a Python library (https://pypi.python.org/pypi/yamlordereddictloader/0.1.1) which was created based on answers to this question and is quite simple to use:

import yaml
import yamlordereddictloader

with open('myfile.yml') as f:
    datas = yaml.load(f, Loader=yamlordereddictloader.Loader)

Answer 6

On my PyYAML installation for Python 2.7 I updated __init__.py, constructor.py, and loader.py. It now supports an object_pairs_hook option for the load commands. A diff of the changes I made is below.

__init__.py

$ diff __init__.py Original
64c64
< def load(stream, Loader=Loader, **kwds):
---
> def load(stream, Loader=Loader):
69c69
<     loader = Loader(stream, **kwds)
---
>     loader = Loader(stream)
75c75
< def load_all(stream, Loader=Loader, **kwds):
---
> def load_all(stream, Loader=Loader):
80c80
<     loader = Loader(stream, **kwds)
---
>     loader = Loader(stream)

constructor.py

$ diff constructor.py Original
20,21c20
<     def __init__(self, object_pairs_hook=dict):
<         self.object_pairs_hook = object_pairs_hook
---
>     def __init__(self):
27,29d25
<     def create_object_hook(self):
<         return self.object_pairs_hook()
<
54,55c50,51
<         self.constructed_objects = self.create_object_hook()
<         self.recursive_objects = self.create_object_hook()
---
>         self.constructed_objects = {}
>         self.recursive_objects = {}
129c125
<         mapping = self.create_object_hook()
---
>         mapping = {}
400c396
<         data = self.create_object_hook()
---
>         data = {}
595c591
<             dictitems = self.create_object_hook()
---
>             dictitems = {}
602c598
<             dictitems = value.get('dictitems', self.create_object_hook())
---
>             dictitems = value.get('dictitems', {})

loader.py

$ diff loader.py Original
13c13
<     def __init__(self, stream, **constructKwds):
---
>     def __init__(self, stream):
18c18
<         BaseConstructor.__init__(self, **constructKwds)
---
>         BaseConstructor.__init__(self)
23c23
<     def __init__(self, stream, **constructKwds):
---
>     def __init__(self, stream):
28c28
<         SafeConstructor.__init__(self, **constructKwds)
---
>         SafeConstructor.__init__(self)
33c33
<     def __init__(self, stream, **constructKwds):
---
>     def __init__(self, stream):
38c38
<         Constructor.__init__(self, **constructKwds)
---
>         Constructor.__init__(self)

Answer 7

Here’s a simple solution that also checks for duplicated top-level keys in your map.

import yaml
import re
from collections import OrderedDict

def yaml_load_od(fname):
    "load a yaml file as an OrderedDict"
    # detects any duped keys (fail on this) and preserves order of top level keys
    with open(fname, 'r') as f:
        lines = f.read().splitlines()
        top_keys = []
        duped_keys = []
        for line in lines:
            m = re.search(r'^([A-Za-z0-9_]+) *:', line)
            if m:
                if m.group(1) in top_keys:
                    duped_keys.append(m.group(1))
                else:
                    top_keys.append(m.group(1))
        if duped_keys:
            raise Exception('ERROR: duplicate keys: {}'.format(duped_keys))
    # 2nd pass to set up the OrderedDict
    with open(fname, 'r') as f:
        d_tmp = yaml.safe_load(f)  # yaml.load(f) without a Loader fails on PyYAML 6+
    return OrderedDict([(key, d_tmp[key]) for key in top_keys])

Serialising an Enum member to JSON

Question: Serialising an Enum member to JSON

How do I serialise a Python Enum member to JSON, so that I can deserialise the resulting JSON back into a Python object?

For example, this code:

from enum import Enum    
import json

class Status(Enum):
    success = 0

json.dumps(Status.success)

results in the error:

TypeError: <Status.success: 0> is not JSON serializable

How can I avoid that?


Answer 0

If you want to encode an arbitrary enum.Enum member to JSON and then decode it as the same enum member (rather than simply the enum member’s value attribute), you can do so by writing a custom JSONEncoder class, and a decoding function to pass as the object_hook argument to json.load() or json.loads():

PUBLIC_ENUMS = {
    'Status': Status,
    # ...
}

class EnumEncoder(json.JSONEncoder):
    def default(self, obj):
        if type(obj) in PUBLIC_ENUMS.values():
            return {"__enum__": str(obj)}
        return json.JSONEncoder.default(self, obj)

def as_enum(d):
    if "__enum__" in d:
        name, member = d["__enum__"].split(".")
        return getattr(PUBLIC_ENUMS[name], member)
    else:
        return d

The as_enum function relies on the JSON having been encoded using EnumEncoder, or something which behaves identically to it.

The restriction to members of PUBLIC_ENUMS is necessary to avoid a maliciously crafted text being used to, for example, trick calling code into saving private information (e.g. a secret key used by the application) to an unrelated database field, from where it could then be exposed (see http://chat.stackoverflow.com/transcript/message/35999686#35999686).

Example usage:

>>> data = {
...     "action": "frobnicate",
...     "status": Status.success
... }
>>> text = json.dumps(data, cls=EnumEncoder)
>>> text
'{"status": {"__enum__": "Status.success"}, "action": "frobnicate"}'
>>> json.loads(text, object_hook=as_enum)
{'status': <Status.success: 0>, 'action': 'frobnicate'}

Answer 1

I know this is old but I feel this will help people. I just went through this exact problem and discovered if you’re using string enums, declaring your enums as a subclass of str works well for almost all situations:

import json
from enum import Enum

class LogLevel(str, Enum):
    DEBUG = 'DEBUG'
    INFO = 'INFO'

print(LogLevel.DEBUG)
print(json.dumps(LogLevel.DEBUG))
print(json.loads('"DEBUG"'))
print(LogLevel('DEBUG'))

Will output:

LogLevel.DEBUG
"DEBUG"
DEBUG
LogLevel.DEBUG

As you can see, loading the JSON outputs the string DEBUG but it is easily castable back into a LogLevel object. A good option if you don’t want to create a custom JSONEncoder.


Answer 2

The correct answer depends on what you intend to do with the serialized version.

If you are going to unserialize back into Python, see Zero’s answer.

If your serialized version is going to another language then you probably want to use an IntEnum instead, which is automatically serialized as the corresponding integer:

from enum import IntEnum
import json

class Status(IntEnum):
    success = 0
    failure = 1

json.dumps(Status.success)

and this returns:

'0'
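
Going back from the integer to the member is a lookup by value. A minimal sketch, where the surrounding dict and its "status" key are just illustrative:

```python
import json
from enum import IntEnum

class Status(IntEnum):
    success = 0
    failure = 1

# IntEnum members are ints, so json encodes them without any hook.
text = json.dumps({"status": Status.failure})

# Decoding yields a plain int; calling the enum looks the member up by value.
decoded = json.loads(text)
status = Status(decoded["status"])

print(text)
print(status is Status.failure)
```

The JSON itself carries no hint that the number is an enum, so the receiving side has to know which field maps to which enum class.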

Answer 3

In Python 3.7, you can just use json.dumps(enum_obj, default=str).
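
A short sketch of what that produces for a plain Enum, and one way to map the string back. The parsing step is my own assumption about how one might reverse it; it relies on str() of a plain Enum member having the form "ClassName.member".

```python
import json
from enum import Enum

class Status(Enum):
    success = 0

# default=str is called for objects json cannot encode natively.
text = json.dumps({"status": Status.success}, default=str)

# The decoder only sees a string; split off the member name and look it up.
name = json.loads(text)["status"]
member = Status[name.split(".", 1)[1]]

print(text)
print(member is Status.success)
```

Note that default=str applies to every otherwise unserializable object in the payload, not only to enums, which can hide mistakes.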


Answer 4

I liked Zero Piraeus’ answer, but modified it slightly for working with the API for Amazon Web Services (AWS) known as Boto.

class EnumEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Enum):
            return obj.name
        return json.JSONEncoder.default(self, obj)

I then added this method to my data model:

    def ToJson(self) -> str:
        return json.dumps(self.__dict__, cls=EnumEncoder, indent=1, sort_keys=True)

I hope this helps someone.


Answer 5

If you are using jsonpickle the easiest way should look as below.

from enum import Enum
import jsonpickle


@jsonpickle.handlers.register(Enum, base=True)
class EnumHandler(jsonpickle.handlers.BaseHandler):

    def flatten(self, obj, data):
        return obj.value  # Convert to json friendly format


if __name__ == '__main__':
    class Status(Enum):
        success = 0
        error = 1

    class SimpleClass:
        pass

    simple_class = SimpleClass()
    simple_class.status = Status.success

    json = jsonpickle.encode(simple_class, unpicklable=False)
    print(json)

After Json serialization you will have as expected {"status": 0} instead of

{"status": {"__objclass__": {"py/type": "__main__.Status"}, "_name_": "success", "_value_": 0}}

Answer 6

This worked for me:

class Status(Enum):
    success = 0

    def __json__(self):
        return self.value

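I didn’t have to change anything else. Obviously, you’ll only get the value out of this, and you will need to do some other work if you want to convert the serialized value back into the enum later. Note that the standard library’s json module does not call a __json__ method by itself, so this relies on a framework or encoder that looks for one.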
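
With the plain json module, the same convention can be wired up through the default hook. This is a sketch under that assumption; call_json_hook is a hypothetical helper name, not something json provides.

```python
import json
from enum import Enum

class Status(Enum):
    success = 0

    def __json__(self):
        return self.value

def call_json_hook(obj):
    """default= hook: use obj.__json__() when the object defines it."""
    hook = getattr(obj, "__json__", None)
    if callable(hook):
        return hook()
    raise TypeError("%s is not JSON serializable" % type(obj).__name__)

text = json.dumps({"status": Status.success}, default=call_json_hook)
print(text)
```

Objects without a __json__ method still raise TypeError, which keeps the usual failure mode for genuinely unsupported types.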


Not JSON serializable

Question: Not JSON serializable

I have the following code for serializing the queryset:

def render_to_response(self, context, **response_kwargs):

    return HttpResponse(json.simplejson.dumps(list(self.get_queryset())),
                        mimetype="application/json")

And the following is the output of my get_queryset():

[{'product': <Product: hederello ()>, u'_id': u'9802', u'_source': {u'code': u'23981', u'facilities': [{u'facility': {u'name': {u'fr': u'G\xe9n\xe9ral', u'en': u'General'}, u'value': {u'fr': [u'bar', u'r\xe9ception ouverte 24h/24', u'chambres non-fumeurs', u'chambres familiales',.........]}]

I need to serialize this, but it fails with an error saying that <Product: hederello ()> is not serializable, because the list is composed of both Django objects and dicts. Any ideas?


Answer 0

simplejson and json don’t work well with Django objects.

Django’s built-in serializers can only serialize querysets filled with django objects:

data = serializers.serialize('json', self.get_queryset())
return HttpResponse(data, content_type="application/json")

In your case, self.get_queryset() contains a mix of django objects and dicts inside.

One option is to get rid of model instances in the self.get_queryset() and replace them with dicts using model_to_dict:

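import json

from django.forms.models import model_to_dict
from django.http import HttpResponse

# Materialize the result first so the modified items are the ones serialized
data = list(self.get_queryset())

for item in data:
    item['product'] = model_to_dict(item['product'])

# Recent Django versions take content_type; the older mimetype argument was removed
return HttpResponse(json.dumps(data), content_type="application/json")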

Hope that helps.


Answer 1

The easiest way is to use a JsonResponse.

For a queryset, you should pass a list of the values for that queryset, like so:

from django.http import JsonResponse

queryset = YourModel.objects.filter(some__filter="some value").values()
return JsonResponse({"models_to_return": list(queryset)})

Answer 2

I found that this can be done rather simply using the “.values” method, which also gives named fields:

result_list = list(my_queryset.values('first_named_field', 'second_named_field'))
return HttpResponse(json.dumps(result_list))

“list” must be used to get the data as a plain list of dicts, since the “values queryset” type only yields dicts when it is iterated over and is not itself JSON serializable.

Documentation: https://docs.djangoproject.com/en/1.7/ref/models/querysets/#values


Answer 3

From version 1.9, there is an easier, official way of getting JSON:

from django.http import JsonResponse
from django.forms.models import model_to_dict


return JsonResponse(model_to_dict(modelinstance))

Answer 4

Our JS programmer asked me to return the data as actual JSON rather than as a JSON-encoded string.

Below is the solution. (This returns an object that can be used or viewed directly in the browser.)

import json

from django.core import serializers
from django.http import HttpResponse

from xxx.models import alert

def test(request):
    alert_list = alert.objects.all()

    # serialize() returns a JSON string; round-trip it into Python objects
    tmpJson = serializers.serialize("json", alert_list)
    tmpObj = json.loads(tmpJson)

    return HttpResponse(json.dumps(tmpObj))

Answer 5

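First, I added a to_dict method to my model:

def to_dict(self):
    return {"name": self.woo, "title": self.foo}

Then I have this:

from functools import partial
from json import JSONEncoder, dumps

from django.db import models


class DjangoJSONEncoder(JSONEncoder):

    def default(self, obj):
        if isinstance(obj, models.Model):
            return obj.to_dict()
        return JSONEncoder.default(self, obj)


# functools.partial stands in for the curry helper that older Django versions shipped
dumps = partial(dumps, cls=DjangoJSONEncoder)

Finally, I use this encoder to serialize my queryset:

def render_to_response(self, context, **response_kwargs):
    # list() is needed because a QuerySet itself is not JSON serializable
    return HttpResponse(dumps(list(self.get_queryset())))

This works quite well.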
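
The same encoder idea can be tried without Django. The sketch below is only an illustration: Product is a made-up stand-in for a model, and the encoder checks for a to_dict method instead of models.Model so that it runs with the standard library alone.

```python
import json
from functools import partial


class Product:
    """Stand-in for a Django model (hypothetical, for illustration only)."""

    def __init__(self, name, title):
        self.name = name
        self.title = title

    def to_dict(self):
        return {"name": self.name, "title": self.title}


class ToDictEncoder(json.JSONEncoder):
    def default(self, obj):
        # default() is called only for objects json cannot serialize natively.
        if hasattr(obj, "to_dict"):
            return obj.to_dict()
        return super().default(obj)  # raises TypeError for anything else


dumps = partial(json.dumps, cls=ToDictEncoder)

# A list mixing plain dict data with a model-like object, as in the question.
mixed = [{"product": Product("hederello", "Hotel"), "_id": "9802"}]
encoded = dumps(mixed)
print(encoded)
```

Because the hook sits in the encoder, the objects can be nested at any depth inside dicts and lists and are still converted.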


When to use the Serializer’s create() and the ModelViewSet’s create() / perform_create()

Question: When to use the Serializer’s create() and the ModelViewSet’s create() / perform_create()

I want to clarify the given documentation of django-rest-framework regarding the creation of a model object. So far I found that there are 3 approaches on how to handle such events.

  1. The Serializer’s create() method. Here is the documentation

    class CommentSerializer(serializers.Serializer):
    
        def create(self, validated_data):
            return Comment.objects.create(**validated_data)
    
  2. The ModelViewset create() method. Documentation

    class AccountViewSet(viewsets.ModelViewSet):
    
        queryset = Account.objects.all()
        serializer_class = AccountSerializer
        permission_classes = [IsAccountAdminOrReadOnly]
    
  3. The ModelViewset perform_create() method. Documentation

    class SnippetViewSet(viewsets.ModelViewSet):
    
        def perform_create(self, serializer):
            serializer.save(owner=self.request.user)
    

These three approaches are important depending on your application environment.

But WHEN do we need to use each create() / perform_create() function? On the other hand, I found an account saying that two create methods were called for a single POST request: the ModelViewSet’s create() and the serializer’s create().

Hopefully anyone would share some of their knowledge to explain and this will surely be very helpful in my development process.


Answer 0

  1. You would use create(self, validated_data) to add any extra details to the object before saving, and to “prod” values into each model field, just as **validated_data does. Ideally, you want to do this “prodding” in only ONE place, so the create method in your CommentSerializer is the best place for it. On top of this, you might also want to call external APIs to create user accounts on their side just before saving the accounts into your own database. You should use this create function in conjunction with ModelViewSet. Always think: “thin views, thick serializers”.

Example:

def create(self, validated_data):
    # Remove the email from the validated data so it is not passed twice below
    email = validated_data.pop("email", None)
    # Now you have a clean valid email string
    # You might want to call an external API or modify another table
    # (e.g. keep track of the number of accounts registered) or even
    # make changes to the email format.

    # Once you are done, create the instance with the validated data
    return models.YourModel.objects.create(email=email, **validated_data)

  2. The create(self, request, *args, **kwargs) function of the ModelViewSet is defined in the CreateModelMixin class, which is a parent of ModelViewSet. CreateModelMixin’s main functions are these:

    from rest_framework import status
    from rest_framework.response import Response
    
    
    def create(self, request, *args, **kwargs):
        serializer = self.get_serializer(data=request.data)
        serializer.is_valid(raise_exception=True)
        self.perform_create(serializer)
        headers = self.get_success_headers(serializer.data)
        return Response(serializer.data, status=status.HTTP_201_CREATED, headers=headers)
    
    def perform_create(self, serializer):
        serializer.save()
    

As you can see, the create function above takes care of calling validation on your serializer and producing the correct response. The beauty of this is that you can now isolate your application logic and NOT concern yourself with the mundane and repetitive validation calls and response handling :). This works quite well in conjunction with the create(self, validated_data) found in the serializer (where your specific application logic might reside).

  3. Now you might ask why we have a separate perform_create(self, serializer) function with just one line of code. Well, the main reason is to allow customization when calling the save function. You might want to supply extra data before calling save (like serializer.save(owner=self.request.user)), and if we didn’t have perform_create(self, serializer), you would have to override create(self, request, *args, **kwargs), which defeats the purpose of having mixins do the heavy and boring work.
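
The call order described in the points above can be sketched without Django at all. The classes below are hypothetical stand-ins written for illustration, not the real rest_framework classes: FakeViewSet.create() validates, delegates to perform_create(), which calls save(), which in turn calls the serializer's create(). One simplification to note: this perform_create returns the created object so it is easy to inspect, whereas the mixin shown above ignores the return value of save().

```python
class FakeSerializer:
    """Hypothetical stand-in for a serializer; not the real DRF class."""

    def __init__(self, data):
        self.initial_data = data
        self.validated_data = None
        self.calls = []  # records the order in which methods run

    def is_valid(self):
        self.calls.append("is_valid")
        self.validated_data = dict(self.initial_data)
        return True

    def save(self, **kwargs):
        self.calls.append("save")
        # Extra keyword arguments are merged into the validated data.
        return self.create({**self.validated_data, **kwargs})

    def create(self, validated_data):
        self.calls.append("create")
        return validated_data  # a real serializer would return a model instance


class FakeViewSet:
    """Hypothetical stand-in for a viewset using the create mixin."""

    def __init__(self, user):
        self.user = user

    def create(self, data):
        serializer = FakeSerializer(data)
        serializer.is_valid()
        instance = self.perform_create(serializer)
        return serializer, instance

    def perform_create(self, serializer):
        # The hook: inject request-specific data at save time.
        return serializer.save(owner=self.user)


serializer, instance = FakeViewSet("alice").create({"title": "hi"})
print(serializer.calls, instance)
```

Overriding only perform_create keeps the validation and response handling of create() untouched, which is the point made in item 3.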

Hope this helps!


How can I lazily read multiple JSON values from a file/stream in Python?

Question: How can I lazily read multiple JSON values from a file/stream in Python?

I’d like to read multiple JSON objects from a file/stream in Python, one at a time. Unfortunately json.load() just .read()s until end-of-file; there doesn’t seem to be any way to use it to read a single object or to lazily iterate over the objects.

Is there any way to do this? Using the standard library would be ideal, but if there’s a third-party library I’d use that instead.

At the moment I’m putting each object on a separate line and using json.loads(f.readline()), but I would really prefer not to need to do this.

Example Use

example.py

import my_json as json
import sys

for o in json.iterload(sys.stdin):
    print("Working on a", type(o))

in.txt

{"foo": ["bar", "baz"]} 1 2 [] 4 5 6

example session

$ python3.2 example.py < in.txt
Working on a dict
Working on a int
Working on a int
Working on a list
Working on a int
Working on a int
Working on a int

Answer 0

Here’s a much, much simpler solution. The secret is to try, fail, and use the information in the exception to parse correctly. The only limitation is the file must be seekable.

def stream_read_json(fn):
    import json
    start_pos = 0
    with open(fn, 'r') as f:
        while True:
            try:
                obj = json.load(f)
                yield obj
                return
            except json.JSONDecodeError as e:
                f.seek(start_pos)
                json_str = f.read(e.pos)
                obj = json.loads(json_str)
                start_pos += e.pos
                yield obj

Edit: just noticed that this will only work for Python >=3.5. For earlier, failures return a ValueError, and you have to parse out the position from the string, e.g.

def stream_read_json(fn):
    import json
    import re
    start_pos = 0
    with open(fn, 'r') as f:
        while True:
            try:
                obj = json.load(f)
                yield obj
                return
            except ValueError as e:
                f.seek(start_pos)
                end_pos = int(re.match(r'Extra data: line \d+ column \d+ .*\(char (\d+).*\)',
                                    e.args[0]).groups()[0])
                json_str = f.read(end_pos)
                obj = json.loads(json_str)
                start_pos += end_pos
                yield obj

Answer 1

JSON generally isn’t very good for this sort of incremental use; there’s no standard way to serialise multiple objects so that they can easily be loaded one at a time, without parsing the whole lot.

The object-per-line solution that you’re using is seen elsewhere too. Scrapy calls it ‘JSON lines’.

You can do it slightly more Pythonically:

def iter_json_lines(f):
    for jsonline in f:
        yield json.loads(jsonline)   # or do the processing in this loop

I think this is about the best way – it doesn’t rely on any third party libraries, and it’s easy to understand what’s going on. I’ve used it in some of my own code as well.
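
For completeness, here is a small round trip of the one-object-per-line idea, using an in-memory buffer in place of a real file. The helper names dump_lines and iter_lines are made up for this sketch. It relies on the fact that json.dumps, with its default settings, escapes newlines inside strings and emits no line breaks of its own, so each value occupies exactly one line.

```python
import io
import json


def dump_lines(objs, fp):
    """Write each object as one line of JSON."""
    for obj in objs:
        fp.write(json.dumps(obj) + "\n")


def iter_lines(fp):
    """Lazily yield one decoded value per non-blank line."""
    for line in fp:
        if line.strip():  # tolerate blank lines
            yield json.loads(line)


buf = io.StringIO()
dump_lines([{"foo": ["bar", "baz"]}, 1, "two\nlines", []], buf)
buf.seek(0)
values = list(iter_lines(buf))
print(values)
```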


Answer 2

A little late maybe, but I had this exact problem (well, more or less). My standard solution for these problems is usually to just do a regex split on some well-known root object, but in my case it was impossible. The only feasible way to do this generically is to implement a proper tokenizer.

After not finding a generic enough and reasonably well-performing solution, I ended up doing this myself, writing the splitstream module. It is a pre-tokenizer that understands JSON and XML and splits a continuous stream into multiple chunks for parsing (it leaves the actual parsing up to you, though). To get reasonable performance out of it, it is written as a C module.

Example:

import json
import sys

from splitstream import splitfile

def iterload(stream=sys.stdin):
    for jsonstr in splitfile(stream, format="json"):
        yield json.loads(jsonstr)

Answer 3

Sure you can do this. You just have to use raw_decode directly. This implementation loads the whole file into memory and operates on that string (much as json.load does); if you have large files, you can modify it without much difficulty to read from the file only as necessary.

import json
from json.decoder import WHITESPACE

def iterload(string_or_fp, cls=json.JSONDecoder, **kwargs):
    # Accept either a file-like object or a string
    # (the Python 2 builtin `file` no longer exists in Python 3)
    if hasattr(string_or_fp, 'read'):
        string = string_or_fp.read()
    else:
        string = str(string_or_fp)

    decoder = cls(**kwargs)
    # Skip leading whitespace, then decode one value at a time
    idx = WHITESPACE.match(string, 0).end()
    while idx < len(string):
        obj, end = decoder.raw_decode(string, idx)
        yield obj
        idx = WHITESPACE.match(string, end).end()

Usage: just as you requested, it’s a generator.
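
The modification hinted at above, reading from the file only as needed, might look roughly like this. It is a sketch under some assumptions rather than a tested library: iterload_chunked and chunk_size are names made up here, and a decode error before end-of-file is treated as "value not complete yet, read more". It re-parses the buffer after every read, so a single very large value costs more than one pass.

```python
import io
import json


def iterload_chunked(fp, chunk_size=4096):
    """Yield JSON values from a text stream, reading chunk_size characters at a time."""
    decoder = json.JSONDecoder()
    buf = ""
    eof = False
    while not eof:
        chunk = fp.read(chunk_size)
        eof = not chunk
        buf += chunk
        while True:
            buf = buf.lstrip()
            if not buf:
                break
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                if eof:
                    raise  # genuinely malformed or truncated input
                break  # probably an incomplete value: read more
            if not eof:
                # A value touching the end of the buffer may continue in the
                # next chunk ("12" of "123"), and "1." may still become "1.5".
                if end == len(buf):
                    break
                if type(obj) in (int, float) and buf[end] in ".eE":
                    break
            yield obj
            buf = buf[end:]


data = '{"foo": ["bar", "baz"]} 1 2 [] 4 5 6 -1.5e3 "a b" true'
for value in iterload_chunked(io.StringIO(data), chunk_size=3):
    print("Working on a", type(value).__name__)
```

Holding back a value that ends exactly at the buffer boundary delays it by one read, but avoids emitting a truncated number.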


Answer 4

This is actually a pretty nasty problem, because you have to stream in lines, yet pattern match across multiple lines against braces, and also pattern match JSON. It’s a sort of JSON pre-parse followed by a JSON parse. Compared to other formats, JSON is easy to parse, so it’s not always necessary to reach for a parsing library. Nevertheless, how should we solve these conflicting issues?

Generators to the rescue!

The beauty of generators for a problem like this is you can stack them on top of each other gradually abstracting away the difficulty of the problem whilst maintaining laziness. I also considered using the mechanism for passing back values into a generator (send()) but fortunately found I didn’t need to use that.

To solve the first of the problems you need some sort of streamingfinditer, as a streaming version of re.finditer. My attempt at this below pulls in lines as needed (uncomment the debug statement to see) whilst still returning matches. I actually then modified it slightly to yield non-matched lines as well as matches (marked as 0 or 1 in the first part of the yielded tuple).

import re

def streamingfinditer(pat,stream):
  for s in stream:
#    print "Read next line: " + s
    while 1:
      m = re.search(pat,s)
      if not m:
        yield (0,s)
        break
      yield (1,m.group())
      s = re.split(pat,s,1)[1]

With that, it’s then possible to match up until braces, account each time for whether the braces are balanced, and then return either simple or compound objects as appropriate.

braces='{}[]'
whitespaceesc=' \t'
bracesesc='\\'+'\\'.join(braces)
balancemap=dict(zip(braces,[1,-1,1,-1]))
bracespat='['+bracesesc+']'
nobracespat='[^'+bracesesc+']*'
untilbracespat=nobracespat+bracespat

def simpleorcompoundobjects(stream):
  obj = ""
  unbalanced = 0
  for (c,m) in streamingfinditer(re.compile(untilbracespat),stream):
    if (c == 0): # remainder of line returned, nothing interesting
      if (unbalanced == 0):
        yield (0,m)
      else:
        obj += m
    if (c == 1): # match returned
      if (unbalanced == 0):
        yield (0,m[:-1])
        obj += m[-1]
      else:
        obj += m
      unbalanced += balancemap[m[-1]]
      if (unbalanced == 0):
        yield (1,obj)
        obj="" 

This returns tuples as follows:

(0,"String of simple non-braced objects easy to parse")
(1,"{ 'Compound' : 'objects' }")

Basically that’s the nasty part done. We now just have to do the final level of parsing as we see fit. For example we can use Jeremy Roman’s iterload function (Thanks!) to do parsing for a single line:

def streamingiterload(stream):
  for c,o in simpleorcompoundobjects(stream):
    for x in iterload(o):
      yield x 

Test it:

of = open("test.json","w") 
of.write("""[ "hello" ] { "goodbye" : 1 } 1 2 {
} 2
9 78
 4 5 { "animals" : [ "dog" , "lots of mice" ,
 "cat" ] }
""")
of.close()
# open & stream the json
f = open("test.json","r")
for o in streamingiterload(f.readlines()):
  print o
f.close()

I get these results (and if you turn on that debug line, you’ll see it pulls in the lines as needed):

[u'hello']
{u'goodbye': 1}
1
2
{}
2
9
78
4
5
{u'animals': [u'dog', u'lots of mice', u'cat']}

This won’t work for all situations. Due to the implementation of the json library, it is impossible to work entirely correctly without reimplementing the parser yourself.
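
One case where plain brace matching goes wrong is a brace inside a JSON string. A minimal sketch of a splitter that tracks string and escape state while counting depth is shown below. split_top_level is a hypothetical helper: it only yields top-level objects and arrays (bare scalars between them are skipped), and it works on a complete string rather than a stream.

```python
import json


def split_top_level(text):
    """Yield each top-level {...} or [...] substring of text."""
    depth = 0
    start = None
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False      # this character was escaped; skip it
            elif ch == "\\":
                escaped = True       # next character is escaped
            elif ch == '"':
                in_string = False    # closing quote
            continue
        if ch == '"':
            in_string = True
        elif ch in "{[":
            if depth == 0:
                start = i
            depth += 1
        elif ch in "}]":
            depth -= 1
            if depth == 0:
                yield text[start:i + 1]


text = '{"a": "}{"} [1, "]"] 7 {"b": [2, {"c": "\\""}]}'
chunks = list(split_top_level(text))
print([json.loads(c) for c in chunks])
```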


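Answer 5

I believe a better way of doing this would be to use a state machine. Below is sample code that I worked out by converting the NodeJS code at the link below to Python 3 (it uses the nonlocal keyword, which is only available in Python 3, so the code won’t work on Python 2).

Edit 1: Updated the code and made it compatible with Python 2

Edit 2: Updated and added a Python 3 only version

https://gist.github.com/creationix/5992451

Python 3 only version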

# A streaming byte oriented JSON parser.  Feed it a single byte at a time and
# it will emit complete objects as it comes across them.  Whitespace within and
# between objects is ignored.  This means it can parse newline delimited JSON.
import math


def json_machine(emit, next_func=None):
    def _value(byte_data):
        if not byte_data:
            return

        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _value  # Ignore whitespace

        if byte_data == 0x22:  # "
            return string_machine(on_value)

        if byte_data == 0x2d or (0x30 <= byte_data < 0x3a):  # - or 0-9
            return number_machine(byte_data, on_number)

        if byte_data == 0x7b:  # {
            return object_machine(on_value)

        if byte_data == 0x5b:  # [
            return array_machine(on_value)

        if byte_data == 0x74:  # t
            return constant_machine(TRUE, True, on_value)

        if byte_data == 0x66:  # f
            return constant_machine(FALSE, False, on_value)

        if byte_data == 0x6e:  # n
            return constant_machine(NULL, None, on_value)

        if next_func == _value:
            raise Exception("Unexpected 0x" + str(byte_data))

        return next_func(byte_data)

    def on_value(value):
        emit(value)
        return next_func

    def on_number(number, byte):
        emit(number)
        return _value(byte)

    next_func = next_func or _value
    return _value


TRUE = [0x72, 0x75, 0x65]
FALSE = [0x61, 0x6c, 0x73, 0x65]
NULL = [0x75, 0x6c, 0x6c]


def constant_machine(bytes_data, value, emit):
    i = 0
    length = len(bytes_data)

    def _constant(byte_data):
        nonlocal i
        if byte_data != bytes_data[i]:
            i += 1
            raise Exception("Unexpected 0x" + str(byte_data))

        i += 1
        if i < length:
            return _constant
        return emit(value)

    return _constant


def string_machine(emit):
    string = ""

    def _string(byte_data):
        nonlocal string

        if byte_data == 0x22:  # "
            return emit(string)

        if byte_data == 0x5c:  # \
            return _escaped_string

        if byte_data & 0x80:  # UTF-8 handling
            return utf8_machine(byte_data, on_char_code)

        if byte_data < 0x20:  # ASCII control character
            raise Exception("Unexpected control character: 0x" + str(byte_data))

        string += chr(byte_data)
        return _string

    def _escaped_string(byte_data):
        nonlocal string

        if byte_data == 0x22 or byte_data == 0x5c or byte_data == 0x2f:  # " \ /
            string += chr(byte_data)
            return _string

        if byte_data == 0x62:  # b
            string += "\b"
            return _string

        if byte_data == 0x66:  # f
            string += "\f"
            return _string

        if byte_data == 0x6e:  # n
            string += "\n"
            return _string

        if byte_data == 0x72:  # r
            string += "\r"
            return _string

        if byte_data == 0x74:  # t
            string += "\t"
            return _string

        if byte_data == 0x75:  # u
            return hex_machine(on_char_code)

    def on_char_code(char_code):
        nonlocal string
        string += chr(char_code)
        return _string

    return _string


# Nestable state machine for UTF-8 Decoding.
def utf8_machine(byte_data, emit):
    left = 0
    num = 0

    def _utf8(byte_data):
        nonlocal num, left
        if (byte_data & 0xc0) != 0x80:
            raise Exception("Invalid byte in UTF-8 character: 0x" + format(byte_data, "x"))

        left = left - 1

        num |= (byte_data & 0x3f) << (left * 6)
        if left:
            return _utf8
        return emit(num)

    if 0xc0 <= byte_data < 0xe0:  # 2-byte UTF-8 Character
        left = 1
        num = (byte_data & 0x1f) << 6
        return _utf8

    if 0xe0 <= byte_data < 0xf0:  # 3-byte UTF-8 Character
        left = 2
        num = (byte_data & 0xf) << 12
        return _utf8

    if 0xf0 <= byte_data < 0xf8:  # 4-byte UTF-8 Character
        left = 3
        num = (byte_data & 0x07) << 18
        return _utf8

    raise Exception("Invalid byte in UTF-8 string: 0x" + str(byte_data))


# Nestable state machine for hex escaped characters
def hex_machine(emit):
    left = 4
    num = 0

    def _hex(byte_data):
        nonlocal num, left

        if 0x30 <= byte_data < 0x40:
            i = byte_data - 0x30
        elif 0x61 <= byte_data <= 0x66:
            i = byte_data - 0x57
        elif 0x41 <= byte_data <= 0x46:
            i = byte_data - 0x37
        else:
            raise Exception("Expected hex char in string hex escape")

        left -= 1
        num |= i << (left * 4)

        if left:
            return _hex
        return emit(num)

    return _hex


def number_machine(byte_data, emit):
    sign = 1
    number = 0
    decimal = 0  # fractional digits accumulated as an integer
    decimal_places = 0  # number of fractional digits seen so far
    esign = 1
    exponent = 0

    def _mid(byte_data):
        if byte_data == 0x2e:  # .
            return _decimal

        return _later(byte_data)

    def _number(byte_data):
        nonlocal number
        if 0x30 <= byte_data < 0x3a:  # 0-9
            number = number * 10 + (byte_data - 0x30)
            return _number

        return _mid(byte_data)

    def _start(byte_data):
        if byte_data == 0x30:
            return _mid

        if 0x30 < byte_data < 0x3a:  # 1-9
            return _number(byte_data)

        raise Exception("Invalid number: 0x" + format(byte_data, "x"))

    def _decimal(byte_data):
        nonlocal decimal, decimal_places
        if 0x30 <= byte_data < 0x3a:  # 0-9
            decimal = decimal * 10 + (byte_data - 0x30)
            decimal_places += 1
            return _decimal

        return _later(byte_data)

    def _later(byte_data):
        if byte_data == 0x45 or byte_data == 0x65:  # E e
            return _esign

        return _done(byte_data)

    def _esign(byte_data):
        nonlocal esign
        if byte_data == 0x2b:  # +
            return _exponent

        if byte_data == 0x2d:  # -
            esign = -1
            return _exponent

        return _exponent(byte_data)

    def _exponent(byte_data):
        nonlocal exponent
        if 0x30 <= byte_data < 0x3a:  # 0-9
            exponent = exponent * 10 + (byte_data - 0x30)
            return _exponent

        return _done(byte_data)

    def _done(byte_data):
        value = number
        if decimal_places:
            # Scale the accumulated fractional digits, e.g. 25 -> 0.25
            value += decimal / 10 ** decimal_places
        value *= sign
        if exponent:
            value *= math.pow(10, esign * exponent)

        return emit(value, byte_data)

    # The sign check must come after all the state functions are defined;
    # returning earlier would leave the later ones unbound when called.
    if byte_data == 0x2d:  # -
        sign = -1
        return _start

    return _start(byte_data)


def array_machine(emit):
    array_data = []

    def _array(byte_data):
        if byte_data == 0x5d:  # ]
            return emit(array_data)

        return json_machine(on_value, _comma)(byte_data)

    def on_value(value):
        array_data.append(value)

    def _comma(byte_data):
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _comma  # Ignore whitespace

        if byte_data == 0x2c:  # ,
            return json_machine(on_value, _comma)

        if byte_data == 0x5d:  # ]
            return emit(array_data)

        raise Exception("Unexpected byte: 0x" + format(byte_data, "02x") + " in array body")

    return _array


def object_machine(emit):
    object_data = {}
    key = None

    def _object(byte_data):
        if byte_data == 0x7d:  # }
            return emit(object_data)

        return _key(byte_data)

    def _key(byte_data):
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _object  # Ignore whitespace

        if byte_data == 0x22:
            return string_machine(on_key)

        raise Exception("Unexpected byte: 0x" + format(byte_data, "02x"))

    def on_key(result):
        nonlocal key
        key = result
        return _colon

    def _colon(byte_data):
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _colon  # Ignore whitespace

        if byte_data == 0x3a:  # :
            return json_machine(on_value, _comma)

        raise Exception("Unexpected byte: 0x" + format(byte_data, "02x"))

    def on_value(value):
        object_data[key] = value

    def _comma(byte_data):
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _comma  # Ignore whitespace

        if byte_data == 0x2c:  # ,
            return _key

        if byte_data == 0x7d:  # }
            return emit(object_data)

        raise Exception("Unexpected byte: 0x" + format(byte_data, "02x"))

    return _object

I believe a better way of doing this is to use a state machine. Below is sample code that I worked out by converting the NodeJS code at the link below to Python 3 (it uses the nonlocal keyword, which is only available in Python 3, so that version won't work on Python 2).

Edit-1: Updated the code and made it compatible with Python 2

Edit-2: Added a Python 3 only version as well

https://gist.github.com/creationix/5992451
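
The pattern behind the listings that follow is that every state is a function taking one byte and returning the function that should handle the next byte. As a rough sketch of just that pattern, here is a toy machine that only understands whitespace-separated non-negative integers. The name int_machine is hypothetical and is not part of the gist or of the code in this answer:

```python
def int_machine(emit):
    """Hypothetical toy machine: emits each whitespace-separated integer."""
    value = 0

    def _idle(byte):
        if byte in (0x09, 0x0a, 0x0d, 0x20):  # Whitespace between numbers
            return _idle
        return _digits(byte)

    def _digits(byte):
        nonlocal value
        if 0x30 <= byte <= 0x39:  # 0-9
            value = value * 10 + (byte - 0x30)
            return _digits
        if byte in (0x09, 0x0a, 0x0d, 0x20):  # Whitespace ends the number
            emit(value)
            value = 0
            return _idle
        raise ValueError("Unexpected byte: 0x%02x" % byte)

    return _idle


found = []
state = int_machine(found.append)
for byte in b"12 7\n300 ":
    state = state(byte)  # Each call returns the next state
print(found)
```

Because the whole parser state lives in the returned function and its closure variables, the input can arrive one byte at a time from a file or socket without buffering a complete document.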

Python 3 only version

# A streaming byte oriented JSON parser.  Feed it a single byte at a time and
# it will emit complete objects as it comes across them.  Whitespace within and
# between objects is ignored.  This means it can parse newline delimited JSON.
import math


def json_machine(emit, next_func=None):
    def _value(byte_data):
        if not byte_data:
            return

        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _value  # Ignore whitespace

        if byte_data == 0x22:  # "
            return string_machine(on_value)

        if byte_data == 0x2d or (0x30 <= byte_data < 0x40):  # - or 0-9
            return number_machine(byte_data, on_number)

        if byte_data == 0x7b:  # {
            return object_machine(on_value)

        if byte_data == 0x5b:  # [
            return array_machine(on_value)

        if byte_data == 0x74:  # t
            return constant_machine(TRUE, True, on_value)

        if byte_data == 0x66:  # f
            return constant_machine(FALSE, False, on_value)

        if byte_data == 0x6e:  # n
            return constant_machine(NULL, None, on_value)

        if next_func == _value:
            raise Exception("Unexpected 0x" + format(byte_data, "02x"))

        return next_func(byte_data)

    def on_value(value):
        emit(value)
        return next_func

    def on_number(number, byte):
        emit(number)
        return _value(byte)

    next_func = next_func or _value
    return _value


TRUE = [0x72, 0x75, 0x65]
FALSE = [0x61, 0x6c, 0x73, 0x65]
NULL = [0x75, 0x6c, 0x6c]


def constant_machine(bytes_data, value, emit):
    i = 0
    length = len(bytes_data)

    def _constant(byte_data):
        nonlocal i
        if byte_data != bytes_data[i]:
            i += 1
            raise Exception("Unexpected 0x" + format(byte_data, "02x"))

        i += 1
        if i < length:
            return _constant
        return emit(value)

    return _constant


def string_machine(emit):
    string = ""

    def _string(byte_data):
        nonlocal string

        if byte_data == 0x22:  # "
            return emit(string)

        if byte_data == 0x5c:  # \
            return _escaped_string

        if byte_data & 0x80:  # UTF-8 handling
            return utf8_machine(byte_data, on_char_code)

        if byte_data < 0x20:  # ASCII control character
            raise Exception("Unexpected control character: 0x" + format(byte_data, "02x"))

        string += chr(byte_data)
        return _string

    def _escaped_string(byte_data):
        nonlocal string

        if byte_data == 0x22 or byte_data == 0x5c or byte_data == 0x2f:  # " \ /
            string += chr(byte_data)
            return _string

        if byte_data == 0x62:  # b
            string += "\b"
            return _string

        if byte_data == 0x66:  # f
            string += "\f"
            return _string

        if byte_data == 0x6e:  # n
            string += "\n"
            return _string

        if byte_data == 0x72:  # r
            string += "\r"
            return _string

        if byte_data == 0x74:  # t
            string += "\t"
            return _string

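        if byte_data == 0x75:  # u
            return hex_machine(on_char_code)

        # Any other byte is not a valid JSON escape
        raise Exception("Invalid escape character: 0x" + format(byte_data, "02x"))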

    def on_char_code(char_code):
        nonlocal string
        string += chr(char_code)
        return _string

    return _string


# Nestable state machine for UTF-8 Decoding.
def utf8_machine(byte_data, emit):
    left = 0
    num = 0

    def _utf8(byte_data):
        nonlocal num, left
        if (byte_data & 0xc0) != 0x80:
            raise Exception("Invalid byte in UTF-8 character: 0x" + format(byte_data, "02x"))

        left = left - 1

        num |= (byte_data & 0x3f) << (left * 6)
        if left:
            return _utf8
        return emit(num)

    if 0xc0 <= byte_data < 0xe0:  # 2-byte UTF-8 Character
        left = 1
        num = (byte_data & 0x1f) << 6
        return _utf8

    if 0xe0 <= byte_data < 0xf0:  # 3-byte UTF-8 Character
        left = 2
        num = (byte_data & 0xf) << 12
        return _utf8

    if 0xf0 <= byte_data < 0xf8:  # 4-byte UTF-8 Character
        left = 3
        num = (byte_data & 0x07) << 18
        return _utf8

    raise Exception("Invalid byte in UTF-8 string: 0x" + format(byte_data, "02x"))


# Nestable state machine for hex escaped characters
def hex_machine(emit):
    left = 4
    num = 0

    def _hex(byte_data):
        nonlocal num, left

        if 0x30 <= byte_data < 0x40:
            i = byte_data - 0x30
        elif 0x61 <= byte_data <= 0x66:
            i = byte_data - 0x57
        elif 0x41 <= byte_data <= 0x46:
            i = byte_data - 0x37
        else:
            raise Exception("Expected hex char in string hex escape")

        left -= 1
        num |= i << (left * 4)

        if left:
            return _hex
        return emit(num)

    return _hex


def number_machine(byte_data, emit):
    sign = 1
    number = 0
    decimal = 0
    esign = 1
    exponent = 0

    def _mid(byte_data):
        if byte_data == 0x2e:  # .
            return _decimal

        return _later(byte_data)

    def _number(byte_data):
        nonlocal number
        if 0x30 <= byte_data < 0x40:
            number = number * 10 + (byte_data - 0x30)
            return _number

        return _mid(byte_data)

    def _start(byte_data):
        if byte_data == 0x30:
            return _mid

        if 0x30 < byte_data < 0x40:
            return _number(byte_data)

        raise Exception("Invalid number: 0x" + format(byte_data, "02x"))

    decimal_digits = 0  # How many fractional digits have been consumed

    def _decimal(byte_data):
        nonlocal decimal, decimal_digits
        if 0x30 <= byte_data < 0x40:
            # Accumulate the fraction as an integer; it is scaled once in _done
            decimal = decimal * 10 + (byte_data - 0x30)
            decimal_digits += 1
            return _decimal

        return _later(byte_data)

    def _later(byte_data):
        if byte_data == 0x45 or byte_data == 0x65:  # E e
            return _esign

        return _done(byte_data)

    def _esign(byte_data):
        nonlocal esign
        if byte_data == 0x2b:  # +
            return _exponent

        if byte_data == 0x2d:  # -
            esign = -1
            return _exponent

        return _exponent(byte_data)

    def _exponent(byte_data):
        nonlocal exponent
        if 0x30 <= byte_data < 0x40:
            exponent = exponent * 10 + (byte_data - 0x30)
            return _exponent

        return _done(byte_data)

    def _done(byte_data):
        value = number
        if decimal_digits:
            value += decimal / 10 ** decimal_digits
        value *= sign
        if exponent:
            value *= math.pow(10, esign * exponent)

        return emit(value, byte_data)

    # Check for the sign only after every state function is defined;
    # returning earlier would leave the closures above unbound.
    if byte_data == 0x2d:  # -
        sign = -1
        return _start

    return _start(byte_data)


def array_machine(emit):
    array_data = []

    def _array(byte_data):
        if byte_data == 0x5d:  # ]
            return emit(array_data)

        return json_machine(on_value, _comma)(byte_data)

    def on_value(value):
        array_data.append(value)

    def _comma(byte_data):
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _comma  # Ignore whitespace

        if byte_data == 0x2c:  # ,
            return json_machine(on_value, _comma)

        if byte_data == 0x5d:  # ]
            return emit(array_data)

        raise Exception("Unexpected byte: 0x" + format(byte_data, "02x") + " in array body")

    return _array


def object_machine(emit):
    object_data = {}
    key = None

    def _object(byte_data):
        if byte_data == 0x7d:  # }
            return emit(object_data)

        return _key(byte_data)

    def _key(byte_data):
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _object  # Ignore whitespace

        if byte_data == 0x22:
            return string_machine(on_key)

        raise Exception("Unexpected byte: 0x" + format(byte_data, "02x"))

    def on_key(result):
        nonlocal key
        key = result
        return _colon

    def _colon(byte_data):
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _colon  # Ignore whitespace

        if byte_data == 0x3a:  # :
            return json_machine(on_value, _comma)

        raise Exception("Unexpected byte: 0x" + format(byte_data, "02x"))

    def on_value(value):
        object_data[key] = value

    def _comma(byte_data):
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _comma  # Ignore whitespace

        if byte_data == 0x2c:  # ,
            return _key

        if byte_data == 0x7d:  # }
            return emit(object_data)

        raise Exception("Unexpected byte: 0x" + format(byte_data, "02x"))

    return _object

Python 2 compatible version

# A streaming byte oriented JSON parser.  Feed it a single byte at a time and
# it will emit complete objects as it comes across them.  Whitespace within and
# between objects is ignored.  This means it can parse newline delimited JSON.
import math


def json_machine(emit, next_func=None):
    def _value(byte_data):
        if not byte_data:
            return

        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _value  # Ignore whitespace

        if byte_data == 0x22:  # "
            return string_machine(on_value)

        if byte_data == 0x2d or (0x30 <= byte_data < 0x40):  # - or 0-9
            return number_machine(byte_data, on_number)

        if byte_data == 0x7b:  # {
            return object_machine(on_value)

        if byte_data == 0x5b:  # [
            return array_machine(on_value)

        if byte_data == 0x74:  # t
            return constant_machine(TRUE, True, on_value)

        if byte_data == 0x66:  # f
            return constant_machine(FALSE, False, on_value)

        if byte_data == 0x6e:  # n
            return constant_machine(NULL, None, on_value)

        if next_func == _value:
            raise Exception("Unexpected 0x" + format(byte_data, "02x"))

        return next_func(byte_data)

    def on_value(value):
        emit(value)
        return next_func

    def on_number(number, byte):
        emit(number)
        return _value(byte)

    next_func = next_func or _value
    return _value


TRUE = [0x72, 0x75, 0x65]
FALSE = [0x61, 0x6c, 0x73, 0x65]
NULL = [0x75, 0x6c, 0x6c]


def constant_machine(bytes_data, value, emit):
    local_data = {"i": 0, "length": len(bytes_data)}

    def _constant(byte_data):
        # nonlocal i, length
        if byte_data != bytes_data[local_data["i"]]:
            local_data["i"] += 1
            raise Exception("Unexpected 0x" + format(byte_data, "02x"))

        local_data["i"] += 1

        if local_data["i"] < local_data["length"]:
            return _constant
        return emit(value)

    return _constant


def string_machine(emit):
    local_data = {"string": ""}

    def _string(byte_data):
        # nonlocal string

        if byte_data == 0x22:  # "
            return emit(local_data["string"])

        if byte_data == 0x5c:  # \
            return _escaped_string

        if byte_data & 0x80:  # UTF-8 handling
            return utf8_machine(byte_data, on_char_code)

        if byte_data < 0x20:  # ASCII control character
            raise Exception("Unexpected control character: 0x" + format(byte_data, "02x"))

        local_data["string"] += chr(byte_data)
        return _string

    def _escaped_string(byte_data):
        # nonlocal string

        if byte_data == 0x22 or byte_data == 0x5c or byte_data == 0x2f:  # " \ /
            local_data["string"] += chr(byte_data)
            return _string

        if byte_data == 0x62:  # b
            local_data["string"] += "\b"
            return _string

        if byte_data == 0x66:  # f
            local_data["string"] += "\f"
            return _string

        if byte_data == 0x6e:  # n
            local_data["string"] += "\n"
            return _string

        if byte_data == 0x72:  # r
            local_data["string"] += "\r"
            return _string

        if byte_data == 0x74:  # t
            local_data["string"] += "\t"
            return _string

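        if byte_data == 0x75:  # u
            return hex_machine(on_char_code)

        # Any other byte is not a valid JSON escape
        raise Exception("Invalid escape character: 0x" + format(byte_data, "02x"))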

    def on_char_code(char_code):
        # nonlocal string
        local_data["string"] += chr(char_code)
        return _string

    return _string


# Nestable state machine for UTF-8 Decoding.
def utf8_machine(byte_data, emit):
    local_data = {"left": 0, "num": 0}

    def _utf8(byte_data):
        # nonlocal num, left
        if (byte_data & 0xc0) != 0x80:
            raise Exception("Invalid byte in UTF-8 character: 0x" + format(byte_data, "02x"))

        local_data["left"] -= 1

        local_data["num"] |= (byte_data & 0x3f) << (local_data["left"] * 6)
        if local_data["left"]:
            return _utf8
        return emit(local_data["num"])

    if 0xc0 <= byte_data < 0xe0:  # 2-byte UTF-8 Character
        local_data["left"] = 1
        local_data["num"] = (byte_data & 0x1f) << 6
        return _utf8

    if 0xe0 <= byte_data < 0xf0:  # 3-byte UTF-8 Character
        local_data["left"] = 2
        local_data["num"] = (byte_data & 0xf) << 12
        return _utf8

    if 0xf0 <= byte_data < 0xf8:  # 4-byte UTF-8 Character
        local_data["left"] = 3
        local_data["num"] = (byte_data & 0x07) << 18
        return _utf8

    raise Exception("Invalid byte in UTF-8 string: 0x" + format(byte_data, "02x"))


# Nestable state machine for hex escaped characters
def hex_machine(emit):
    local_data = {"left": 4, "num": 0}

    def _hex(byte_data):
        # nonlocal num, left
        i = 0  # Parse the hex byte
        if 0x30 <= byte_data < 0x40:
            i = byte_data - 0x30
        elif 0x61 <= byte_data <= 0x66:
            i = byte_data - 0x57
        elif 0x41 <= byte_data <= 0x46:
            i = byte_data - 0x37
        else:
            raise Exception("Expected hex char in string hex escape")

        local_data["left"] -= 1
        local_data["num"] |= i << (local_data["left"] * 4)

        if local_data["left"]:
            return _hex
        return emit(local_data["num"])

    return _hex


def number_machine(byte_data, emit):
    local_data = {"sign": 1, "number": 0, "decimal": 0, "esign": 1, "exponent": 0}

    def _mid(byte_data):
        if byte_data == 0x2e:  # .
            return _decimal

        return _later(byte_data)

    def _number(byte_data):
        # nonlocal number
        if 0x30 <= byte_data < 0x40:
            local_data["number"] = local_data["number"] * 10 + (byte_data - 0x30)
            return _number

        return _mid(byte_data)

    def _start(byte_data):
        if byte_data == 0x30:
            return _mid

        if 0x30 < byte_data < 0x40:
            return _number(byte_data)

        raise Exception("Invalid number: 0x" + format(byte_data, "02x"))

    local_data["decimal_digits"] = 0  # How many fractional digits have been consumed

    def _decimal(byte_data):
        # nonlocal decimal, decimal_digits
        if 0x30 <= byte_data < 0x40:
            # Accumulate the fraction as an integer; it is scaled once in _done
            local_data["decimal"] = local_data["decimal"] * 10 + (byte_data - 0x30)
            local_data["decimal_digits"] += 1
            return _decimal

        return _later(byte_data)

    def _later(byte_data):
        if byte_data == 0x45 or byte_data == 0x65:  # E e
            return _esign

        return _done(byte_data)

    def _esign(byte_data):
        # nonlocal esign
        if byte_data == 0x2b:  # +
            return _exponent

        if byte_data == 0x2d:  # -
            local_data["esign"] = -1
            return _exponent

        return _exponent(byte_data)

    def _exponent(byte_data):
        # nonlocal exponent
        if 0x30 <= byte_data < 0x40:
            local_data["exponent"] = local_data["exponent"] * 10 + (byte_data - 0x30)
            return _exponent

        return _done(byte_data)

    def _done(byte_data):
        value = local_data["number"]
        if local_data["decimal_digits"]:
            # float() keeps this a true division on Python 2 as well
            value += float(local_data["decimal"]) / 10 ** local_data["decimal_digits"]
        value *= local_data["sign"]
        if local_data["exponent"]:
            value *= math.pow(10, local_data["esign"] * local_data["exponent"])

        return emit(value, byte_data)

    # Check for the sign only after every state function is defined;
    # returning earlier would leave the closures above unbound.
    if byte_data == 0x2d:  # -
        local_data["sign"] = -1
        return _start

    return _start(byte_data)


def array_machine(emit):
    local_data = {"array_data": []}

    def _array(byte_data):
        if byte_data == 0x5d:  # ]
            return emit(local_data["array_data"])

        return json_machine(on_value, _comma)(byte_data)

    def on_value(value):
        # nonlocal array_data
        local_data["array_data"].append(value)

    def _comma(byte_data):
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _comma  # Ignore whitespace

        if byte_data == 0x2c:  # ,
            return json_machine(on_value, _comma)

        if byte_data == 0x5d:  # ]
            return emit(local_data["array_data"])

        raise Exception("Unexpected byte: 0x" + format(byte_data, "02x") + " in array body")

    return _array


def object_machine(emit):
    local_data = {"object_data": {}, "key": ""}

    def _object(byte_data):
        # nonlocal object_data, key
        if byte_data == 0x7d:  # }
            return emit(local_data["object_data"])

        return _key(byte_data)

    def _key(byte_data):
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _object  # Ignore whitespace

        if byte_data == 0x22:
            return string_machine(on_key)

        raise Exception("Unexpected byte: 0x" + format(byte_data, "02x"))

    def on_key(result):
        # nonlocal object_data, key
        local_data["key"] = result
        return _colon

    def _colon(byte_data):
        # nonlocal object_data, key
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _colon  # Ignore whitespace

        if byte_data == 0x3a:  # :
            return json_machine(on_value, _comma)

        raise Exception("Unexpected byte: 0x" + format(byte_data, "02x"))

    def on_value(value):
        # nonlocal object_data, key
        local_data["object_data"][local_data["key"]] = value

    def _comma(byte_data):
        # nonlocal object_data
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _comma  # Ignore whitespace

        if byte_data == 0x2c:  # ,
            return _key

        if byte_data == 0x7d:  # }
            return emit(local_data["object_data"])

        raise Exception("Unexpected byte: 0x" + format(byte_data, "02x"))

    return _object

Testing it

if __name__ == "__main__":
    test_json = """[1,2,"3"] {"name": 
    "tarun"} 1 2 
    3 [{"name":"a", 
    "data": [1,
    null,2]}]
"""
    def found_json(data):
        print(data)

    state = json_machine(found_json)

    for char in test_json:
        state = state(ord(char))

The output is

[1, 2, '3']
{'name': 'tarun'}
1
2
3
[{'name': 'a', 'data': [1, None, 2]}]

Answer 6

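I'd like to offer a solution. The key idea is to "try" to decode: if decoding fails, feed in more input; otherwise, use the offset information to prepare for the next decode.

However, the current json module does not tolerate whitespace at the start of the string to be decoded, so I have to strip it off.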

import sys
import json

def iterload(file):
    buffer = ""
    dec = json.JSONDecoder()
    for line in file:         
        buffer = buffer.strip(" \n\r\t") + line.strip(" \n\r\t")
        while True:
            try:
                r = dec.raw_decode(buffer)
            except ValueError:  # incomplete value: read another line
                break
            yield r[0]
            buffer = buffer[r[1]:].strip(" \n\r\t")


for o in iterload(sys.stdin):
    print("Working on a", type(o),  o)

=========================我已经测试了多个txt文件,并且工作正常。(in1.txt)

{"foo": ["bar", "baz"]
}
 1 2 [
  ]  4
{"foo1": ["bar1", {"foo2":{"A":1, "B":3}, "DDD":4}]
}
 5   6

(in2.txt)

{"foo"
: ["bar",
  "baz"]
  } 
1 2 [
] 4 5 6

(in.txt,您的首字母)

{"foo": ["bar", "baz"]} 1 2 [] 4 5 6

(本尼迪克特测试用例的输出)

python test.py < in.txt
('Working on a', <type 'list'>, [u'hello'])
('Working on a', <type 'dict'>, {u'goodbye': 1})
('Working on a', <type 'int'>, 1)
('Working on a', <type 'int'>, 2)
('Working on a', <type 'dict'>, {})
('Working on a', <type 'int'>, 2)
('Working on a', <type 'int'>, 9)
('Working on a', <type 'int'>, 78)
('Working on a', <type 'int'>, 4)
('Working on a', <type 'int'>, 5)
('Working on a', <type 'dict'>, {u'animals': [u'dog', u'lots of mice', u'cat']})

I’d like to provide a solution. The key idea is to “try” to decode: if decoding fails, feed the decoder more input; otherwise, use the returned offset to prepare the next decode.

However, the json module’s raw_decode does not tolerate leading whitespace in the string being decoded, so I have to strip it off.

import sys
import json

def iterload(file):
    buffer = ""
    dec = json.JSONDecoder()
    for line in file:         
        buffer = buffer.strip(" \n\r\t") + line.strip(" \n\r\t")
        while True:
            try:
                r = dec.raw_decode(buffer)
            except ValueError:  # incomplete value: read another line
                break
            yield r[0]
            buffer = buffer[r[1]:].strip(" \n\r\t")


for o in iterload(sys.stdin):
    print("Working on a", type(o),  o)

I have tested it with several text files, and it works fine.

(in1.txt)

{"foo": ["bar", "baz"]
}
 1 2 [
  ]  4
{"foo1": ["bar1", {"foo2":{"A":1, "B":3}, "DDD":4}]
}
 5   6

(in2.txt)

{"foo"
: ["bar",
  "baz"]
  } 
1 2 [
] 4 5 6

(in.txt, your original example)

{"foo": ["bar", "baz"]} 1 2 [] 4 5 6

(output for Benedict’s testcase)

python test.py < in.txt
('Working on a', <type 'list'>, [u'hello'])
('Working on a', <type 'dict'>, {u'goodbye': 1})
('Working on a', <type 'int'>, 1)
('Working on a', <type 'int'>, 2)
('Working on a', <type 'dict'>, {})
('Working on a', <type 'int'>, 2)
('Working on a', <type 'int'>, 9)
('Working on a', <type 'int'>, 78)
('Working on a', <type 'int'>, 4)
('Working on a', <type 'int'>, 5)
('Working on a', <type 'dict'>, {u'animals': [u'dog', u'lots of mice', u'cat']})
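
For reference, the same try-then-feed-more idea can be sketched for Python 3 roughly as follows. This is an adaptation rather than the answer's exact code: it catches only json.JSONDecodeError, it joins chunks with a space so tokens on adjacent lines cannot merge, and the name iter_json_values and the io.StringIO input are assumptions made for illustration.

```python
import io
import json


def iter_json_values(lines):
    """Yield JSON values from an iterable of text lines.

    A value may span several lines, and one line may hold several values.
    """
    decoder = json.JSONDecoder()
    buffer = ""
    for line in lines:
        # Keep a space between chunks so tokens on adjacent lines never merge.
        buffer = (buffer + " " + line).strip()
        while buffer:
            try:
                value, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # incomplete value: wait for the next line
            yield value
            buffer = buffer[end:].lstrip()


sample = io.StringIO('{"foo": ["bar",\n "baz"]} 1 2 [\n] 4 5 6\n')
values = list(iter_json_values(sample))
print(values)
```

Malformed input is never reported by this sketch; it simply stays in the buffer, so a production version would probably want to check for leftover text once the input ends.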

回答 7

这是我的:

import simplejson as json
from simplejson import JSONDecodeError
class StreamJsonListLoader():
    """
    When you have a big JSON file containing a list, such as

    [{
        ...
    },
    {
        ...
    },
    {
        ...
    },
    ...
    ]

    And it's too big to be practically loaded into memory and parsed by json.load,
    This class comes to the rescue. It lets you lazy-load the large json list.
    """

    def __init__(self, filename_or_stream):
        if type(filename_or_stream) == str:
            self.stream = open(filename_or_stream)
        else:
            self.stream = filename_or_stream

        if not self.stream.read(1) == '[':
            raise NotImplementedError('Only JSON-streams of lists (that start with a [) are supported.')

    def __iter__(self):
        return self

    def next(self):
        read_buffer = self.stream.read(1)
        while True:
            try:
                json_obj = json.loads(read_buffer)

                if not self.stream.read(1) in [',',']']:
                    raise Exception('JSON seems to be malformed: object is not followed by comma (,) or end of list (]).')
                return json_obj
            except JSONDecodeError:
                next_char = self.stream.read(1)
                read_buffer += next_char
                while next_char != '}':
                    next_char = self.stream.read(1)
                    if next_char == '':
                        raise StopIteration
                    read_buffer += next_char

Here’s mine:

import simplejson as json
from simplejson import JSONDecodeError
class StreamJsonListLoader():
    """
    When you have a big JSON file containing a list, such as

    [{
        ...
    },
    {
        ...
    },
    {
        ...
    },
    ...
    ]

    And it's too big to be practically loaded into memory and parsed by json.load,
    This class comes to the rescue. It lets you lazy-load the large json list.
    """

    def __init__(self, filename_or_stream):
        if type(filename_or_stream) == str:
            self.stream = open(filename_or_stream)
        else:
            self.stream = filename_or_stream

        if not self.stream.read(1) == '[':
            raise NotImplementedError('Only JSON-streams of lists (that start with a [) are supported.')

    def __iter__(self):
        return self

    def next(self):
        read_buffer = self.stream.read(1)
        while True:
            try:
                json_obj = json.loads(read_buffer)

                if not self.stream.read(1) in [',',']']:
                    raise Exception('JSON seems to be malformed: object is not followed by comma (,) or end of list (]).')
                return json_obj
            except JSONDecodeError:
                next_char = self.stream.read(1)
                read_buffer += next_char
                while next_char != '}':
                    next_char = self.stream.read(1)
                    if next_char == '':
                        raise StopIteration
                    read_buffer += next_char

回答 8

我使用@wuilang的优雅解决方案。简单的方法-读取字节,尝试解码,读取字节,尝试解码,…-起作用了,但不幸的是,它非常慢。

就我而言,我试图从文件中读取具有相同对象类型的“漂亮打印” JSON对象。这使我可以优化方法。我可以逐行读取文件,仅当找到包含“}”的行时才解码:

def iterload(stream):
    buf = ""
    dec = json.JSONDecoder()
    for line in stream:
        line = line.rstrip()
        buf = buf + line
        if line == "}":
            yield dec.raw_decode(buf)[0]  # raw_decode returns (value, end_index)
            buf = ""

如果您碰巧使用的是每行一个紧凑的JSON,该字符串在字符串文字中转义了换行符,那么您可以放心地简化此方法:

def iterload(stream):
    dec = json.JSONDecoder()
    for line in stream:
        yield dec.raw_decode(line)[0]  # raw_decode returns (value, end_index)

显然,这些简单的方法仅适用于非常特定的JSON。但是,如果这些假设成立,则这些解决方案将正确,快速地工作。

I used @wuilang’s elegant solution. The simple approach — read a byte, try to decode, read a byte, try to decode, … — worked, but unfortunately it was very slow.

In my case, I was trying to read “pretty-printed” JSON objects of the same object type from a file. This allowed me to optimize the approach; I could read the file line-by-line, only decoding when I found a line that contained exactly “}”:

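import json

def iterload(stream):
    buf = ""
    dec = json.JSONDecoder()
    for line in stream:
        line = line.rstrip()
        buf = buf + line
        if line == "}":  # a bare closing brace ends one top-level object
            # raw_decode returns (value, end_index); yield only the value
            yield dec.raw_decode(buf)[0]
            buf = ""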

If you happen to be working with one-per-line compact JSON that escapes newlines in string literals, then you can safely simplify this approach even more:

def iterload(stream):
    dec = json.JSONDecoder()
    for line in stream:
        yield dec.raw_decode(line)[0]  # raw_decode returns (value, end_index)

Obviously, these simple approaches only work for very specific kinds of JSON. However, if these assumptions hold, these solutions work correctly and quickly.
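
For the one-value-per-line case, a minimal sketch could also use json.loads on each line, which returns the decoded value directly and rejects trailing garbage on the line. The function name iter_json_lines and the sample input are assumptions for illustration.

```python
import io
import json


def iter_json_lines(stream):
    """Yield one decoded value per non-empty line (one JSON document per line)."""
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


stream = io.StringIO('{"a": 1}\n\n{"b": [2, 3]}\n')
records = list(iter_json_lines(stream))
print(records)
```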


回答 9

如果使用json.JSONDecoder实例,则可以使用raw_decode成员函数。它返回JSON值的python表示形式的元组和解析停止位置的索引。这使得切片(或在流对象中搜索)剩余的JSON值变得容易。我对多余的while循环不满意,因为它会跳过输入中不同JSON值之间的空白,但是我认为它可以完成工作。

import json

def yield_multiple_value(f):
    '''
    parses multiple JSON values from a file.
    '''
    vals_str = f.read()
    decoder = json.JSONDecoder()
    try:
        nread = 0
        while nread < len(vals_str):
            val, n = decoder.raw_decode(vals_str[nread:])
            nread += n
            # Skip over whitespace because of bug, below.
            while nread < len(vals_str) and vals_str[nread].isspace():
                nread += 1
            yield val
    except json.JSONDecodeError as e:
        pass
    return

下一个版本要短得多,它将占用已经解析的字符串部分。似乎由于某种原因,当字符串中的第一个字符为空格时,第二次调用json.JSONDecoder.raw_decode()似乎失败,这也是我跳过上述while循环中的空格的原因…

def yield_multiple_value(f):
    '''
    parses multiple JSON values from a file.
    '''
    vals_str = f.read()
    decoder = json.JSONDecoder()
    while vals_str:
        val, n = decoder.raw_decode(vals_str)
        #remove the read characters from the start.
        vals_str = vals_str[n:]
        # remove leading white space because a second call to decoder.raw_decode()
        # fails when the string starts with whitespace, and
        # I don't understand why...
        vals_str = vals_str.lstrip()
        yield val
    return

在有关json.JSONDecoder类的文档中,raw_decode https://docs.python.org/3/library/json.html#encoders-and-decoders方法包含以下内容:

这可用于从结尾可能有无关数据的字符串中解码JSON文档。

而且这些无关的数据很容易成为另一个JSON值。换句话说,在编写该方法时可能会牢记此目的。

使用上层函数的input.txt文件,我得到了原始问题中给出的示例输出。

If you use a json.JSONDecoder instance, you can use its raw_decode member function. It returns a tuple containing the Python representation of the JSON value and the index at which parsing stopped. This makes it easy to slice (or seek in a stream object) to the remaining JSON values. I’m not so happy about the extra while loop that skips over the whitespace between the different JSON values in the input, but in my opinion it gets the job done.

import json

def yield_multiple_value(f):
    '''
    parses multiple JSON values from a file.
    '''
    vals_str = f.read()
    decoder = json.JSONDecoder()
    try:
        nread = 0
        while nread < len(vals_str):
            val, n = decoder.raw_decode(vals_str[nread:])
            nread += n
            # Skip over whitespace because of bug, below.
            while nread < len(vals_str) and vals_str[nread].isspace():
                nread += 1
            yield val
    except json.JSONDecodeError as e:
        pass
    return

The next version is much shorter and consumes the part of the string that has already been parsed. For some reason, a second call to json.JSONDecoder.raw_decode() seems to fail when the first character in the string is whitespace; that is also why I skip over the whitespace in the while loop above …

def yield_multiple_value(f):
    '''
    parses multiple JSON values from a file.
    '''
    vals_str = f.read()
    decoder = json.JSONDecoder()
    while vals_str:
        val, n = decoder.raw_decode(vals_str)
        #remove the read characters from the start.
        vals_str = vals_str[n:]
        # remove leading white space because a second call to decoder.raw_decode()
        # fails when the string starts with whitespace, and
        # I don't understand why...
        vals_str = vals_str.lstrip()
        yield val
    return

In the documentation for the json.JSONDecoder class (https://docs.python.org/3/library/json.html#encoders-and-decoders), the description of the raw_decode method contains the following:

This can be used to decode a JSON document from a string that may have extraneous data at the end.

And this extraneous data can easily be another JSON value. In other words, the method may have been written with this purpose in mind.

Running the first function above on input.txt, I obtain the example output presented in the original question.
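
As far as I can tell, the whitespace behaviour is expected: raw_decode does not skip leading whitespace itself, while decode() skips it before delegating to raw_decode. The sketch below avoids re-slicing the string by passing a start index as the second argument of raw_decode; that argument exists in CPython's implementation but is not described in the documentation, so treat it as an assumption. The names iter_values and _WHITESPACE are made up for this example.

```python
import json
import re

# Matches any run of whitespace (possibly empty) at a given position.
_WHITESPACE = re.compile(r"\s*")


def iter_values(text):
    """Yield consecutive JSON values from one string without re-slicing it."""
    decoder = json.JSONDecoder()
    pos = _WHITESPACE.match(text, 0).end()
    while pos < len(text):
        # raw_decode returns the value and the absolute index where it ended.
        value, pos = decoder.raw_decode(text, pos)
        pos = _WHITESPACE.match(text, pos).end()
        yield value


result = list(iter_values(' {"foo": ["bar", "baz"]} 1 2 [] 4 5 6 '))
print(result)
```

Unlike the first version above, this sketch lets json.JSONDecodeError propagate, so malformed input is reported instead of silently ending the iteration.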


回答 10

您可以完全出于此目的使用https://pypi.org/project/json-stream-parser/

import sys
from json_stream_parser import load_iter
for obj in load_iter(sys.stdin):
    print(obj)

输出

{'foo': ['bar', 'baz']}
1
2
[]
4
5
6

You can use https://pypi.org/project/json-stream-parser/ for exactly that purpose.

import sys
from json_stream_parser import load_iter
for obj in load_iter(sys.stdin):
    print(obj)

output

{'foo': ['bar', 'baz']}
1
2
[]
4
5
6

如何在Django中序列化模型实例?

问题:如何在Django中序列化模型实例?

关于如何序列化模型QuerySet的文档很多,但是如何将模型实例的字段序列化为JSON?

There is a lot of documentation on how to serialize a Model QuerySet but how do you just serialize to JSON the fields of a Model Instance?


回答 0

您可以轻松地使用列表来包装所需的对象,而这正是Django序列化程序正确地序列化它所需要的,例如:

from django.core import serializers

# assuming obj is a model instance
serialized_obj = serializers.serialize('json', [ obj, ])

You can simply wrap the required object in a list; that is all the Django serializers need to serialize it correctly, e.g.:

from django.core import serializers

# assuming obj is a model instance
serialized_obj = serializers.serialize('json', [ obj, ])

回答 1

如果您要处理的模型实例列表是您最好的选择serializers.serialize(),那么它会完全满足您的需求。

但是,您要尝试序列化单个对象而不是对象的对象时会遇到问题list。这样,为了摆脱各种黑客攻击,只需使用Django即可model_to_dict(如果我没记错的serializers.serialize()话,也要依赖它):

from django.forms.models import model_to_dict

# assuming obj is your model instance
dict_obj = model_to_dict( obj )

现在,您只需要直接json.dumps调用即可将其序列化为json:

import json
serialized = json.dumps(dict_obj)

而已!:)

If you’re dealing with a list of model instances, the best you can do is use serializers.serialize(); it will fit your need perfectly.

However, you will run into an issue when trying to serialize a single object rather than a list of objects. To avoid the various hacks, just use Django’s model_to_dict (if I’m not mistaken, serializers.serialize() relies on it, too):

from django.forms.models import model_to_dict

# assuming obj is your model instance
dict_obj = model_to_dict( obj )

You now just need one straight json.dumps call to serialize it to json:

import json
serialized = json.dumps(dict_obj)

That’s it! :)
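
One caveat worth sketching: json.dumps raises TypeError for values such as dates or Decimal, which a model dictionary can easily contain. The example below uses a plain dict as a stand-in for the model_to_dict result (the field names are hypothetical) and a default hook (to_jsonable, also a made-up name) to convert those values.

```python
import datetime
import decimal
import json

# Stand-in for what model_to_dict(obj) might return; field names are hypothetical.
dict_obj = {
    "id": 1,
    "title": "Example",
    "price": decimal.Decimal("9.99"),
    "published": datetime.date(2020, 1, 21),
}


def to_jsonable(value):
    """Fallback used by json.dumps for types it cannot encode itself."""
    if isinstance(value, (datetime.date, datetime.datetime)):
        return value.isoformat()
    if isinstance(value, decimal.Decimal):
        return str(value)
    raise TypeError(f"{type(value).__name__} is not JSON serializable")


serialized = json.dumps(dict_obj, default=to_jsonable)
print(serialized)
```

Django's DjangoJSONEncoder, used in a later answer, is meant to cover the same kinds of values.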


回答 2

为了避免数组包装,请在返回响应之前将其删除:

import json
from django.core import serializers

def getObject(request, id):
    obj = MyModel.objects.get(pk=id)
    data = serializers.serialize('json', [obj,])
    struct = json.loads(data)
    data = json.dumps(struct[0])
    return HttpResponse(data, content_type='application/json')

我也发现了关于这个主题的这篇有趣的文章:

http://timsaylor.com/convert-django-model-instances-to-dictionaries

它使用django.forms.models.model_to_dict,它看起来像是完成这项工作的理想工具。

To avoid the array wrapper, remove it before you return the response:

import json
from django.core import serializers
from django.http import HttpResponse

def getObject(request, id):
    obj = MyModel.objects.get(pk=id)
    data = serializers.serialize('json', [obj,])
    struct = json.loads(data)
    data = json.dumps(struct[0])
    # content_type replaces the mimetype argument, which was removed in Django 1.7
    return HttpResponse(data, content_type='application/json')

I found this interesting post on the subject too:

http://timsaylor.com/convert-django-model-instances-to-dictionaries

It uses django.forms.models.model_to_dict, which looks like the perfect tool for the job.


回答 3

对此有一个很好的答案,我很惊讶没有提到它。仅需几行,您就可以处理日期,模型以及其他所有内容。

制作一个可以处理模型的自定义编码器:

from django.forms import model_to_dict
from django.core.serializers.json import DjangoJSONEncoder
from django.db.models import Model

class ExtendedEncoder(DjangoJSONEncoder):

    def default(self, o):

        if isinstance(o, Model):
            return model_to_dict(o)

        return super().default(o)

现在在使用json.dumps时使用它

json.dumps(data, cls=ExtendedEncoder)

现在,模型,日期和所有内容都可以序列化,而不必放在数组中或序列化和非序列化。您拥有的所有自定义内容都可以添加到default方法中。

您甚至可以通过以下方式使用Django的本地JsonResponse:

from django.http import JsonResponse

JsonResponse(data, encoder=ExtendedEncoder)

There is a good answer for this and I’m surprised it hasn’t been mentioned. With a few lines you can handle dates, models, and everything else.

Make a custom encoder that can handle models:

from django.forms import model_to_dict
from django.core.serializers.json import DjangoJSONEncoder
from django.db.models import Model

class ExtendedEncoder(DjangoJSONEncoder):

    def default(self, o):

        if isinstance(o, Model):
            return model_to_dict(o)

        return super().default(o)

Now use it when you use json.dumps

json.dumps(data, cls=ExtendedEncoder)

Now models, dates and everything can be serialized and it doesn’t have to be in an array or serialized and unserialized. Anything you have that is custom can just be added to the default method.

You can even use Django’s native JsonResponse this way:

from django.http import JsonResponse

JsonResponse(data, encoder=ExtendedEncoder)

回答 4

听起来您要问的是涉及序列化Django模型实例的数据结构以实现互操作性。其他张贴者是正确的:如果您希望将序列化表格与可以通过Django api查询数据库的python应用程序一起使用,则需要使用一个对象序列化一个查询集。另一方面,如果您需要的是在不接触数据库或不使用Django的情况下在其他地方重新添加模型实例的方法,则您需要做一些工作。

这是我的工作:

首先,我demjson用于转换。碰巧是我首先发现的,但可能不是最好的。我的实现方式取决于其功能之一,但其他转换器也应采用类似的方式。

其次,json_equivalent在可能需要序列化的所有模型上实现一个方法。这是的神奇方法demjson,但是无论您选择哪种实现,都可能要考虑一下。这个想法是,您返回一个可以直接转换为的对象json(即数组或字典)。如果您真的想自动执行此操作:

def json_equivalent(self):
    dictionary = {}
    for field in self._meta.get_all_field_names():
        dictionary[field] = self.__getattribute__(field)
    return dictionary

除非您具有完全平坦的数据结构(否ForeignKeys,数据库中只有数字和字符串,等等),否则这对您没有帮助。否则,您应该认真考虑实现此方法的正确方法。

第三,打电话给demjson.JSON.encode(instance)您,您便拥有了想要的东西。

It sounds like what you’re asking about involves serializing the data structure of a Django model instance for interoperability. The other posters are correct: if you wanted the serialized form to be used with a Python application that can query the database via Django’s API, then you would want to serialize a queryset with one object. If, on the other hand, what you need is a way to re-inflate the model instance somewhere else without touching the database or without using Django, then you have a little bit of work to do.

Here’s what I do:

First, I use demjson for the conversion. It happened to be what I found first, but it might not be the best. My implementation depends on one of its features, but there should be similar ways with other converters.

Second, implement a json_equivalent method on all models that you might need serialized. This is a magic method for demjson, but it’s probably something you’re going to want to think about no matter what implementation you choose. The idea is that you return an object that is directly convertible to json (i.e. an array or dictionary). If you really want to do this automatically:

def json_equivalent(self):
    dictionary = {}
    for field in self._meta.get_all_field_names():
        dictionary[field] = self.__getattribute__(field)
    return dictionary

This will not be helpful to you unless you have a completely flat data structure (no ForeignKeys, only numbers and strings in the database, etc.). Otherwise, you should seriously think about the right way to implement this method.

Third, call demjson.JSON.encode(instance) and you have what you want.


回答 5

如果您要问如何从模型中序列化一个对象,并且知道仅要在查询集中获取一个对象(例如,使用objects.get),则可以使用以下方法:

import django.core.serializers
import django.http
import models

def jsonExample(request,poll_id):
    s = django.core.serializers.serialize('json',[models.Poll.objects.get(id=poll_id)])
    # s is a string with [] around it, so strip them off
    o=s.strip("[]")
    return django.http.HttpResponse(o, content_type="application/json")

这将使您具有以下形式:

{"pk": 1, "model": "polls.poll", "fields": {"pub_date": "2013-06-27T02:29:38.284Z", "question": "What's up?"}}

If you’re asking how to serialize a single object from a model and you know you’re only going to get one object in the queryset (for instance, using objects.get), then use something like:

import django.core.serializers
import django.http
import models

def jsonExample(request,poll_id):
    s = django.core.serializers.serialize('json',[models.Poll.objects.get(id=poll_id)])
    # s is a string with [] around it, so strip them off
    o=s.strip("[]")
    return django.http.HttpResponse(o, content_type="application/json")

which would get you something of the form:

{"pk": 1, "model": "polls.poll", "fields": {"pub_date": "2013-06-27T02:29:38.284Z", "question": "What's up?"}}
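
An alternative to stripping the brackets as text is to decode the list and re-encode its first element, which also fails loudly if the list turns out to be empty. The sketch below uses a hand-written string as a stand-in for the serializer output, so the field values are assumptions.

```python
import json

# Stand-in for the string returned by serializers.serialize('json', [obj]).
s = '[{"pk": 1, "model": "polls.poll", "fields": {"question": "Pick [one]"}}]'

# Decode the one-element list, take its only element, and encode it again.
single = json.dumps(json.loads(s)[0])
print(single)
```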

回答 6

我通过向模型添加序列化方法解决了这个问题:

def toJSON(self):
    import simplejson
    return simplejson.dumps(dict([(attr, getattr(self, attr)) for attr in [f.name for f in self._meta.fields]]))

这是那些讨厌单线的冗长等效项:

def toJSON(self):
    fields = []
    for field in self._meta.fields:
        fields.append(field.name)

    d = {}
    for attr in fields:
        d[attr] = getattr(self, attr)

    import simplejson
    return simplejson.dumps(d)

_meta.fields 是模型字段的有序列表,可以从实例和模型本身进行访问。

I solved this problem by adding a serialization method to my model:

def toJSON(self):
    import simplejson
    return simplejson.dumps(dict([(attr, getattr(self, attr)) for attr in [f.name for f in self._meta.fields]]))

Here’s the verbose equivalent for those averse to one-liners:

def toJSON(self):
    fields = []
    for field in self._meta.fields:
        fields.append(field.name)

    d = {}
    for attr in fields:
        d[attr] = getattr(self, attr)

    import simplejson
    return simplejson.dumps(d)

_meta.fields is an ordered list of model fields which can be accessed from instances and from the model itself.


回答 7

这是我的解决方案,可让您轻松自定义JSON并组织相关记录

首先在模型上实现一种方法。我称是,json但是您可以随便叫它,例如:

class Car(Model):
    ...
    def json(self):
        return {
            'manufacturer': self.manufacturer.name,
            'model': self.model,
            'colors': [color.json() for color in self.colors.all()],
        }

然后在视图中我这样做:

data = [car.json() for car in Car.objects.all()]
return HttpResponse(json.dumps(data), content_type='application/json; charset=UTF-8', status=status)

Here’s my solution for this, which allows you to easily customize the JSON as well as organize related records

First, implement a method on the model. I call it json, but you can call it whatever you like, e.g.:

class Car(Model):
    ...
    def json(self):
        return {
            'manufacturer': self.manufacturer.name,
            'model': self.model,
            'colors': [color.json() for color in self.colors.all()],
        }

Then in the view I do:

data = [car.json() for car in Car.objects.all()]
return HttpResponse(json.dumps(data), content_type='application/json; charset=UTF-8', status=status)

回答 8

使用清单,将解决问题

第1步:

 result=YOUR_MODELE_NAME.objects.values('PROP1','PROP2').all();

第2步:

 result=list(result)  #after getting data from model convert result to list

第三步:

 return HttpResponse(json.dumps(result), content_type = "application/json")

Use a list; it will solve the problem.

Step 1:

 result = YOUR_MODEL_NAME.objects.values('PROP1', 'PROP2')

Step 2:

 result = list(result)  # after getting data from the model, convert the result to a list

Step 3:

 return HttpResponse(json.dumps(result), content_type="application/json")

回答 9

要序列化和反序列化,请使用以下命令:

from django.core import serializers

serial = serializers.serialize("json", [obj])
...
# next() pulls the first object out of the generator
# .object retrieves the Django model instance from the DeserializedObject
obj = next(serializers.deserialize("json", serial)).object

To serialize and deserialize, use the following:

from django.core import serializers

serial = serializers.serialize("json", [obj])
...
# next() pulls the first object out of the generator
# .object retrieves the Django model instance from the DeserializedObject
obj = next(serializers.deserialize("json", serial)).object

回答 10

.values() 我需要将模型实例转换为JSON。

.values()文档:https ://docs.djangoproject.com/zh/3.0/ref/models/querysets/#values

名为Project的模型的示例用法。

注意:我正在使用Django Rest Framework

    @csrf_exempt
    @api_view(["GET"])
    def get_project(request):
        id = request.query_params['id']
        data = Project.objects.filter(id=id).values()
        if len(data) == 0:
            return JsonResponse(status=404, data={'message': 'Project with id {} not found.'.format(id)})
        return JsonResponse(data[0])

有效ID的结果:

{
    "id": 47,
    "title": "Project Name",
    "description": "",
    "created_at": "2020-01-21T18:13:49.693Z"
}

.values() is what I needed to convert a model instance to JSON.

.values() documentation: https://docs.djangoproject.com/en/3.0/ref/models/querysets/#values

Example usage with a model called Project.

Note: I’m using Django Rest Framework

    @csrf_exempt
    @api_view(["GET"])
    def get_project(request):
        id = request.query_params['id']
        data = Project.objects.filter(id=id).values()
        if len(data) == 0:
            return JsonResponse(status=404, data={'message': 'Project with id {} not found.'.format(id)})
        return JsonResponse(data[0])

Result from a valid id:

{
    "id": 47,
    "title": "Project Name",
    "description": "",
    "created_at": "2020-01-21T18:13:49.693Z"
}

回答 11

如果要将单个模型对象作为json响应返回给客户端,则可以执行以下简单解决方案:

from django.forms.models import model_to_dict
from django.http import JsonResponse

movie = Movie.objects.get(pk=1)
return JsonResponse(model_to_dict(movie))

If you want to return a single model object as a JSON response to a client, you can use this simple solution:

from django.forms.models import model_to_dict
from django.http import JsonResponse

movie = Movie.objects.get(pk=1)
return JsonResponse(model_to_dict(movie))

回答 12

使用Django序列化器python格式,

from django.core import serializers

qs = SomeModel.objects.all()
serialized_obj = serializers.serialize('python', qs)

jsonpython格式有什么区别?

json格式将返回的结果str,而python将在返回的结果要么listOrderedDict

Use the Django serializer with the python format:

from django.core import serializers

qs = SomeModel.objects.all()
serialized_obj = serializers.serialize('python', qs)

What’s the difference between the json and python formats?

The json format returns the result as a str, whereas the python format returns it as either a list or an OrderedDict.


回答 13

似乎您不能序列化一个实例,而必须序列化一个对象的QuerySet。

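from django.core import serializers
from django.http import HttpResponse
from models import *

def getUser(request):
    # the original snippet called an undefined json(); serializers.serialize is assumed here
    return HttpResponse(serializers.serialize('json', Users.objects.filter(id=88)))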

我用完了svndjango发行版,因此在较早的版本中可能不存在。

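It doesn’t seem that you can serialize a single instance; you’d have to serialize a QuerySet of one object.

from django.core import serializers
from django.http import HttpResponse
from models import *

def getUser(request):
    # the original snippet called an undefined json(); serializers.serialize is assumed here
    return HttpResponse(serializers.serialize('json', Users.objects.filter(id=88)))

I run off the svn release of Django, so this may not be available in earlier versions.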


回答 14

ville = UneVille.objects.get(nom='lihlihlihlih')
....
blablablab
.......

return HttpResponse(simplejson.dumps(ville.__dict__))

我返回我的实例的命令

因此它返回的内容类似于{‘field1’:value,“ field2”:value,….}

ville = UneVille.objects.get(nom='lihlihlihlih')
....
blablablab
.......

return HttpResponse(simplejson.dumps(ville.__dict__))

I return the dict of my instance, so it returns something like {'field1': value, 'field2': value, ...}


回答 15

这样呢:

def ins2dic(obj):
    # copy first: deleting from obj.__dict__ itself would strip the attributes from the instance
    SubDic = dict(obj.__dict__)
    del SubDic['id']
    del SubDic['_state']
    return SubDic

或排除您不想要的任何东西。

How about this:

def ins2dic(obj):
    # copy first: deleting from obj.__dict__ itself would strip the attributes from the instance
    SubDic = dict(obj.__dict__)
    del SubDic['id']
    del SubDic['_state']
    return SubDic

Or exclude anything else you don’t want.
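
A variant of the same idea can be sketched with a plain stand-in class. FakeInstance and instance_to_dict are hypothetical names, not Django APIs; the function takes the excluded attribute names as a parameter and builds a new dict rather than editing the instance's own __dict__.

```python
class FakeInstance:
    """Hypothetical stand-in for a model instance; not a real Django model."""

    def __init__(self):
        self.id = 7
        self._state = object()  # Django keeps internal bookkeeping here
        self.name = "Paris"


def instance_to_dict(obj, exclude=("id", "_state")):
    """Return a filtered copy of obj.__dict__, leaving obj itself untouched."""
    return {key: value for key, value in vars(obj).items() if key not in exclude}


instance = FakeInstance()
data = instance_to_dict(instance)
print(data)
print(instance.id)  # the attribute is still present on the instance
```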


回答 16

与我希望从框架(最简单的方法)相比,所有这些答案都有些棘手,如果您使用其余框架,我认为到目前为止,这是最简单的方法:

rep = YourSerializerClass().to_representation(your_instance)
json.dumps(rep)

这将直接使用Serializer,同时尊重您在其上定义的字段以及任何关联等。

All of these answers are a little hacky compared to what I would expect from a framework. If you are using the REST framework, I think the simplest method by far is:

rep = YourSerializerClass().to_representation(your_instance)
json.dumps(rep)

This uses the Serializer directly, respecting the fields you’ve defined on it, as well as any associations, etc.


如何对集合进行JSON序列化?

问题:如何对集合进行JSON序列化?

我有一个Python set,其中包含带有__hash____eq__方法的对象,以确保该集合中没有重复项。

我需要对该结果进行json编码set,但是即使将一个空值传递set给该json.dumps方法也会引发TypeError

  File "/usr/lib/python2.7/json/encoder.py", line 201, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 264, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python2.7/json/encoder.py", line 178, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: set([]) is not JSON serializable

我知道我可以为json.JSONEncoder具有自定义default方法的类创建扩展,但是我什至不知道从哪里开始转换set。是否应该set使用默认方法中的值创建字典,然后返回该方法的编码?理想情况下,我想使默认方法能够处理原始编码器阻塞的所有数据类型(我将Mongo用作数据源,因此日期似乎也引发了此错误)

正确方向的任何提示将不胜感激。

编辑:

感谢你的回答!也许我应该更精确一些。

我利用(并赞成)这里的答案来解决set翻译的局限性,但是内部键也是一个问题。

中的set对象是转换为的复杂对象__dict__,但它们本身也可以包含其属性值,这些值可能不符合json编码器中的基本类型。

涉及到很多不同的类型set,并且哈希基本上为实体计算了唯一的ID,但是按照NoSQL的真正精神,没有确切说明子对象包含什么。

一个对象可能包含的日期值starts,而另一个对象可能具有一些其他模式,该模式不包含包含“非原始”对象的键。

这就是为什么我唯一能想到的解决方案是扩展JSONEncoder替换default方法以打开不同情况的方法-但我不确定如何进行此操作,并且文档不明确。在嵌套对象中,是default按键返回go 的值,还是只是查看整个对象的通用包含/丢弃?该方法如何容纳嵌套值?我已经看过先前的问题,但似乎找不到最佳的针对特定情况的编码的方法(不幸的是,这似乎是我在这里需要做的事情)。

I have a Python set that contains objects with __hash__ and __eq__ methods in order to make certain no duplicates are included in the collection.

I need to json encode this result set, but passing even an empty set to the json.dumps method raises a TypeError.

  File "/usr/lib/python2.7/json/encoder.py", line 201, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 264, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python2.7/json/encoder.py", line 178, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: set([]) is not JSON serializable

I know I can create an extension to the json.JSONEncoder class that has a custom default method, but I’m not even sure where to begin in converting over the set. Should I create a dictionary out of the set values within the default method, and then return the encoding on that? Ideally, I’d like to make the default method able to handle all the datatypes that the original encoder chokes on (I’m using Mongo as a data source so dates seem to raise this error too)

Any hint in the right direction would be appreciated.

EDIT:

Thanks for the answer! Perhaps I should have been more precise.

I utilized (and upvoted) the answers here to get around the limitations of the set being translated, but there are internal keys that are an issue as well.

The objects in the set are complex objects that translate to __dict__, but they themselves can also contain values for their properties that could be ineligible for the basic types in the json encoder.

There’s a lot of different types coming into this set, and the hash basically calculates a unique id for the entity, but in the true spirit of NoSQL there’s no telling exactly what the child object contains.

One object might contain a date value for starts, whereas another may have some other schema that includes no keys containing “non-primitive” objects.

That is why the only solution I could think of was to extend the JSONEncoder to replace the default method to turn on different cases – but I’m not sure how to go about this and the documentation is ambiguous. In nested objects, does the value returned from default go by key, or is it just a generic include/discard that looks at the whole object? How does that method accommodate nested values? I’ve looked through previous questions and can’t seem to find the best approach to case-specific encoding (which unfortunately seems like what I’m going to need to do here).


回答 0

JSON表示法只有少数本机数据类型(对象,数组,字符串,数字,布尔值和null),因此以JSON序列化的任何内容都必须表示为这些类型之一。

json模块docs所示,此转换可以由JSONEncoderJSONDecoder自动完成,但随后您将放弃可能需要的其他一些结构(如果将集转换为列表,则将失去恢复常规数据的能力。列表;如果使用将集转换为字典,dict.fromkeys(s)则将失去恢复字典的能力)。

一个更复杂的解决方案是构建可以与其他本机JSON类型共存的自定义类型。这使您可以存储嵌套结构,其中包括列表,集合,字典,小数,日期时间对象等:

from json import dumps, loads, JSONEncoder, JSONDecoder
import pickle

class PythonObjectEncoder(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, (list, dict, str, unicode, int, float, bool, type(None))):
            return JSONEncoder.default(self, obj)
        return {'_python_object': pickle.dumps(obj)}

def as_python_object(dct):
    if '_python_object' in dct:
        return pickle.loads(str(dct['_python_object']))
    return dct

这是一个示例会话,显示它可以处理列表,字典和集合:

>>> data = [1,2,3, set(['knights', 'who', 'say', 'ni']), {'key':'value'}, Decimal('3.14')]

>>> j = dumps(data, cls=PythonObjectEncoder)

>>> loads(j, object_hook=as_python_object)
[1, 2, 3, set(['knights', 'say', 'who', 'ni']), {u'key': u'value'}, Decimal('3.14')]

另外,使用更通用的序列化技术(例如YAMLTwisted Jelly或Python的pickle模块)可能很有用。它们每个都支持更大范围的数据类型。

JSON notation has only a handful of native datatypes (objects, arrays, strings, numbers, booleans, and null), so anything serialized in JSON needs to be expressed as one of these types.

As shown in the json module docs, this conversion can be done automatically by a JSONEncoder and JSONDecoder, but then you would be giving up some other structure you might need (if you convert sets to a list, then you lose the ability to recover regular lists; if you convert sets to a dictionary using dict.fromkeys(s) then you lose the ability to recover dictionaries).

A more sophisticated solution is to build a custom type that can coexist with the other native JSON types. This lets you store nested structures that include lists, sets, dicts, decimals, datetime objects, etc.:

# Written for Python 2 (note the unicode type below)
from json import dumps, loads, JSONEncoder, JSONDecoder
import pickle

class PythonObjectEncoder(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, (list, dict, str, unicode, int, float, bool, type(None))):
            return JSONEncoder.default(self, obj)
        return {'_python_object': pickle.dumps(obj)}

def as_python_object(dct):
    if '_python_object' in dct:
        return pickle.loads(str(dct['_python_object']))
    return dct

Here is a sample session showing that it can handle lists, dicts, and sets:

>>> from decimal import Decimal
>>> data = [1,2,3, set(['knights', 'who', 'say', 'ni']), {'key':'value'}, Decimal('3.14')]

>>> j = dumps(data, cls=PythonObjectEncoder)

>>> loads(j, object_hook=as_python_object)
[1, 2, 3, set(['knights', 'say', 'who', 'ni']), {u'key': u'value'}, Decimal('3.14')]

Alternatively, it may be useful to use a more general purpose serialization technique such as YAML, Twisted Jelly, or Python’s pickle module. These each support a much greater range of datatypes.


Answer 1

You can create a custom encoder that returns a list when it encounters a set. Here’s an example:

>>> import json
>>> class SetEncoder(json.JSONEncoder):
...    def default(self, obj):
...       if isinstance(obj, set):
...          return list(obj)
...       return json.JSONEncoder.default(self, obj)
... 
>>> json.dumps(set([1,2,3,4,5]), cls=SetEncoder)
'[1, 2, 3, 4, 5]'

You can detect other types this way too. If you need to retain that the list was actually a set, you could use a custom encoding. Something like return {'type':'set', 'list':list(obj)} might work.
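
A minimal sketch of that tagged encoding, with a matching object_hook to restore the set on load. The 'type' and 'list' keys follow the suggestion above; the names TaggedSetEncoder and restore_sets are made up for this illustration.

```python
import json


class TaggedSetEncoder(json.JSONEncoder):
    """Hypothetical encoder: writes a set as a tagged dict instead of a bare list."""

    def default(self, obj):
        if isinstance(obj, set):
            return {'type': 'set', 'list': list(obj)}
        return super().default(obj)


def restore_sets(dct):
    """object_hook: turn the tagged dict back into a set, leave other dicts alone."""
    if dct.get('type') == 'set' and 'list' in dct:
        return set(dct['list'])
    return dct


encoded = json.dumps({'tags': {1, 2, 3}, 'items': [1, 2, 3]}, cls=TaggedSetEncoder)
decoded = json.loads(encoded, object_hook=restore_sets)
```

After the round trip, decoded['tags'] is a set again while decoded['items'] stays a list, which a bare list(obj) conversion cannot distinguish.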

To illustrate nested types, consider serializing this:

>>> class Something(object):
...    pass
>>> json.dumps(set([1,2,3,4,5,Something()]), cls=SetEncoder)

This raises the following error:

TypeError: <__main__.Something object at 0x1691c50> is not JSON serializable

This indicates that the encoder will take the list result returned and recursively call the serializer on its children. To add a custom serializer for multiple types, you can do this:

>>> class SetEncoder(json.JSONEncoder):
...    def default(self, obj):
...       if isinstance(obj, set):
...          return list(obj)
...       if isinstance(obj, Something):
...          return 'CustomSomethingRepresentation'
...       return json.JSONEncoder.default(self, obj)
... 
>>> json.dumps(set([1,2,3,4,5,Something()]), cls=SetEncoder)
'[1, 2, 3, 4, 5, "CustomSomethingRepresentation"]'
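
To see the recursion at work on deeper nesting, here is a small sketch (the data and the RecordingEncoder name are invented for illustration). default is called once for each object the encoder cannot handle, wherever it sits in the structure, and whatever it returns is encoded again.

```python
import json
from decimal import Decimal

calls = []  # records every object that reaches default()


class RecordingEncoder(json.JSONEncoder):
    def default(self, obj):
        calls.append(obj)
        if isinstance(obj, (set, frozenset)):
            return sorted(obj, key=str)  # the returned list is encoded recursively
        if isinstance(obj, Decimal):
            return str(obj)
        return super().default(obj)


data = {'outer': [{'inner': {Decimal('1.5')}}]}
text = json.dumps(data, cls=RecordingEncoder)
```

Here default sees the set first and then, on the list it returned, the Decimal. It is never asked about the enclosing dicts and lists, because the encoder handles those natively.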

Answer 2

I adapted Raymond Hettinger’s solution to python 3.

Here is what has changed:

  • the unicode type is gone
  • the call to the parent’s default now goes through super()
  • base64 is used to serialize the bytes returned by pickle into a str (because bytes cannot be converted to JSON in Python 3)

from decimal import Decimal
from base64 import b64encode, b64decode
from json import dumps, loads, JSONEncoder
import pickle

class PythonObjectEncoder(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, (list, dict, str, int, float, bool, type(None))):
            return super().default(obj)
        return {'_python_object': b64encode(pickle.dumps(obj)).decode('utf-8')}

def as_python_object(dct):
    if '_python_object' in dct:
        return pickle.loads(b64decode(dct['_python_object'].encode('utf-8')))
    return dct

data = [1,2,3, set(['knights', 'who', 'say', 'ni']), {'key':'value'}, Decimal('3.14')]
j = dumps(data, cls=PythonObjectEncoder)
print(loads(j, object_hook=as_python_object))
# prints: [1, 2, 3, {'knights', 'who', 'say', 'ni'}, {'key': 'value'}, Decimal('3.14')]

Answer 3

Only dictionaries, lists, and primitive types (int, float, string, bool, and None) can be represented in JSON.


Answer 4

You don’t need to make a custom encoder class to supply the default method – it can be passed in as a keyword argument:

import json

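def serialize_sets(obj):
    if isinstance(obj, set):
        return list(obj)

    # Anything else is unsupported: raise instead of handing the object back
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")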

json_str = json.dumps(set([1,2,3]), default=serialize_sets)
print(json_str)

results in [1, 2, 3] in all supported Python versions.
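
One detail worth checking in any default function is what happens to types it does not recognise. As a sketch (the function names are invented here): returning the object unchanged makes the encoder call default on it again, which the json module reports as a circular reference, while raising TypeError gives the usual error.

```python
import json


def passthrough_default(obj):
    if isinstance(obj, set):
        return list(obj)
    return obj  # unsupported object handed straight back


def strict_default(obj):
    if isinstance(obj, set):
        return list(obj)
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")


def error_for(default):
    """Return the exception type raised when encoding an unsupported object."""
    try:
        json.dumps({'value': object()}, default=default)
    except (TypeError, ValueError) as exc:
        return type(exc)
    return None


passthrough_error = error_for(passthrough_default)  # ValueError (circular reference)
strict_error = error_for(strict_default)            # TypeError
```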


Answer 5

If you only need to encode sets, not general Python objects, and want to keep it easily human-readable, a simplified version of Raymond Hettinger’s answer can be used:

import json
import collections.abc  # the Set ABC lives here; collections.Set was removed in Python 3.10

class JSONSetEncoder(json.JSONEncoder):
    """Use with json.dumps to allow Python sets to be encoded to JSON

    Example
    -------

    import json

    data = dict(aset=set([1,2,3]))

    encoded = json.dumps(data, cls=JSONSetEncoder)
    decoded = json.loads(encoded, object_hook=json_as_python_set)
    assert data == decoded     # Should assert successfully

    Any object that is matched by isinstance(obj, collections.abc.Set) will
    be encoded, but the decoded value will always be a normal Python set.

    """

    def default(self, obj):
        if isinstance(obj, collections.abc.Set):
            return dict(_set_object=list(obj))
        else:
            return json.JSONEncoder.default(self, obj)

def json_as_python_set(dct):
    """Decode json {'_set_object': [1,2,3]} to set([1,2,3])

    Example
    -------
    decoded = json.loads(encoded, object_hook=json_as_python_set)

    Also see :class:`JSONSetEncoder`

    """
    if '_set_object' in dct:
        return set(dct['_set_object'])
    return dct

Answer 6

If you just need a quick dump and don’t want to implement a custom encoder, you can use the following. Note that iterable_as_array is an option of the third-party simplejson package, not of the standard library json module:

import simplejson as json

json_string = json.dumps(data, iterable_as_array=True)

This converts all sets (and other iterables) into arrays. Beware that those fields stay arrays when you parse the JSON back. If you want to preserve the types, you need to write a custom encoder.


Answer 7

One shortcoming of the accepted solution is that its output is very Python-specific: the raw JSON cannot be read by a human or loaded by another language (e.g. JavaScript). Example:

db = {
        "a": [ 44, set((4,5,6)) ],
        "b": [ 55, set((4,3,2)) ]
        }

j = dumps(db, cls=PythonObjectEncoder)
print(j)

Will get you:

{"a": [44, {"_python_object": "gANjYnVpbHRpbnMKc2V0CnEAXXEBKEsESwVLBmWFcQJScQMu"}], "b": [55, {"_python_object": "gANjYnVpbHRpbnMKc2V0CnEAXXEBKEsCSwNLBGWFcQJScQMu"}]}

I can propose a solution which downgrades the set to a dict containing a list on the way out, and turns it back into a set when loaded into Python with the matching object hook, thereby preserving readability and language independence:

from decimal import Decimal
from base64 import b64encode, b64decode
from json import dumps, loads, JSONEncoder
import pickle

class PythonObjectEncoder(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, (list, dict, str, int, float, bool, type(None))):
            return super().default(obj)
        elif isinstance(obj, set):
            return {"__set__": list(obj)}
        return {'_python_object': b64encode(pickle.dumps(obj)).decode('utf-8')}

def as_python_object(dct):
    if '__set__' in dct:
        return set(dct['__set__'])
    elif '_python_object' in dct:
        return pickle.loads(b64decode(dct['_python_object'].encode('utf-8')))
    return dct

db = {
        "a": [ 44, set((4,5,6)) ],
        "b": [ 55, set((4,3,2)) ]
        }

j = dumps(db, cls=PythonObjectEncoder)
print(j)
ob = loads(j, object_hook=as_python_object)
print(ob["a"])

Which gets you:

{"a": [44, {"__set__": [4, 5, 6]}], "b": [55, {"__set__": [2, 3, 4]}]}
[44, {4, 5, 6}]

Note that serializing a dictionary which has an element with a key "__set__" will break this mechanism. So __set__ has now become a reserved dict key. Obviously feel free to use another, more deeply obfuscated key.
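
The collision is easy to reproduce. In this sketch, decode_sets is a stand-in for the set branch of as_python_object above, and an ordinary dict that happens to use the key "__set__" comes back as a set:

```python
import json


def decode_sets(dct):
    # Same rule as the '__set__' branch of as_python_object
    if '__set__' in dct:
        return set(dct['__set__'])
    return dct


plain = {'__set__': [1, 2, 2], 'note': 'just a dict'}
round_tripped = json.loads(json.dumps(plain), object_hook=decode_sets)
```

The 'note' entry and the duplicate list element are silently lost, which is why the marker key should be one that real data will never contain.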


Convert a string to an Enum in Python

Question: Convert a string to an Enum in Python

I wonder what the correct way is to convert (deserialize) a string to a member of a Python Enum class. It seems that getattr(YourEnumType, str) does the job, but I’m not sure if it’s safe enough.

To be more specific, I would like to convert a 'debug' string to an Enum object like this:

class BuildType(Enum):
    debug = 200
    release = 400

Answer 0

This functionality is already built in to Enum [1]:

>>> from enum import Enum
>>> class Build(Enum):
...   debug = 200
...   build = 400
... 
>>> Build['debug']
<Build.debug: 200>

[1] Official docs: Enum programmatic access
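
A short sketch contrasting the two built-in lookups; parse_build is a hypothetical helper, not part of the enum module. Unlike getattr, subscripting only ever consults the members, so a string such as '__class__' raises KeyError instead of returning an unrelated attribute.

```python
from enum import Enum


class Build(Enum):
    debug = 200
    build = 400


def parse_build(name, default=None):
    """Look a member up by name; return `default` for unknown names."""
    try:
        return Build[name]          # lookup by member name
    except KeyError:
        return default


by_name = parse_build('debug')
by_value = Build(200)               # lookup by member value
unknown = parse_build('__class__')  # not a member, so nothing leaks through
```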


Answer 1

Another alternative (especially useful if your strings don’t map 1-1 to your enum cases) is to add a staticmethod to your Enum, e.g.:

class QuestionType(enum.Enum):
    MULTI_SELECT = "multi"
    SINGLE_SELECT = "single"

    @staticmethod
    def from_str(label):
        if label in ('single', 'singleSelect'):
            return QuestionType.SINGLE_SELECT
        elif label in ('multi', 'multiSelect'):
            return QuestionType.MULTI_SELECT
        else:
            raise NotImplementedError

Then you can do question_type = QuestionType.from_str('singleSelect')


Answer 2

def custom_enum(typename, items_dict):
    class_definition = """
from enum import Enum

class {}(Enum):
    {}""".format(typename, '\n    '.join(['{} = {}'.format(k, v) for k, v in items_dict.items()]))

    namespace = dict(__name__='enum_%s' % typename)
    exec(class_definition, namespace)
    result = namespace[typename]
    result._source = class_definition
    return result

MyEnum = custom_enum('MyEnum', {'a': 123, 'b': 321})
print(MyEnum.a, MyEnum.b)

Or do you need to convert a string to a member of a known Enum?

class MyEnum(Enum):
    a = 'aaa'
    b = 123

print(MyEnum('aaa'), MyEnum(123))

Or:

class BuildType(Enum):
    debug = 200
    release = 400

print(BuildType.__dict__['debug'])

# eval() executes arbitrary code, so never pass it untrusted input
print(eval('BuildType.debug'))
print(type(eval('BuildType.debug')))
print(eval(BuildType.__name__ + '.debug'))  # keeps working if the class is renamed during refactoring

Answer 3

My Java-like solution to the problem. Hope it helps someone…

from enum import Enum, auto


class SignInMethod(Enum):
    EMAIL = auto()  # no trailing comma here: it would turn the value into a tuple
    GOOGLE = auto()

    @staticmethod
    def value_of(value) -> Enum:
        # Case-insensitive lookup by member name; returns None if nothing matches
        for m, mm in SignInMethod.__members__.items():
            if m == value.upper():
                return mm


sim = SignInMethod.value_of('EMAIL')
print("""TEST
1). {0}
2). {1}
3). {2}
""".format(sim, sim.name, isinstance(sim, SignInMethod)))

Answer 4

An improvement to the answer of @rogueleaderr :

class QuestionType(enum.Enum):
    MULTI_SELECT = "multi"
    SINGLE_SELECT = "single"

    @classmethod
    def from_str(cls, label):
        if label in ('single', 'singleSelect'):
            return cls.SINGLE_SELECT
        elif label in ('multi', 'multiSelect'):
            return cls.MULTI_SELECT
        else:
            raise NotImplementedError
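
As a variation, assuming Python 3.6 or later, the alias table can be hooked into the normal QuestionType(value) call through the _missing_ classmethod that Enum provides; unknown labels then raise ValueError. This is a sketch of an alternative, not the approach from the answers above.

```python
import enum


class QuestionType(enum.Enum):
    MULTI_SELECT = "multi"
    SINGLE_SELECT = "single"

    @classmethod
    def _missing_(cls, value):
        # Called by Enum when QuestionType(value) finds no matching member
        aliases = {
            "multiSelect": cls.MULTI_SELECT,
            "singleSelect": cls.SINGLE_SELECT,
        }
        return aliases.get(value)  # returning None makes Enum raise ValueError


canonical = QuestionType("single")
aliased = QuestionType("singleSelect")
```

The alias dict is built inside the method on purpose: a plain dict assigned in the class body would itself be turned into an enum member.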

Answer 5

I just want to point out that this does not work in Python 3.6:

class MyEnum(Enum):
    a = 'aaa'
    b = 123

print(MyEnum('aaa'), MyEnum(123))

You would have to give the data as a tuple, like this:

MyEnum(('aaa',))

EDIT: This turns out to be false. Credit to a commenter for pointing out my mistake.


Best way to save a trained model in PyTorch?

Question: Best way to save a trained model in PyTorch?

I was looking for alternative ways to save a trained model in PyTorch. So far, I have found two alternatives.

  1. torch.save() to save a model and torch.load() to load a model.
  2. model.state_dict() to save a trained model and model.load_state_dict() to load the saved model.

I came across this discussion, where approach 2 is recommended over approach 1.

My question is: why is the second approach preferred? Is it only because torch.nn modules have those two functions and we are encouraged to use them?


Answer 0

I found this page in their GitHub repo; I’ll just paste the content here.


Recommended approach for saving a model

There are two main approaches for serializing and restoring a model.

The first (recommended) saves and loads only the model parameters:

torch.save(the_model.state_dict(), PATH)

Then later:

the_model = TheModelClass(*args, **kwargs)
the_model.load_state_dict(torch.load(PATH))

The second saves and loads the entire model:

torch.save(the_model, PATH)

Then later:

the_model = torch.load(PATH)

However in this case, the serialized data is bound to the specific classes and the exact directory structure used, so it can break in various ways when used in other projects, or after some serious refactors.
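
A minimal sketch of the recommended approach, assuming PyTorch is installed. A torch.nn.Linear layer stands in for TheModelClass, and an in-memory buffer stands in for PATH so the example leaves no file behind.

```python
import io

import torch

model = torch.nn.Linear(3, 1)

# Save only the parameters; the buffer plays the role of PATH
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)

# Restoring needs the class (and its constructor arguments) from your own code
buffer.seek(0)
restored = torch.nn.Linear(3, 1)
restored.load_state_dict(torch.load(buffer))
```

Because only tensors keyed by parameter name are stored, the saved data does not depend on where the model class lives in your project.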


Answer 1

It depends on what you want to do.

Case # 1: Save the model to use it yourself for inference: You save the model, you restore it, and then you change the model to evaluation mode. This is done because you usually have BatchNorm and Dropout layers that by default are in train mode on construction:

torch.save(model.state_dict(), filepath)

#Later to restore:
model.load_state_dict(torch.load(filepath))
model.eval()

Case # 2: Save model to resume training later: If you need to keep training the model that you are about to save, you need to save more than just the model. You also need to save the state of the optimizer, epochs, score, etc. You would do it like this:

state = {
    'epoch': epoch,
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    ...
}
torch.save(state, filepath)

To resume training you would do things like: state = torch.load(filepath), and then, to restore the state of each individual object, something like this:

model.load_state_dict(state['state_dict'])
optimizer.load_state_dict(state['optimizer'])

Since you are resuming training, DO NOT call model.eval() once you restore the states when loading.
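
Put together, a save-and-resume cycle might look like the sketch below. It assumes PyTorch is installed; the tiny Linear model, the SGD settings, the epoch number, and the in-memory buffer used in place of filepath are all placeholders chosen for illustration.

```python
import io

import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# One training step so the optimizer has momentum buffers worth saving
loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step()

buffer = io.BytesIO()  # stands in for the checkpoint file
torch.save({'epoch': 3,
            'state_dict': model.state_dict(),
            'optimizer': optimizer.state_dict()}, buffer)

# Later: rebuild the objects first, then load their states
buffer.seek(0)
state = torch.load(buffer)
new_model = torch.nn.Linear(4, 2)
new_optimizer = torch.optim.SGD(new_model.parameters(), lr=0.01, momentum=0.9)
new_model.load_state_dict(state['state_dict'])
new_optimizer.load_state_dict(state['optimizer'])
start_epoch = state['epoch'] + 1
new_model.train()  # stay in training mode when resuming
```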

Case # 3: Model to be used by someone else with no access to your code: In TensorFlow you can create a .pb file that defines both the architecture and the weights of the model. This is very handy, especially when using TensorFlow Serving. The closest way to do this in PyTorch would be:

torch.save(model, filepath)

# Then later:
model = torch.load(filepath)

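This way is still not bulletproof: unpickling the saved model still requires the model’s class definition to be importable on the loading side, and since PyTorch is still undergoing a lot of changes, I wouldn’t recommend it.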


Answer 2

The pickle Python library implements binary protocols for serializing and de-serializing a Python object.

When you import torch, it imports pickle for you, so you don’t need to call pickle.dump() and pickle.load(), the functions that save and load an object, directly.

In fact, torch.save() and torch.load() wrap pickle.dump() and pickle.load() for you.

The state_dict mentioned in the other answer deserves a few more notes.

Which state_dicts do we have inside PyTorch? There are actually two.

A PyTorch model is a torch.nn.Module, which has a model.parameters() call to get the learnable parameters (w and b). These learnable parameters, once randomly initialized, are updated over time as the model learns. The learnable parameters make up the first state_dict.

The second state_dict is the optimizer’s state dict. Recall that the optimizer is used to improve the learnable parameters. The optimizer’s state_dict, however, holds no learnable parameters of its own.

Because state_dict objects are Python dictionaries, they can be easily saved, updated, altered, and restored, adding a great deal of modularity to PyTorch models and optimizers.

Let’s create a super simple model to explain this:

import torch
import torch.optim as optim

model = torch.nn.Linear(5, 2)

# Initialize optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

print("Model's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

print("Model weight:")    
print(model.weight)

print("Model bias:")    
print(model.bias)

print("---")
print("Optimizer's state_dict:")
for var_name in optimizer.state_dict():
    print(var_name, "\t", optimizer.state_dict()[var_name])

This code will output the following:

Model's state_dict:
weight   torch.Size([2, 5])
bias     torch.Size([2])
Model weight:
Parameter containing:
tensor([[ 0.1328,  0.1360,  0.1553, -0.1838, -0.0316],
        [ 0.0479,  0.1760,  0.1712,  0.2244,  0.1408]], requires_grad=True)
Model bias:
Parameter containing:
tensor([ 0.4112, -0.0733], requires_grad=True)
---
Optimizer's state_dict:
state    {}
param_groups     [{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [140695321443856, 140695321443928]}]

Note that this is a minimal model. You may try adding a stack of sequential layers:

# D_in, H, D_out, A, B and C are placeholder sizes
model = torch.nn.Sequential(
          torch.nn.Linear(D_in, H),
          torch.nn.Conv2d(A, B, C),
          torch.nn.Linear(H, D_out),
        )

Note that only layers with learnable parameters (convolutional layers, linear layers, etc.) and registered buffers (batchnorm layers) have entries in the model’s state_dict.

Non-learnable things belong to the optimizer object’s state_dict, which contains information about the optimizer’s state as well as the hyperparameters used.

The rest of the story is the same: in the inference phase (when we use the model after training), we predict based on the parameters we learned. So for inference, we just need to save the parameters, model.state_dict():

torch.save(model.state_dict(), filepath)

And to use it later:

model.load_state_dict(torch.load(filepath))
model.eval()

Note: Don’t forget the last line, model.eval(); it is crucial after loading the model.

Also don’t try to save torch.save(model.parameters(), filepath). The model.parameters() is just the generator object.
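
The reason can be sketched with the standard library alone, relying on the statement above that torch.save() wraps pickle: a generator cannot be pickled, whereas a concrete container built from it can. The squares generator below is only a stand-in for model.parameters().

```python
import pickle

values = (x * x for x in range(3))  # a generator, like model.parameters()

try:
    pickle.dumps(values)
    generator_pickled = True
except TypeError:
    generator_pickled = False       # pickle refuses generator objects

# Materialising the values first gives pickle something concrete to store
restored = pickle.loads(pickle.dumps(list(values)))
```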

On the other hand, torch.save(model, filepath) saves the model object itself, but keep in mind that the model doesn’t contain the optimizer’s state_dict. Check the other excellent answer by @Jadiel de Armas to see how to save the optimizer’s state dict.


Answer 3

A common PyTorch convention is to save models using either a .pt or .pth file extension.

Save/Load Entire Model

Save:

path = "username/directory/lstmmodelgpu.pth"
torch.save(trainer, path)

Load:

The model class must be defined somewhere.

# Recent PyTorch releases may require weights_only=False to unpickle a full model
model = torch.load(path)
model.eval()

Answer 4

If you want to save the model and resume training later:

Single GPU: Save:

state = {
        'epoch': epoch,
        'state_dict': model.state_dict(),
        'optimizer': optimizer.state_dict(),
}
savepath='checkpoint.t7'
torch.save(state,savepath)

Load:

checkpoint = torch.load('checkpoint.t7')
model.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])
epoch = checkpoint['epoch']

Multiple GPU: Save

state = {
        'epoch': epoch,
        'state_dict': model.module.state_dict(),
        'optimizer': optimizer.state_dict(),
}
savepath='checkpoint.t7'
torch.save(state,savepath)

Load:

checkpoint = torch.load('checkpoint.t7')
model.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])
epoch = checkpoint['epoch']

#Don't call DataParallel before loading the model otherwise you will get an error

model = nn.DataParallel(model) #ignore the line if you want to load on Single GPU