Tag Archives: pymongo

JSON ValueError: Expecting property name: line 1 column 2 (char 1)

Question: JSON ValueError: Expecting property name: line 1 column 2 (char 1)

I am having trouble using json.loads to convert to a dict object and I can't figure out what I'm doing wrong. The exact error I get when running this is:

ValueError: Expecting property name: line 1 column 2 (char 1)

Here is my code:

from kafka.client import KafkaClient
from kafka.consumer import SimpleConsumer
from kafka.producer import SimpleProducer, KeyedProducer
import pymongo
from pymongo import MongoClient
import json

c = MongoClient("54.210.157.57")
db = c.test_database3
collection = db.tweet_col

kafka = KafkaClient("54.210.157.57:9092")

consumer = SimpleConsumer(kafka,"myconsumer","test")
for tweet in consumer:
    print tweet.message.value
    jsonTweet = json.loads({u'favorited': False, u'contributors': None})
    collection.insert(jsonTweet)

I'm pretty sure that the error is occurring on the second-to-last line:

jsonTweet=json.loads({u'favorited': False, u'contributors': None})

but I do not know what to do to fix it. Any advice would be appreciated.


Answer 0

json.loads will load a JSON string into a Python dict, and json.dumps will dump a Python dict to a JSON string, for example:

>>> json_string = '{"favorited": false, "contributors": null}'
>>> value = json.loads(json_string)
>>> value
{u'favorited': False, u'contributors': None}
>>> json_dump = json.dumps(value)
>>> json_dump
'{"favorited": false, "contributors": null}'

So that line is incorrect, since you are passing a Python dict to json.loads, which expects a valid JSON string of type <type 'str'>.

So if you are trying to load the json, you should change what you are loading to look like the json_string above, or you should be dumping it. This is just my best guess from the given information. What is it that you are trying to accomplish?

Also you don’t need to specify the u before your strings, as @Cld mentioned in the comments.
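
For the loop in the question, the value coming off Kafka is presumably already a JSON string, so a minimal sketch of the fix (assuming the producer publishes JSON text) is to parse that string rather than a dict:

for tweet in consumer:
    raw = tweet.message.value      # the raw message, assumed to be a JSON string
    jsonTweet = json.loads(raw)    # parse it into a Python dict
    collection.insert(jsonTweet)   # PyMongo inserts the dict directly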


Answer 1

I encountered another problem that returns the same error.

Single quote issue

I used a JSON string with single quotes:

{
    'property': 1
}

But json.loads accepts only double quotes for JSON properties:

{
    "property": 1
}

Final comma issue

json.loads doesn’t accept a final comma:

{
  "property": "text", 
  "property2": "text2",
}

Solution: ast to solve single quote and final comma issues

You can use ast (part of the standard library in both Python 2 and 3) for this. Here is an example:

import ast
# ast.literal_eval() returns a dict object; use json.dumps to get a JSON string
import json

# Single quote to double with ast.literal_eval()
json_data = "{'property': 'text'}"
json_data = ast.literal_eval(json_data)
print(json.dumps(json_data))
# Displays : {"property": "text"}

# ast.literal_eval() with double quotes
json_data = '{"property": "text"}'
json_data = ast.literal_eval(json_data)
print(json.dumps(json_data))
# Displays : {"property": "text"}

# ast.literal_eval() with a final comma
json_data = "{'property': 'text', 'property2': 'text2',}"
json_data = ast.literal_eval(json_data)
print(json.dumps(json_data))
# Displays : {"property2": "text2", "property": "text"}

Using ast avoids the single-quote and final-comma issues by interpreting the JSON like a Python dictionary (so you must follow Python dictionary syntax). It's a good and safe alternative to the eval() function for literal structures.

The Python documentation warns about large/complex strings:

Warning It is possible to crash the Python interpreter with a sufficiently large/complex string due to stack depth limitations in Python’s AST compiler.

json.dumps with single quotes

To use json.dumps with single-quoted input, you can use this code:

import ast
import json

data = json.dumps(ast.literal_eval(json_data_single_quote))

ast documentation

ast Python 3 doc

ast Python 2 doc

Tool

If you frequently edit JSON, you can use CodeBeautify. It helps you fix syntax errors and minify/beautify JSON.

I hope it helps.


Answer 2

  1. Replace all single quotes with double quotes.
  2. Replace u" in your strings with " ... so basically convert the internal unicode prefixes to plain strings before loading the string as JSON.
>>> strs = "{u'key':u'val'}"
>>> strs = strs.replace("'", '"')
>>> json.loads(strs.replace('u"', '"'))

Answer 3

The other answers may address your query, but I faced the same issue, which was due to a stray , that I had added at the end of my JSON string, like this:

{
 "key":"123sdf",
 "bus_number":"asd234sdf",
}

I finally got it working when I removed the extra , like this:

{
 "key":"123sdf",
 "bus_number":"asd234sdf"
}

Hope this helps! Cheers.


Answer 4

Use ast, for example:

In [15]: a = "[{'start_city': '1', 'end_city': 'aaa', 'number': 1},\
...:      {'start_city': '2', 'end_city': 'bbb', 'number': 1},\
...:      {'start_city': '3', 'end_city': 'ccc', 'number': 1}]"
In [16]: import ast
In [17]: ast.literal_eval(a)
Out[17]:
[{'end_city': 'aaa', 'number': 1, 'start_city': '1'},
 {'end_city': 'bbb', 'number': 1, 'start_city': '2'},
 {'end_city': 'ccc', 'number': 1, 'start_city': '3'}]

Answer 5

A different case in which I encountered this was when I was using echo to pipe the JSON into my Python script and carelessly wrapped the JSON string in double quotes:

echo "{"thumbnailWidth": 640}" | myscript.py

Note that the JSON string itself has quotes and I should have done:

echo '{"thumbnailWidth": 640}' | myscript.py

As it was, this is what the Python script received: {thumbnailWidth: 640}; the double quotes were effectively stripped.
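
For completeness, a minimal sketch of what a script like the hypothetical myscript.py might do with the piped JSON (reading it from stdin), which is where the missing inner quotes trigger the error:

import json
import sys

# Parse whatever arrives on stdin; this fails with "Expecting property name"
# if the shell quoting stripped the inner double quotes from the JSON.
data = json.load(sys.stdin)
print(data["thumbnailWidth"])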


How to import data from MongoDB into pandas?

Question: How do I import data from MongoDB into pandas?

I have a large amount of data in a collection in MongoDB which I need to analyze. How do I import that data into pandas?

I am new to pandas and numpy.

EDIT: The MongoDB collection contains sensor values tagged with date and time. The sensor values are of float data type.

Sample Data:

{
"_cls" : "SensorReport",
"_id" : ObjectId("515a963b78f6a035d9fa531b"),
"_types" : [
    "SensorReport"
],
"Readings" : [
    {
        "a" : 0.958069536790466,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:26:35.297Z"),
        "b" : 6.296118156595,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95574014778624,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:09.963Z"),
        "b" : 6.29651468650064,
        "_cls" : "Reading"
    },
    {
        "a" : 0.953648289182713,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:37.545Z"),
        "b" : 7.29679823731148,
        "_cls" : "Reading"
    },
    {
        "a" : 0.955931884300997,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:28:21.369Z"),
        "b" : 6.29642922525632,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95821381,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:20.801Z"),
        "b" : 7.28956613,
        "_cls" : "Reading"
    },
    {
        "a" : 4.95821335,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:36.931Z"),
        "b" : 6.28956574,
        "_cls" : "Reading"
    },
    {
        "a" : 9.95821341,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:42:09.971Z"),
        "b" : 0.28956488,
        "_cls" : "Reading"
    },
    {
        "a" : 1.95667927,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:43:55.463Z"),
        "b" : 0.29115237,
        "_cls" : "Reading"
    }
],
"latestReportTime" : ISODate("2013-04-02T08:43:55.463Z"),
"sensorName" : "56847890-0",
"reportCount" : 8
}

Answer 0

pymongo might give you a hand; the following is some code I'm using:

import pandas as pd
from pymongo import MongoClient


def _connect_mongo(host, port, username, password, db):
    """ A util for making a connection to mongo """

    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        conn = MongoClient(mongo_uri)
    else:
        conn = MongoClient(host, port)


    return conn[db]


def read_mongo(db, collection, query={}, host='localhost', port=27017, username=None, password=None, no_id=True):
    """ Read from Mongo and Store into DataFrame """

    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query)

    # Expand the cursor and construct the DataFrame
    df =  pd.DataFrame(list(cursor))

    # Delete the _id
    if no_id:
        del df['_id']

    return df
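
A quick usage sketch (the database and collection names here are placeholders):

# Pull an entire collection into a DataFrame, dropping Mongo's _id column.
df = read_mongo('test_database', 'sensor_reports', query={}, host='localhost')
print(df.head())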

Answer 1

You can load your MongoDB data into a pandas DataFrame using this code. It works for me; hopefully it works for you too.

import pymongo
import pandas as pd
from pymongo import MongoClient
client = MongoClient()
db = client.database_name
collection = db.collection_name
data = pd.DataFrame(list(collection.find()))
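
If you do not want the _id column at all, a small variation is to exclude it in the query projection instead of deleting it afterwards:

# Exclude _id at query time so it never reaches the DataFrame.
data = pd.DataFrame(list(collection.find({}, {'_id': False})))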

Answer 2

Monary does exactly that, and it’s super fast. (another link)

See this cool post which includes a quick tutorial and some timings.


Answer 3

As per PEP 20 (the Zen of Python), simple is better than complex:

import pandas as pd
df = pd.DataFrame.from_records(db.<database_name>.<collection_name>.find())

You can include query conditions just as you would when working with a regular MongoDB database, or even use find_one() to get only one document from the database, etc.

and voila!
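
For instance, a sketch with a query condition (the client, database, and collection names are placeholders):

import pandas as pd
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
# Only documents whose reportCount is greater than 5, as in the sample data.
df = pd.DataFrame.from_records(
    list(client.test_database.sensor_reports.find({"reportCount": {"$gt": 5}}))
)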


Answer 4

import pandas as pd
from odo import odo

data = odo('mongodb://localhost/db::collection', pd.DataFrame)

Answer 5

For dealing with out-of-core (not fitting into RAM) data efficiently (i.e. with parallel execution), you can try Python Blaze ecosystem: Blaze / Dask / Odo.

Blaze (and Odo) have out-of-the-box functions to deal with MongoDB.

A few useful articles to start off:

And an article which shows what amazing things are possible with Blaze stack: Analyzing 1.7 Billion Reddit Comments with Blaze and Impala (essentially, querying 975 Gb of Reddit comments in seconds).

P.S. I’m not affiliated with any of these technologies.


Answer 6

Another option I found very useful is:

from pandas.io.json import json_normalize

cursor = my_collection.find()
df = json_normalize(cursor)

This way you get the unfolding of nested MongoDB documents for free.
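
In recent pandas versions json_normalize also lives at the top level as pandas.json_normalize. A hedged sketch that flattens the nested Readings array from the question's sample document (my_collection and the field names are taken from the answer and the sample):

import pandas as pd

docs = list(my_collection.find())       # each document looks like the SensorReport sample
df = pd.json_normalize(
    docs,
    record_path='Readings',             # one row per nested reading
    meta=['sensorName', 'reportCount']  # copy these top-level fields onto every row
)
print(df[['a', 'b', 'ReadingUpdatedDate', 'sensorName']].head())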


Answer 7

Using

pandas.DataFrame(list(...))

will consume a lot of memory if the iterator/generator result is large.

It is better to generate small chunks and concatenate them at the end:

import pandas as pd

def iterator2dataframes(iterator, chunk_size: int):
  """Turn an iterator into multiple small pandas.DataFrame

  This is a balance between memory and efficiency
  """
  records = []
  frames = []
  for i, record in enumerate(iterator):
    records.append(record)
    if i % chunk_size == chunk_size - 1:
      frames.append(pd.DataFrame(records))
      records = []
  if records:
    frames.append(pd.DataFrame(records))
  return pd.concat(frames)
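
A usage sketch (the collection name and chunk size are arbitrary):

# Assuming db is a pymongo database handle, build the DataFrame
# from a cursor in chunks of 10,000 documents.
df = iterator2dataframes(db.sensor_reports.find(), 10000)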

Answer 8

http://docs.mongodb.org/manual/reference/mongoexport

Export to CSV and use read_csv, or export to JSON and use DataFrame.from_records().
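
A hedged sketch of the CSV route, assuming the mongoexport CLI (MongoDB database tools) is on the PATH; the database, collection, and field names are placeholders:

import subprocess

import pandas as pd

# Export selected fields of a collection to CSV with the mongoexport CLI ...
subprocess.run([
    "mongoexport",
    "--db=test_database",
    "--collection=sensor_reports",
    "--type=csv",
    "--fields=sensorName,reportCount,latestReportTime",
    "--out=sensor_reports.csv",
], check=True)

# ... then read it back with pandas.
df = pd.read_csv("sensor_reports.csv")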


Answer 9

Following this great answer by waitingkuo, I would like to add the possibility of doing that using chunksize, in line with .read_sql() and .read_csv(). I extend the answer from Deu Leung by avoiding going one by one through each 'record' of the 'iterator'/'cursor'. I will borrow the previous read_mongo function.

import pandas as pd
from pymongo import MongoClient


def read_mongo(db,
               collection, query={},
               host='localhost', port=27017,
               username=None, password=None,
               chunksize=100, no_id=True):
    """ Read from Mongo and Store into DataFrame """

    # Connect to MongoDB
    #db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)
    client = MongoClient(host=host, port=port)
    # Make a query to the specific DB and Collection
    db_aux = client[db]

    # Some variables to create the chunks
    skips_variable = range(0, db_aux[collection].find(query).count(), int(chunksize))
    if len(skips_variable) <= 1:
        skips_variable = [0, len(skips_variable)]

    # Iteration to create the dataframe in chunks.
    for i in range(1, len(skips_variable)):

        # Expand the cursor and construct the DataFrame
        #df_aux = pd.DataFrame(list(cursor_aux[skips_variable[i-1]:skips_variable[i]]))
        df_aux = pd.DataFrame(list(db_aux[collection].find(query)[skips_variable[i-1]:skips_variable[i]]))

        if no_id:
            del df_aux['_id']

        # Concatenate the chunks into a unique df
        if 'df' not in locals():
            df = df_aux
        else:
            df = pd.concat([df, df_aux], ignore_index=True)

    return df

Answer 10

A similar approach to Rafael Valero's, waitingkuo's, and Deu Leung's, using pagination:

def read_mongo(
        db,
        collection, query=None,
        host='localhost', port=27017, username=None, password=None,
        chunksize=100, page_num=1, no_id=True):

    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Calculate number of documents to skip
    skips = chunksize * (page_num - 1)

    # Sorry, this is in spanish
    # https://www.toptal.com/python/c%C3%B3digo-buggy-python-los-10-errores-m%C3%A1s-comunes-que-cometen-los-desarrolladores-python/es
    if not query:
        query = {}

    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query).skip(skips).limit(chunksize)

    # Expand the cursor and construct the DataFrame
    df =  pd.DataFrame(list(cursor))

    # Delete the _id
    if no_id:
        del df['_id']

    return df
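
Usage is then a matter of asking for successive pages; a minimal sketch (the database/collection names and page size are placeholders):

# First two pages of 1,000 documents each, concatenated into one DataFrame.
page1 = read_mongo('test_database', 'sensor_reports', chunksize=1000, page_num=1)
page2 = read_mongo('test_database', 'sensor_reports', chunksize=1000, page_num=2)
df = pd.concat([page1, page2], ignore_index=True)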

Answer 11

You can achieve what you want with pdmongo in three lines:

import pdmongo as pdm
import pandas as pd
df = pdm.read_mongo("MyCollection", [], "mongodb://localhost:27017/mydb")

If your data is very large, you can run an aggregation query first, filtering out the data you do not want and mapping the rest to your desired columns.

Here is an example of mapping Readings.a to column a and filtering by reportCount column:

import pdmongo as pdm
import pandas as pd
df = pdm.read_mongo("MyCollection", [{'$match': {'reportCount': {'$gt': 6}}}, {'$unwind': '$Readings'}, {'$project': {'a': '$Readings.a'}}], "mongodb://localhost:27017/mydb")

read_mongo accepts the same arguments as pymongo's aggregate.


Using .sort in PyMongo

Question: Using .sort in PyMongo

With PyMongo, when I try to retrieve objects sorted by their ‘number’ and ‘date’ fields like this:

db.test.find({"number": {"$gt": 1}}).sort({"number": 1, "date": -1})

I get this error:

TypeError: if no direction is specified, key_or_list must be an instance of list

What’s wrong with my sort query?


Answer 0

sort should be a list of key-direction pairs, that is

db.test.find({"number": {"$gt": 1}}).sort([("number", 1), ("date", -1)])

The reason why this has to be a list is that the ordering of the arguments matters, and dicts are not ordered in Python < 3.6.
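
Equivalently, you can use pymongo's named direction constants, which read a little better than bare 1/-1:

import pymongo

db.test.find({"number": {"$gt": 1}}).sort(
    [("number", pymongo.ASCENDING), ("date", pymongo.DESCENDING)]
)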


How to sort MongoDB with pymongo

Question: How to sort MongoDB with pymongo

I'm trying to use the sort feature when querying my MongoDB, but it is failing. The same query works in the MongoDB console but not here. Code is as follows:

import pymongo

from  pymongo import Connection
connection = Connection()
db = connection.myDB
print db.posts.count()
for post in db.posts.find({}, {'entities.user_mentions.screen_name':1}).sort({u'entities.user_mentions.screen_name':1}):
    print post

The error I get is as follows:

Traceback (most recent call last):
  File "find_ow.py", line 7, in <module>
    for post in db.posts.find({}, {'entities.user_mentions.screen_name':1}).sort({'entities.user_mentions.screen_name':1},1):
  File "/Library/Python/2.6/site-packages/pymongo-2.0.1-py2.6-macosx-10.6-universal.egg/pymongo/cursor.py", line 430, in sort
  File "/Library/Python/2.6/site-packages/pymongo-2.0.1-py2.6-macosx-10.6-universal.egg/pymongo/helpers.py", line 67, in _index_document
TypeError: first item in each key pair must be a string

I found a link elsewhere that says I need to place a 'u' in front of the key if using pymongo, but that didn't work either. Has anyone else gotten this to work, or is this a bug?


Answer 0

.sort(), in pymongo, takes key and direction as parameters.

So if you want to sort by, let's say, id, then you should use .sort("_id", 1).

For multiple fields:

.sort([("field1", pymongo.ASCENDING), ("field2", pymongo.DESCENDING)])

Answer 1

You can try this:

db.Account.find().sort("UserName")  
db.Account.find().sort("UserName",pymongo.ASCENDING)   
db.Account.find().sort("UserName",pymongo.DESCENDING)  

Answer 2

This also works:

db.Account.find().sort('UserName', -1)
db.Account.find().sort('UserName', 1)

I'm using this in my code; please comment if I'm doing something wrong here, thanks.


Answer 3

Why does the Python driver use a list of tuples instead of a dict?

In Python (before 3.7), you cannot guarantee that a dictionary will be interpreted in the order in which you declared it.

So, in the mongo shell you can do .sort({'field1': 1, 'field2': 1}) and it will sort by field1 at the first level and field2 at the second level.

If this syntax were used in Python, there would be a chance of sorting by field2 at the first level. With a list of tuples, there is no such risk.

.sort([("field1",pymongo.ASCENDING), ("field2",pymongo.DESCENDING)])

Answer 4

.sort([("field1",pymongo.ASCENDING), ("field2",pymongo.DESCENDING)])

Python使用键方向。您可以使用上述方式。

因此,您可以这样做

for post in db.posts.find().sort('entities.user_mentions.screen_name',pymongo.ASCENDING):
        print post
.sort([("field1",pymongo.ASCENDING), ("field2",pymongo.DESCENDING)])

Pymongo uses (key, direction) pairs, so you can use the form shown above.

In your case you can do this:

for post in db.posts.find().sort('entities.user_mentions.screen_name',pymongo.ASCENDING):
        print post

Answer 5

TL;DR: The aggregation pipeline is faster compared to the conventional .find().sort().

Now moving to the real explanation. There are two ways to perform sorting operations in MongoDB:

  1. Using .find() and .sort().
  2. Or using the aggregation pipeline.

As many have suggested, .find().sort() is the simplest way to perform the sorting.

.sort([("field1",pymongo.ASCENDING), ("field2",pymongo.DESCENDING)])

However, this is a slow process compared to the aggregation pipeline.

Coming to the aggregation pipeline method, the steps to implement a simple aggregation pipeline for sorting are:

  1. $match (optional step)
  2. $sort

NOTE: In my experience, the aggregation pipeline works a bit faster than the .find().sort() method.

Here’s an example of the aggregation pipeline.

db.collection_name.aggregate([{
    "$match": {
        # your query - optional step
    }
},
{
    "$sort": {
        "field_1": pymongo.ASCENDING,
        "field_2": pymongo.DESCENDING,
        ....
    }
}])

Try this method yourself, compare the speed and let me know about this in the comments.

Edit: Do not forget to use allowDiskUse=True when sorting on multiple fields, otherwise it will throw an error.
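
In pymongo, allowDiskUse is passed as a keyword argument to aggregate(); a hedged sketch with placeholder field names:

import pymongo

cursor = db.collection_name.aggregate(
    [
        {"$sort": {"field_1": pymongo.ASCENDING, "field_2": pymongo.DESCENDING}}
    ],
    allowDiskUse=True  # let the sort stage spill to disk for large result sets
)
for doc in cursor:
    print(doc)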


Answer 6

Say you want to sort by the 'created_on' field; then you can do it like this:

.sort('created_on', 1 if sort_type == 'asc' else -1)