mongodb: insert if not exists

Question: mongodb: insert if not exists

Every day, I receive a stock of documents (an update). What I want to do is insert each item that does not already exist.

  • I also want to keep track of the first time I inserted them, and the last time I saw them in an update.
  • I don’t want to have duplicate documents.
  • I don’t want to remove a document which has previously been saved, but is not in my update.
  • 95% (estimated) of the records are unmodified from day to day.

I am using the Python driver (pymongo).

What I currently do is (pseudo-code):

for each document in update:
      existing_document = my_collection.find_one(document)
      if not existing_document:
           document['insertion_date'] = now
      else:
           document = existing_document
      document['last_update_date'] = now
      my_collection.save(document)

My problem is that it is very slow (40 minutes for fewer than 100,000 records, and I have millions of them in the update). I am pretty sure there is something built in for doing this, but the documentation for update() is mmmhhh…. a bit terse…. (http://www.mongodb.org/display/DOCS/Updating)

Can someone advise how to do it faster?


Answer 0

Sounds like you want to do an “upsert”. MongoDB has built-in support for this. Pass an extra parameter to your update() call: {upsert:true}. For example:

key = {'key': 'value'}
data = {'key2': 'value2', 'key3': 'value3'}
# In Python, upsert must be passed as a keyword argument.
# Note: update() is the legacy PyMongo API and was removed in PyMongo 4.
coll.update(key, data, upsert=True)

This replaces your if-find-else-update block entirely. It will insert if the key doesn’t exist and will update if it does.

Before:

{"key":"value", "key2":"Ohai."}

After:

{"key":"value", "key2":"value2", "key3":"value3"}

You can also specify what data you want to write:

data = {"$set":{"key2":"value2"}}

Now only the value of “key2” is updated in the selected document; everything else is left untouched.
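In current PyMongo releases the same upsert is usually written with update_one(), since update() is no longer available. The following is a minimal sketch, assuming `coll` is a PyMongo collection; the helper name `upsert_fields` is hypothetical and not part of the driver.

```python
def upsert_fields(coll, key, data):
    """Upsert: set the fields in `data` on the document matching `key`.

    If nothing matches, MongoDB inserts a new document built from the
    equality fields of `key` plus the fields in `data`.
    """
    # update_one() requires an update operator such as $set.
    return coll.update_one(key, {"$set": data}, upsert=True)


# Hypothetical usage, mirroring the example above:
# upsert_fields(coll, {'key': 'value'}, {'key2': 'value2', 'key3': 'value3'})
```

Because the fields go through $set, any other fields already stored on a matching document are left as they are.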


Answer 1

As of MongoDB 2.4, you can use $setOnInsert (http://docs.mongodb.org/manual/reference/operator/setOnInsert/)

Set ‘insertion_date’ using $setOnInsert and ‘last_update_date’ using $set in your upsert command.

To turn your pseudocode into a working example:

from datetime import datetime

now = datetime.utcnow()
for document in update:
    collection.update_one(
        {"_id": document["_id"]},
        {
            "$setOnInsert": {"insertion_date": now},
            "$set": {"last_update_date": now},
        },
        upsert=True,
    )
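With millions of records, one round trip per document may still be slow. A possible refinement is to group the same upserts into batches and send them with bulk_write(). This is only a sketch: the helper names and the batch size of 1000 are assumptions, and like the loop above it stores only the two timestamps.

```python
from datetime import datetime, timezone
from itertools import islice


def upsert_specs(documents, now):
    """Yield one (filter, update) pair per incoming document."""
    for document in documents:
        yield (
            {"_id": document["_id"]},
            {
                "$setOnInsert": {"insertion_date": now},
                "$set": {"last_update_date": now},
            },
        )


def batched(iterable, size):
    """Yield lists of at most `size` items."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, size)):
        yield batch


def bulk_upsert(collection, documents, batch_size=1000):
    """Send the upserts in unordered batches instead of one by one."""
    # Imported here so the two helpers above work without the driver installed.
    from pymongo import UpdateOne

    now = datetime.now(timezone.utc)
    for batch in batched(upsert_specs(documents, now), batch_size):
        operations = [UpdateOne(flt, upd, upsert=True) for flt, upd in batch]
        # ordered=False lets the server keep going past individual failures.
        collection.bulk_write(operations, ordered=False)
```

Calling `bulk_upsert(collection, update)` would then replace the per-document loop.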

Answer 2

You could always make a unique index, which causes MongoDB to reject a conflicting save. Consider the following done using the mongodb shell:

> db.getCollection("test").insert({a:1, b:2, c:3})
> db.getCollection("test").find()
{ "_id" : ObjectId("50c8e35adde18a44f284e7ac"), "a" : 1, "b" : 2, "c" : 3 }
> db.getCollection("test").createIndex({"a" : 1}, {unique: true})
> db.getCollection("test").insert({a:2, b:12, c:13})      // This works
> db.getCollection("test").insert({a:1, b:12, c:13})      // This fails
E11000 duplicate key error index: foo.test.$a_1  dup key: { : 1.0 }
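From Python, the same idea could look roughly like the sketch below: create the unique index once, then attempt the insert and treat a duplicate-key error as "already there". The helper name is hypothetical, and the exception class is passed in as a parameter so the function itself does not depend on the driver; with PyMongo that class would be pymongo.errors.DuplicateKeyError.

```python
def insert_unless_duplicate(collection, document, duplicate_error):
    """Insert `document`; return False if the unique index rejects it.

    `duplicate_error` is the exception class that signals a duplicate key,
    e.g. pymongo.errors.DuplicateKeyError.
    """
    try:
        collection.insert_one(document)
    except duplicate_error:
        return False
    return True


# Hypothetical usage with PyMongo:
# from pymongo.errors import DuplicateKeyError
# collection.create_index([("a", 1)], unique=True)
# insert_unless_duplicate(collection, {"a": 1, "b": 12, "c": 13}, DuplicateKeyError)
```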

Answer 3

You can use an upsert with the $setOnInsert operator.

db.Table.update({noExist: true}, {"$setOnInsert": {xxxYourDocumentxxx}}, {upsert: true})
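For the question's use case, the update document could be built as in the sketch below, which writes the whole document only on insert and refreshes the "last seen" timestamp on every run. The function name is hypothetical, the field names `insertion_date` and `last_update_date` are taken from the question, and the sketch assumes incoming documents do not already carry those two fields.

```python
def insert_only_update(document, now):
    """Build an update that writes `document` only when it is inserted."""
    # _id is supplied by the query filter, so it is left out of the update.
    fields = {k: v for k, v in document.items() if k != "_id"}
    fields["insertion_date"] = now
    return {
        "$setOnInsert": fields,
        # A field may not appear under both operators, so only the timestamp
        # that changes on every run is placed under $set.
        "$set": {"last_update_date": now},
    }


# Hypothetical usage with PyMongo:
# collection.update_one({"_id": doc["_id"]}, insert_only_update(doc, now), upsert=True)
```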

Answer 4

1. Use Update.

Drawing from Van Nguyen’s answer above, use update instead of save. This gives you access to the upsert option.

NOTE: When a match is found, this method overwrites the entire document (from the docs)

var conditions = { name: 'borne' }
  , update = { $inc: { visits: 1 }}
  , options = { multi: true };

Model.update(conditions, update, options, callback);

function callback (err, numAffected) {
  // numAffected is the number of updated documents
}

1.a. Use $set

If you want to update only some fields of the document rather than the whole thing, you can use the $set operator with update (again, from the docs). So, if you want to set…

var query = { name: 'borne' };
Model.update(query, { name: 'jason borne' }, options, callback)

Send it as…

Model.update(query, { $set: { name: 'jason borne' }}, options, callback)

This helps prevent accidentally overwriting all of your document(s) with { name: 'jason borne' }.


Answer 5

Summary

  • You have an existing collection of records.
  • You have a set of records that contain updates to the existing records.
  • Some of the updates don’t really update anything; they duplicate what you have already.
  • All updates contain the same fields that are there already, just possibly different values.
  • You want to track when a record was last changed, that is, when a value actually changed.

Note: I’m presuming PyMongo; change to suit your language of choice.

Instructions:

  1. Create the collection with a unique index (unique=true) so you don’t get duplicate records.

  2. Iterate over your input records, creating batches of 15,000 records or so. For each record in the batch, create a dict consisting of the data you want to insert, presuming each one is going to be a new record. Add the ‘created’ and ‘updated’ timestamps to these. Issue this as a batch insert command with the ‘ContinueOnError’ flag set to true, so everything else is inserted even if there’s a duplicate key in there (which it sounds like there will be). THIS WILL HAPPEN VERY FAST. Bulk inserts rock; I’ve gotten 15k/second performance levels. For further notes on ContinueOnError, see http://docs.mongodb.org/manual/core/write-operations/

    Record inserts happen VERY fast, so you’ll be done with those inserts in no time. Now it’s time to update the relevant records. Do this with a batch retrieval, which is much faster than fetching one record at a time.

  3. Iterate over all your input records again, creating batches of 15K or so. Extract the keys (best if there’s one key, but it can’t be helped if there isn’t). Retrieve this bunch of records from Mongo with a db.collectionNameBlah.find({ field : { $in : [ 1, 2, 3, … ] } }) query. For each of these records, determine whether there’s an update, and if so, issue the update, including updating the ‘updated’ timestamp.

    Unfortunately, we should note that MongoDB 2.4 and below do NOT include a bulk update operation. They’re working on that.

Key Optimization Points:

  • Doing the inserts in bulk will vastly speed up your operations.
  • Retrieving records en masse will speed things up, too.
  • Individual updates are the only possible route now, but 10Gen is working on it. Presumably, this will be in 2.6, though I’m not sure if it will be finished by then; there’s a lot of stuff to do (I’ve been following their Jira system).
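The batching and the batch retrieval in step 3 can be sketched as below. The helper names, the default key `_id`, and the comparison rule (only fields present in the incoming record are compared) are assumptions made for illustration; `collection` is assumed to behave like a PyMongo collection.

```python
from itertools import islice


def batches(records, size=15000):
    """Yield lists of at most `size` records."""
    iterator = iter(records)
    while batch := list(islice(iterator, size)):
        yield batch


def changed_records(collection, batch, key="_id"):
    """Return the records in `batch` whose stored copy differs."""
    incoming = {record[key]: record for record in batch}
    changed = []
    # One round trip fetches every stored record for this batch.
    for stored in collection.find({key: {"$in": list(incoming)}}):
        record = incoming[stored[key]]
        # Compare only the fields supplied in the update.
        if any(stored.get(field) != value for field, value in record.items()):
            changed.append(record)
    return changed
```

For step 2, current PyMongo exposes the continue-on-error behaviour as `insert_many(batch, ordered=False)`, which attempts every insert and then raises BulkWriteError if any of them failed.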

Answer 6

I don’t think mongodb supports this type of selective upserting. I have the same problem as LeMiz, and using update(criteria, newObj, upsert, multi) doesn’t work right when dealing with both a ‘created’ and ‘updated’ timestamp. Given the following upsert statement:

update( { "name": "abc" }, 
        { $set: { "created": "2010-07-14 11:11:11", 
                  "updated": "2010-07-14 11:11:11" }},
        true, true ) 

Scenario #1 – document with ‘name’ of ‘abc’ does not exist: New document is created with ‘name’ = ‘abc’, ‘created’ = 2010-07-14 11:11:11, and ‘updated’ = 2010-07-14 11:11:11.

Scenario #2 – document with ‘name’ of ‘abc’ already exists with the following: ‘name’ = ‘abc’, ‘created’ = 2010-07-12 09:09:09, and ‘updated’ = 2010-07-13 10:10:10. After the upsert, the document would now be the same as the result in scenario #1. There’s no way to specify in an upsert which fields should be set only when inserting and which should be left alone when updating.

My solution was to create a unique index on the criteria fields, perform an insert, and immediately afterward perform an update just on the ‘updated’ field.


Answer 7

In general, using update is better in MongoDB, as it will just create the document if it doesn’t exist yet, though I’m not sure how to do that with your Python driver.

Second, if you only need to know whether or not that document exists, count(), which returns only a number, will be a better option than find_one, which presumably transfers the whole document from MongoDB and causes unnecessary traffic.
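In current PyMongo, count() has been replaced by count_documents(). An existence check along these lines might look like the sketch below; the helper name `exists` is hypothetical.

```python
def exists(collection, query):
    """Return True if at least one document matches `query`."""
    # limit=1 lets the server stop counting at the first match.
    return collection.count_documents(query, limit=1) > 0
```

Note that this still costs one round trip per document, so it does not by itself remove the per-record overhead described in the question.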