SQLAlchemy:flush()和commit()有什么区别?

问题:SQLAlchemy:flush()和commit()有什么区别?

flush()commit()SQLAlchemy 之间有什么区别?

我已经阅读了文档,但没有一个更明智-他们似乎假设了我没有的预见性。

我对它们对内存使用量的影响特别感兴趣。我正在从一系列文件(总共约500万行)中将一些数据加载到数据库中,而我的会话有时会崩溃-这是一个大型数据库,并且一台内存不足的机器。

我想知道我使用的电话太多commit()还是不够flush()-但是如果不真正了解两者之间的区别,很难分辨!

What the difference is between flush() and commit() in SQLAlchemy?

I’ve read the docs, but am none the wiser – they seem to assume a pre-understanding that I don’t have.

I’m particularly interested in their impact on memory usage. I’m loading some data into a database from a series of files (around 5 million rows in total) and my session is occasionally falling over – it’s a large database and a machine with not much memory.

I’m wondering if I’m using too many commit() and not enough flush() calls – but without really understanding what the difference is, it’s hard to tell!


回答 0

会话对象基本上是数据库更改(更新,插入,删除)的正在进行的事务。这些操作在提交之前不会持久化到数据库中(如果您的程序在会话中事务中由于某种原因中止,则其中所有未提交的更改都将丢失)。

会话对象向注册了事务操作session.add(),但是直到session.flush()调用它之前,它们都不会将它们传递给数据库。

session.flush()将一系列操作传达给数据库(插入,更新,删除)。数据库将它们维护为事务中的挂起操作。直到数据库收到当前事务的COMMIT为止,session.commit()所做的更改才会永久保留在磁盘上,或对其他事务可见。

session.commit() 将这些更改提交(持久)到数据库。

flush()始终称为一个呼叫的一部分commit()1)。

当您使用Session对象查询数据库时,查询将同时从数据库及其所保存的未提交事务的已刷新部分返回结果。默认情况下,Session反对autoflush其操作,但是可以禁用它。

希望这个例子可以使这个更加清楚:

#---
s = Session()

s.add(Foo('A')) # The Foo('A') object has been added to the session.
                # It has not been committed to the database yet,
                #   but is returned as part of a query.
print 1, s.query(Foo).all()
s.commit()

#---
s2 = Session()
s2.autoflush = False

s2.add(Foo('B'))
print 2, s2.query(Foo).all() # The Foo('B') object is *not* returned
                             #   as part of this query because it hasn't
                             #   been flushed yet.
s2.flush()                   # Now, Foo('B') is in the same state as
                             #   Foo('A') was above.
print 3, s2.query(Foo).all() 
s2.rollback()                # Foo('B') has not been committed, and rolling
                             #   back the session's transaction removes it
                             #   from the session.
print 4, s2.query(Foo).all()

#---
Output:
1 [<Foo('A')>]
2 [<Foo('A')>]
3 [<Foo('A')>, <Foo('B')>]
4 [<Foo('A')>]

A Session object is basically an ongoing transaction of changes to a database (update, insert, delete). These operations aren’t persisted to the database until they are committed (if your program aborts for some reason in mid-session transaction, any uncommitted changes within are lost).

The session object registers transaction operations with session.add(), but doesn’t yet communicate them to the database until session.flush() is called.

session.flush() communicates a series of operations to the database (insert, update, delete). The database maintains them as pending operations in a transaction. The changes aren’t persisted permanently to disk, or visible to other transactions until the database receives a COMMIT for the current transaction (which is what session.commit() does).

session.commit() commits (persists) those changes to the database.

flush() is always called as part of a call to commit() (1).

When you use a Session object to query the database, the query will return results both from the database and from the flushed parts of the uncommitted transaction it holds. By default, Session objects autoflush their operations, but this can be disabled.

Hopefully this example will make this clearer:

#---
s = Session()

s.add(Foo('A')) # The Foo('A') object has been added to the session.
                # It has not been committed to the database yet,
                #   but is returned as part of a query.
print 1, s.query(Foo).all()
s.commit()

#---
s2 = Session()
s2.autoflush = False

s2.add(Foo('B'))
print 2, s2.query(Foo).all() # The Foo('B') object is *not* returned
                             #   as part of this query because it hasn't
                             #   been flushed yet.
s2.flush()                   # Now, Foo('B') is in the same state as
                             #   Foo('A') was above.
print 3, s2.query(Foo).all() 
s2.rollback()                # Foo('B') has not been committed, and rolling
                             #   back the session's transaction removes it
                             #   from the session.
print 4, s2.query(Foo).all()

#---
Output:
1 [<Foo('A')>]
2 [<Foo('A')>]
3 [<Foo('A')>, <Foo('B')>]
4 [<Foo('A')>]

回答 1

正如@snapshoe所说

flush() 将您的SQL语句发送到数据库

commit() 提交交易。

时间session.autocommit == False

commit()flush()如果您设置,会打电话autoflush == True

时间session.autocommit == True

commit()如果您尚未开始交易,则无法调用(您可能没有,因为您可能仅使用此模式来避免手动管理交易)。

在这种模式下,您必须调用flush()以保存您的ORM更改。刷新还会有效地提交您的数据。

As @snapshoe says

flush() sends your SQL statements to the database

commit() commits the transaction.

When session.autocommit == False:

commit() will call flush() if you set autoflush == True.

When session.autocommit == True:

You can’t call commit() if you haven’t started a transaction (which you probably haven’t since you would probably only use this mode to avoid manually managing transactions).

In this mode, you must call flush() to save your ORM changes. The flush effectively also commits your data.


回答 2

如果可以提交,为什么还要刷新?

作为刚接触数据库和sqlalchemy的新手,以前的答案- flush()将SQL语句发送到数据库并commit()保留它们-对我来说还不清楚。这些定义很有意义,但是从定义中并不清楚为什么您要使用冲洗而不是仅仅提交。

由于提交始终会刷新(https://docs.sqlalchemy.org/en/13/orm/session_basics.html#committing),因此听起来确实很相似。我认为要强调的大问题是,刷新不是永久的并且可以撤消,而提交是永久的,在某种意义上,您不能要求数据库撤消上一次提交(我认为)

@snapshoe突出显示,如果要查询数据库并获取包含新添加对象的结果,则需要先刷新(或提交,然后将为您刷新)。也许这对某些人有用,尽管我不确定为什么您要刷新而不是提交(除了可以撤消的琐碎回答之外)。

在另一个示例中,我正在本地数据库和远程服务器之间同步文档,并且如果用户决定取消,则应撤消所有添加/更新/删除操作(即,不进行部分同步,仅进行完全同步)。更新单个文档时,我决定只删除旧行,然后从远程服务器添加更新的版本。事实证明,由于sqlalchemy的编写方式,不能保证提交时的操作顺序。这导致添加了一个重复版本(在尝试删除旧版本之前),这导致数据库无法通过唯一约束。为了解决这个问题,我使用flush()了顺序,但是如果以后同步过程失败,我仍然可以撤消操作。

在以下位置查看我的帖子:在sqlalchemy中提交时,添加和删除是否有任何顺序

同样,有人想知道提交时是否保持添加顺序,即如果我先添加object1然后添加object2是否在将对象添加到会话时将SQLAlchemy保存object1到数据库之前object2

同样,这里假定使用flush()将确保所需的行为。因此,总而言之,刷新的一种用途是提供顺序保证(我认为),同时又允许自己使用提交不提供的“撤消”选项。

自动刷新和自动提交

注意,自动刷新可用于确保查询对更新的数据库起作用,因为sqlalchemy将在执行查询之前刷新。https://docs.sqlalchemy.org/zh/13/orm/session_api.html#sqlalchemy.orm.session.Session.params.autoflush

自动提交是我尚不完全了解的其他东西,但听起来不鼓励使用它:https : //docs.sqlalchemy.org/en/13/orm/session_api.html#sqlalchemy.orm.session.Session.params。自动提交

内存使用情况

现在,最初的问题实际上是想了解刷新与提交对内存的影响。由于持久性或不持久性是数据库提供的(我认为),因此仅刷新就足以卸载数据库-尽管如果您不关心撤消操作,提交也不会受到损害(实际上可能会有所帮助-见下文)。 。

sqlalchemy对已刷新的对象使用弱引用:https : //docs.sqlalchemy.org/en/13/orm/session_state_management.html#session-referencing-behavior

这意味着,如果您没有将对象明确保留在某个地方(例如列表或dict中),则sqlalchemy不会将其保留在内存中。

但是,这时您需要担心数据库方面的问题。可能在未提交的情况下进行刷新会带来一些内存损失,以维护事务。同样,我对此并不陌生,但是这里的链接似乎恰恰表明了这一点:https : //stackoverflow.com/a/15305650/764365

换句话说,尽管应该在内存和性能之间进行权衡,但是提交应该减少内存的使用。换句话说,您可能不想一次提交每一个数据库更改(出于性能原因),但是等待太长时间会增加内存使用率。

Why flush if you can commit?

As someone new to working with databases and sqlalchemy, the previous answers – that flush() sends SQL statements to the DB and commit() persists them – were not clear to me. The definitions make sense but it isn’t immediately clear from the definitions why you would use a flush instead of just committing.

Since a commit always flushes (https://docs.sqlalchemy.org/en/13/orm/session_basics.html#committing) these sound really similar. I think the big issue to highlight is that a flush is not permanent and can be undone, whereas a commit is permanent, in the sense that you can’t ask the database to undo the last commit (I think)

@snapshoe highlights that if you want to query the database and get results that include newly added objects, you need to have flushed first (or committed, which will flush for you). Perhaps this is useful for some people although I’m not sure why you would want to flush rather than commit (other than the trivial answer that it can be undone).

In another example I was syncing documents between a local DB and a remote server, and if the user decided to cancel, all adds/updates/deletes should be undone (i.e. no partial sync, only a full sync). When updating a single document I’ve decided to simply delete the old row and add the updated version from the remote server. It turns out that due to the way sqlalchemy is written, order of operations when committing is not guaranteed. This resulted in adding a duplicate version (before attempting to delete the old one), which resulted in the DB failing a unique constraint. To get around this I used flush() so that order was maintained, but I could still undo if later the sync process failed.

See my post on this at: Is there any order for add versus delete when committing in sqlalchemy

Similarly, someone wanted to know whether add order is maintained when committing, i.e. if I add object1 then add object2, does object1 get added to the database before object2 Does SQLAlchemy save order when adding objects to session?

Again, here presumably the use of a flush() would ensure the desired behavior. So in summary, one use for flush is to provide order guarantees (I think), again while still allowing yourself an “undo” option that commit does not provide.

Autoflush and Autocommit

Note, autoflush can be used to ensure queries act on an updated database as sqlalchemy will flush before executing the query. https://docs.sqlalchemy.org/en/13/orm/session_api.html#sqlalchemy.orm.session.Session.params.autoflush

Autocommit is something else that I don’t completely understand but it sounds like its use is discouraged: https://docs.sqlalchemy.org/en/13/orm/session_api.html#sqlalchemy.orm.session.Session.params.autocommit

Memory Usage

Now the original question actually wanted to know about the impact of flush vs. commit for memory purposes. As the ability to persist or not is something the database offers (I think), simply flushing should be sufficient to offload to the database – although committing shouldn’t hurt (actually probably helps – see below) if you don’t care about undoing.

sqlalchemy uses weak referencing for objects that have been flushed: https://docs.sqlalchemy.org/en/13/orm/session_state_management.html#session-referencing-behavior

This means if you don’t have an object explicitly held onto somewhere, like in a list or dict, sqlalchemy won’t keep it in memory.

However, then you have the database side of things to worry about. Presumably flushing without committing comes with some memory penalty to maintain the transaction. Again, I’m new to this but here’s a link that seems to suggest exactly this: https://stackoverflow.com/a/15305650/764365

In other words, commits should reduce memory usage, although presumably there is a trade-off between memory and performance here. In other words, you probably don’t want to commit every single database change, one at a time (for performance reasons), but waiting too long will increase memory usage.


回答 3

这并不能严格回答最初的问题,但是有些人提到session.autoflush = True不必与您一起使用session.flush()……而且并非总是如此。

如果要在事务中间使用新创建的对象的ID,则必须调用session.flush()

# Given a model with at least this id
class AModel(Base):
   id = Column(Integer, primary_key=True)  # autoincrement by default on integer primary key

session.autoflush = True

a = AModel()
session.add(a)
a.id  # None
session.flush()
a.id  # autoincremented integer

这是因为autoflush自动填写ID(尽管对象的查询将,这有时会导致混乱,如“为什么这个作品在这里,但不是吗?”但是,snapshoe已涵盖这一部分)。


一个相关方面对我来说似乎很重要,但并未真正提及:

为什么不一直承诺呢?-答案是原子性

可以这么说:所有操作必须全部成功执行,否则任何一个都不会生效。

例如,如果要创建/更新/删除某个对象(A),然后创建/更新/删除另一个对象(B),但是如果(B)失败,则要还原(A)。这意味着这两个操作是原子操作。

因此,如果(B)需要(A)的结果,则要flush在(A)commit之后和(B)之后调用。

另外,如果是session.autoflush is True,除了我上面提到的情况或Jimbo的回答中的其他情况外,您将不需要flush手动调用。

This does not strictly answer the original question but some people have mentioned that with session.autoflush = True you don’t have to use session.flush()… And this is not always true.

If you want to use the id of a newly created object in the middle of a transaction, you must call session.flush().

# Given a model with at least this id
class AModel(Base):
   id = Column(Integer, primary_key=True)  # autoincrement by default on integer primary key

session.autoflush = True

a = AModel()
session.add(a)
a.id  # None
session.flush()
a.id  # autoincremented integer

This is because autoflush does NOT auto fill the id (although a query of the object will, which sometimes can cause confusion as in “why this works here but not there?” But snapshoe already covered this part).


One related aspect that seems pretty important to me and wasn’t really mentioned:

Why would you not commit all the time? – The answer is atomicity.

A fancy word to say: an ensemble of operations have to all be executed successfully OR none of them will take effect.

For example, if you want to create/update/delete some object (A) and then create/update/delete another (B), but if (B) fails you want to revert (A). This means those 2 operations are atomic.

Therefore, if (B) needs a result of (A), you want to call flush after (A) and commit after (B).

Also, if session.autoflush is True, except for the case that I mentioned above or others in Jimbo‘s answer, you will not need to call flush manually.