Excluding directories in os.walk

Question: Excluding directories in os.walk

I’m writing a script that descends into a directory tree (using os.walk()) and then visits each file matching a certain file extension. However, since some of the directory trees that my tool will be used on also contain subdirectories that in turn contain a LOT of useless (for the purpose of this script) stuff, I figured I’d add an option for the user to specify a list of directories to exclude from the traversal.

This is easy enough with os.walk(). After all, it’s up to me to decide whether I actually want to visit the respective files / dirs yielded by os.walk() or just skip them. The problem is that if I have, for example, a directory tree like this:

root--
     |
     --- dirA
     |
     --- dirB
     |
     --- uselessStuff --
                       |
                       --- moreJunk
                       |
                       --- yetMoreJunk

and I want to exclude uselessStuff and all its children, os.walk() will still descend into all the (potentially thousands of) subdirectories of uselessStuff, which, needless to say, slows things down a lot. In an ideal world, I could tell os.walk() to not even bother yielding any more children of uselessStuff, but to my knowledge there is no way of doing that (is there?).

Does anyone have an idea? Maybe there’s a third-party library that provides something like that?


Answer 0

Modifying dirs in place prunes the files and directories that os.walk visits afterwards:

import os

exclude = {'New folder', 'Windows', 'Desktop'}  # directory names to skip
for root, dirs, files in os.walk(top, topdown=True):  # top: root path to walk
    dirs[:] = [d for d in dirs if d not in exclude]  # in place, so os.walk sees the change

From help(os.walk):

When topdown is true, the caller can modify the dirnames list in-place (e.g., via del or slice assignment), and walk will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search…
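
To see the pruning in action, here is a minimal self-contained sketch that rebuilds the tree from the question in a temporary directory. The helper names walk_pruned and build_demo_tree and the file name file.txt are made up for illustration; only os.walk and the slice assignment come from the answer itself.

```python
import os
import tempfile


def walk_pruned(top, exclude):
    """Yield file paths under top, never descending into excluded directory names."""
    for root, dirs, files in os.walk(top, topdown=True):
        # Slice assignment mutates the very list os.walk recurses on.
        dirs[:] = [d for d in dirs if d not in exclude]
        for name in files:
            yield os.path.join(root, name)


def build_demo_tree(base):
    """Recreate the question's tree, with one hypothetical file per leaf directory."""
    leaves = (
        "dirA",
        "dirB",
        os.path.join("uselessStuff", "moreJunk"),
        os.path.join("uselessStuff", "yetMoreJunk"),
    )
    for sub in leaves:
        path = os.path.join(base, sub)
        os.makedirs(path)
        with open(os.path.join(path, "file.txt"), "w") as fh:
            fh.write("x")


with tempfile.TemporaryDirectory() as base:
    build_demo_tree(base)
    found = sorted(
        os.path.relpath(p, base) for p in walk_pruned(base, {"uselessStuff"})
    )
    print(found)  # only the files under dirA and dirB are listed
```

Note that the match is on the bare directory name, so a directory called uselessStuff is skipped at any depth, not just directly under the root. If only one specific path should be excluded, compare os.path.join(root, d) against full paths instead.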


Answer 1

… an alternative form of @unutbu’s excellent answer that reads a little more directly, given that the intent is to exclude directories, at the cost of O(n**2) vs O(n) time.

(Making a copy of the dirs list with list(dirs) is required for correct execution)

# exclude = set([...])
for root, dirs, files in os.walk(top, topdown=True):
    for d in list(dirs):  # iterate over a copy, since dirs is mutated below
        if d in exclude:
            dirs.remove(d)
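
Why the copy matters can be shown without touching the file system. The following sketch uses a plain list with made-up names as a stand-in for dirs. Removing from a list while iterating over that same list shifts the remaining items left, so the element right after each removal is never examined.

```python
exclude = {"a", "b"}

# Iterating over the list being mutated: after "a" is removed, "b" slides
# into index 0 and the loop moves on to index 1, so "b" is never checked.
dirs = ["a", "b", "c"]
for d in dirs:
    if d in exclude:
        dirs.remove(d)
skipped = list(dirs)

# Iterating over a copy: every original entry is checked.
dirs = ["a", "b", "c"]
for d in list(dirs):
    if d in exclude:
        dirs.remove(d)
correct = list(dirs)

print(skipped)  # ['b', 'c']
print(correct)  # ['c']
```

With os.walk, the first variant would leave an excluded directory in dirs, and the walk would still descend into it.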