How can I search sub-folders using the glob.glob module?

Question: How can I search sub-folders using the glob.glob module?

I want to open a series of subfolders in a folder and find some text files and print some lines of the text files. I am using this:

configfiles = glob.glob('C:/Users/sam/Desktop/file1/*.txt')

But this does not search the subfolders. Does anyone know how I can use the same command to access subfolders as well?


Answer 0

In Python 3.5 and newer, use the new recursive **/ functionality:

import glob

configfiles = glob.glob('C:/Users/sam/Desktop/file1/**/*.txt', recursive=True)

When recursive is set, ** followed by a path separator matches 0 or more subdirectories.

In earlier Python versions, glob.glob() cannot list files in subdirectories recursively.

In that case I’d use os.walk() combined with fnmatch.filter() instead:

import os
import fnmatch

path = 'C:/Users/sam/Desktop/file1'

configfiles = [os.path.join(dirpath, f)
    for dirpath, dirnames, files in os.walk(path)
    for f in fnmatch.filter(files, '*.txt')]

This walks your directories recursively and returns the full path of every matching .txt file (absolute here, because path is absolute). In this specific case fnmatch.filter() may be overkill; you could also use a .endswith() test:

import os

path = 'C:/Users/sam/Desktop/file1'

configfiles = [os.path.join(dirpath, f)
    for dirpath, dirnames, files in os.walk(path)
    for f in files if f.endswith('.txt')]
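
The question also asks about printing a few lines from each file that is found. A minimal sketch of that step, assuming Python 3.5+; the temporary directory, the file names, and the first_lines helper are made up for illustration and stand in for the real folder:

```python
import glob
import os
import tempfile


def first_lines(path, count=2):
    """Return up to `count` lines from the start of a text file."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for _, line in zip(range(count), handle)]


with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: one file at the top level, one in a subfolder.
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("top.txt", os.path.join("sub", "nested.txt")):
        with open(os.path.join(root, rel), "w", encoding="utf-8") as handle:
            handle.write("line 1\nline 2\nline 3\n")

    # recursive=True lets ** match zero or more directory levels.
    found = glob.glob(os.path.join(root, "**", "*.txt"), recursive=True)
    preview = {os.path.relpath(p, root): first_lines(p) for p in found}

for name, lines in sorted(preview.items()):
    print(name, lines)
```

Swapping the glob.glob() call for one of the os.walk() comprehensions above should give the same set of files on older Python versions.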

Answer 1

To find files in immediate subdirectories:

configfiles = glob.glob(r'C:\Users\sam\Desktop\*\*.txt')

For a recursive version that traverses all subdirectories, you can use ** and pass recursive=True (available since Python 3.5):

configfiles = glob.glob(r'C:\Users\sam\Desktop\**\*.txt', recursive=True)

Both function calls return lists. You could use glob.iglob() to return paths one by one. Or use pathlib:

from pathlib import Path

path = Path(r'C:\Users\sam\Desktop')
txt_files_only_subdirs = path.glob('*/*.txt')
txt_files_all_recursively = path.rglob('*.txt') # including the current dir

Both methods return iterators (you can get paths one by one).
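
To make the difference between the two pathlib calls concrete, here is a small sketch run against a throwaway directory tree; the file names are invented for the example:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a" / "b").mkdir(parents=True)
    for rel in ("top.txt", "a/one.txt", "a/b/two.txt"):
        (root / rel).write_text("x\n")

    # '*/*.txt' looks exactly one directory level below root.
    one_level = sorted(p.relative_to(root).as_posix() for p in root.glob("*/*.txt"))
    # rglob('*.txt') looks in root itself and at every level below it.
    all_levels = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.txt"))

print(one_level)   # ['a/one.txt']
print(all_levels)  # ['a/b/two.txt', 'a/one.txt', 'top.txt']
```

Both calls are consumed inside the with block because the iterators are lazy and the temporary directory disappears afterwards.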


Answer 2

There’s a lot of confusion on this topic. Let me see if I can clarify it (Python 3.7):

  1. glob.glob('*.txt') : matches all files ending in ‘.txt’ in the current directory
  2. glob.glob('*/*.txt') : matches all files ending in ‘.txt’ in the immediate subdirectories only, but not in the current directory
  3. glob.glob('**/*.txt') : same as 2 (without recursive=True, ** behaves like *)
  4. glob.glob('*.txt',recursive=True) : same as 1
  5. glob.glob('*/*.txt',recursive=True) : same as 2
  6. glob.glob('**/*.txt',recursive=True) : matches all files ending in ‘.txt’ in the current directory and in all subdirectories

So whenever you use **, specify recursive=True.
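
The list can be checked with a short script. This is a sketch against a temporary three-level tree; the matches helper and the file names are assumptions made for the demonstration:

```python
import glob
import os
import tempfile


def matches(root, pattern, recursive=False):
    """Run glob under `root` and return sorted hits relative to it, using '/'."""
    hits = glob.glob(os.path.join(root, pattern), recursive=recursive)
    return sorted(os.path.relpath(h, root).replace(os.sep, "/") for h in hits)


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    for rel in ("top.txt", "a/one.txt", "a/b/two.txt"):
        open(os.path.join(root, *rel.split("/")), "w").close()

    current_only = matches(root, "*.txt")
    one_level = matches(root, "*/*.txt")
    star_star_plain = matches(root, "**/*.txt")               # no recursive=True
    star_star_recursive = matches(root, "**/*.txt", recursive=True)

print(current_only)         # ['top.txt']
print(one_level)            # ['a/one.txt']
print(star_star_plain)      # ['a/one.txt']
print(star_star_recursive)  # ['a/b/two.txt', 'a/one.txt', 'top.txt']
```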


Answer 3

The glob2 package supports wildcards and is reasonably fast:

import timeit

code = '''
import glob2
glob2.glob("files/*/**")
'''
timeit.timeit(code, number=1)

On my laptop it takes approximately 2 seconds to match >60,000 file paths.


Answer 4

You can use Formic with Python 2.6

import formic
fileset = formic.FileSet(include="**/*.txt", directory="C:/Users/sam/Desktop/")

Disclosure – I am the author of this package.


Answer 5

Here is an adapted version that enables glob.glob-like functionality without using glob2:

import fnmatch
import os

def find_files(directory, pattern='*'):
    if not os.path.exists(directory):
        raise ValueError("Directory not found {}".format(directory))

    matches = []
    for root, dirnames, filenames in os.walk(directory):
        for filename in filenames:
            full_path = os.path.join(root, filename)
            if fnmatch.filter([full_path], pattern):
                matches.append(os.path.join(root, filename))
    return matches

So if you have the following dir structure

tests/files
├── a0
│   ├── a0.txt
│   ├── a0.yaml
│   └── b0
│       ├── b0.yaml
│       └── b00.yaml
└── a1

You can do something like this

files = utils.find_files('tests/files','**/b0/b*.yaml')
> ['tests/files/a0/b0/b0.yaml', 'tests/files/a0/b0/b00.yaml']

In other words, the fnmatch pattern is matched against the whole path rather than against the file name only.
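
The reason this works is that fnmatch has no special handling for path separators: its * also matches '/'. A short sketch of that behaviour, using paths from the example tree as plain strings (nothing is read from disk):

```python
import fnmatch

nested = "tests/files/a0/b0/b0.yaml"
shallow = "tests/files/a0/a0.yaml"

# '*' in fnmatch also matches '/', so the pattern can span directory levels.
nested_hit = fnmatch.fnmatch(nested, "**/b0/b*.yaml")
# This path has no '/b0/b' part, so the same pattern rejects it.
shallow_hit = fnmatch.fnmatch(shallow, "**/b0/b*.yaml")
# The pattern must match the whole path, not just the file name.
name_only_hit = fnmatch.fnmatch(nested, "b*.yaml")

print(nested_hit, shallow_hit, name_only_hit)  # True False False
```

One consequence is that a pattern such as '*.yaml' matches at every depth when it is applied to full paths.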


Answer 6

configfiles = glob.glob('C:/Users/sam/Desktop/**/*.txt')

This does not work in all cases (without recursive=True, ** only matches a single directory level). Use glob2 instead:

configfiles = glob2.glob('C:/Users/sam/Desktop/**/*.txt')

Answer 7

If you can install the glob2 package…

import glob2
filenames = glob2.glob("C:\\top_directory\\**\\*.ext")  # Where ext is a specific file extension
folders = glob2.glob("C:\\top_directory\\**\\")

All filenames and folders:

all_ff = glob2.glob("C:\\top_directory\\**\\**")  

Answer 8

If you’re running Python 3.4+, you can use the pathlib module. The Path.glob() method supports the ** pattern, which means “this directory and all subdirectories, recursively”. It returns a generator yielding Path objects for all matching files.

from pathlib import Path
configfiles = Path("C:/Users/sam/Desktop/file1/").glob("**/*.txt")

Answer 9

As pointed out by Martijn, glob can only do this through the ** operator introduced in Python 3.5. Since the OP explicitly asked for the glob module, the following returns a lazily evaluated iterator that behaves similarly:

import os, glob, itertools

configfiles = itertools.chain.from_iterable(glob.iglob(os.path.join(root,'*.txt'))
                         for root, dirs, files in os.walk('C:/Users/sam/Desktop/file1/'))

Note that you can only iterate once over configfiles in this approach though. If you require a real list of configfiles that can be used in multiple operations you would have to create this explicitly by using list(configfiles).
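
The single-pass behaviour is easy to see in a small experiment. This sketch builds the same kind of chained iterator over a temporary directory (the file names are invented) and iterates it twice:

```python
import glob
import itertools
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("top.txt", os.path.join("sub", "nested.txt")):
        open(os.path.join(root, rel), "w").close()

    lazy = itertools.chain.from_iterable(
        glob.iglob(os.path.join(dirpath, "*.txt"))
        for dirpath, _dirs, _files in os.walk(root)
    )

    # The first pass consumes the iterator; the second pass finds nothing left.
    first_pass = sorted(os.path.basename(p) for p in lazy)
    second_pass = list(lazy)

print(first_pass)   # ['nested.txt', 'top.txt']
print(second_pass)  # []
```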


Answer 10

The rglob method recurses all the way down to the deepest sub-level of your directory structure. If you only want to go one level deep, do not use it.

I realize the OP was talking about using glob.glob. I believe this answers the intent, however, which is to search all subfolders recursively.

rglob recently gave us a 100x speed-up in a data processing algorithm that had relied on the folder structure as a fixed assumption for the order in which data was read. With rglob we were able to scan once through all files at or below a specified parent directory, save their names to a list (over a million files), and then use that list to decide which files to open at any later point based on the file naming conventions alone, regardless of which folder they were in.
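
A rough sketch of that scan-once-then-index idea follows. The directory layout and the "run prefix before the first underscore" naming convention are hypothetical and only serve to illustrate the approach, not the original project's code:

```python
import tempfile
from collections import defaultdict
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ("2020/a/run1_temp.csv", "2020/b/run2_temp.csv", "2021/run1_pressure.csv"):
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("value\n")

    # Single scan of everything at or below root.
    all_files = [p for p in root.rglob("*") if p.is_file()]

    # Index by the (hypothetical) run prefix in the file name, ignoring folders.
    by_run = defaultdict(list)
    for p in all_files:
        by_run[p.name.split("_")[0]].append(p.relative_to(root).as_posix())

run1 = sorted(by_run["run1"])
print(run1)  # ['2020/a/run1_temp.csv', '2021/run1_pressure.csv']
```

Later lookups then go through by_run instead of touching the file system again.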