问题:递归子文件夹搜索并返回列表python中的文件
我正在编写一个脚本,以递归地遍历主文件夹中的子文件夹并根据某种文件类型构建一个列表。我的脚本有问题。当前设置如下
for root, subFolder, files in os.walk(PATH):
for item in files:
if item.endswith(".txt") :
fileNamePath = str(os.path.join(root,subFolder,item))
问题是subFolder变量正在拉入子文件夹列表,而不是ITEM文件所在的文件夹。我曾考虑过为子文件夹运行一个for循环,然后加入路径的第一部分,但我想出了ID双重检查以了解在此之前是否有人提出建议。谢谢你的帮助!
I am working on a script to recursively go through subfolders in a mainfolder and build a list off a certain file type. I am having an issue with the script. Its currently set as follows
for root, subFolder, files in os.walk(PATH):
for item in files:
if item.endswith(".txt") :
fileNamePath = str(os.path.join(root,subFolder,item))
the problem is that the subFolder variable is pulling in a list of subfolders rather than the folder that the ITEM file is located. I was thinking of running a for loop for the subfolder before and join the first part of the path but I figured Id double check to see if anyone has any suggestions before that. Thanks for your help!
回答 0
您应该使用dirpath
调用的root
。在dirnames
供给这样你就可以进行清理,如果有文件夹,你不想os.walk
递归到。
import os
result = [os.path.join(dp, f) for dp, dn, filenames in os.walk(PATH) for f in filenames if os.path.splitext(f)[1] == '.txt']
编辑:
经过最新一轮的投票之后,我发现这glob
是一个用于扩展选择的更好工具。
import os
from glob import glob
result = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.txt'))]
也是生成器版本
from itertools import chain
result = (chain.from_iterable(glob(os.path.join(x[0], '*.txt')) for x in os.walk('.')))
适用于Python的Edit2 3.4+
from pathlib import Path
result = list(Path(".").rglob("*.[tT][xX][tT]"))
You should be using the dirpath
which you call root
. The dirnames
are supplied so you can prune it if there are folders that you don’t wish os.walk
to recurse into.
import os
result = [os.path.join(dp, f) for dp, dn, filenames in os.walk(PATH) for f in filenames if os.path.splitext(f)[1] == '.txt']
Edit:
After the latest downvote, it occurred to me that glob
is a better tool for selecting by extension.
import os
from glob import glob
result = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.txt'))]
Also a generator version
from itertools import chain
result = (chain.from_iterable(glob(os.path.join(x[0], '*.txt')) for x in os.walk('.')))
Edit2 for Python 3.4+
from pathlib import Path
result = list(Path(".").rglob("*.[tT][xX][tT]"))
回答 1
在Python 3.5中进行了更改:支持使用“ **”的递归glob。
glob.glob()
有一个新的递归参数。
如果要获取下面的每个.txt
文件my_path
(递归包括子目录):
import glob
files = glob.glob(my_path + '/**/*.txt', recursive=True)
# my_path/ the dir
# **/ every file and dir under my_path
# *.txt every file that ends with '.txt'
如果需要迭代器,可以使用iglob作为替代方案:
for file in glob.iglob(my_path, recursive=False):
# ...
Changed in Python 3.5: Support for recursive globs using “**”.
glob.glob()
got a new recursive parameter.
If you want to get every .txt
file under my_path
(recursively including subdirs):
import glob
files = glob.glob(my_path + '/**/*.txt', recursive=True)
# my_path/ the dir
# **/ every file and dir under my_path
# *.txt every file that ends with '.txt'
If you need an iterator you can use iglob as an alternative:
for file in glob.iglob(my_path, recursive=False):
# ...
回答 2
我将把John La Rooy的列表理解理解为nest for的表达,以防万一其他人难以理解它。
result = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.txt'))]
应等效于:
import glob
result = []
for x in os.walk(PATH):
for y in glob.glob(os.path.join(x[0], '*.txt')):
result.append(y)
这是列表理解和os.walk和glob.glob函数的文档。
I will translate John La Rooy’s list comprehension to nested for’s, just in case anyone else has trouble understanding it.
result = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.txt'))]
Should be equivalent to:
import glob
result = []
for x in os.walk(PATH):
for y in glob.glob(os.path.join(x[0], '*.txt')):
result.append(y)
Here’s the documentation for list comprehension and the functions os.walk and glob.glob.
回答 3
这似乎是我能想到的最快的解决方案,并且比任何解决方案都快os.walk
而且要快得多glob
。
- 它也将免费为您提供所有嵌套子文件夹的列表。
- 您可以搜索几个不同的扩展名。
- 您也可以选择通过改变返回的文件不管是完整路径或只是名称
f.path
到f.name
(不改变它的子文件夹!)。
Args :dir: str, ext: list
。
函数返回两个列表:subfolders, files
。
有关详细的速度分析,请参见下文。
def run_fast_scandir(dir, ext): # dir: str, ext: list
subfolders, files = [], []
for f in os.scandir(dir):
if f.is_dir():
subfolders.append(f.path)
if f.is_file():
if os.path.splitext(f.name)[1].lower() in ext:
files.append(f.path)
for dir in list(subfolders):
sf, f = run_fast_scandir(dir, ext)
subfolders.extend(sf)
files.extend(f)
return subfolders, files
subfolders, files = run_fast_scandir(folder, [".jpg"])
速度分析
各种方法来获取所有子文件夹和主文件夹中具有特定文件扩展名的所有文件。
tl; dr:
– fast_scandir
显然是获胜的,并且是除os.walk之外所有其他解决方案的两倍。
– os.walk
落后第二名。
-使用glob
会大大减慢该过程。
-没有任何结果使用自然排序。这意味着将对结果进行如下排序:1、10、2。要进行自然排序(1、2、10),请查看https://stackoverflow.com/a/48030307/2441026
结果:
fast_scandir took 499 ms. Found files: 16596. Found subfolders: 439
os.walk took 589 ms. Found files: 16596
find_files took 919 ms. Found files: 16596
glob.iglob took 998 ms. Found files: 16596
glob.glob took 1002 ms. Found files: 16596
pathlib.rglob took 1041 ms. Found files: 16596
os.walk-glob took 1043 ms. Found files: 16596
测试使用W7x64,Python 3.8.1和20次运行完成。439个(部分嵌套的)子文件夹中的16596个文件。
find_files
来自https://stackoverflow.com/a/45646357/2441026,可让您搜索多个扩展名。
fast_scandir
由我自己编写,并且还将返回子文件夹列表。您可以给它提供扩展名列表以进行搜索(我测试了一个列表,其中一个条目很简单,if ... == ".jpg"
并且没有显着差异)。
# -*- coding: utf-8 -*-
# Python 3
import time
import os
from glob import glob, iglob
from pathlib import Path
directory = r"<folder>"
RUNS = 20
def run_os_walk():
a = time.time_ns()
for i in range(RUNS):
fu = [os.path.join(dp, f) for dp, dn, filenames in os.walk(directory) for f in filenames if
os.path.splitext(f)[1].lower() == '.jpg']
print(f"os.walk\t\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")
def run_os_walk_glob():
a = time.time_ns()
for i in range(RUNS):
fu = [y for x in os.walk(directory) for y in glob(os.path.join(x[0], '*.jpg'))]
print(f"os.walk-glob\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")
def run_glob():
a = time.time_ns()
for i in range(RUNS):
fu = glob(os.path.join(directory, '**', '*.jpg'), recursive=True)
print(f"glob.glob\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")
def run_iglob():
a = time.time_ns()
for i in range(RUNS):
fu = list(iglob(os.path.join(directory, '**', '*.jpg'), recursive=True))
print(f"glob.iglob\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")
def run_pathlib_rglob():
a = time.time_ns()
for i in range(RUNS):
fu = list(Path(directory).rglob("*.jpg"))
print(f"pathlib.rglob\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")
def find_files(files, dirs=[], extensions=[]):
# https://stackoverflow.com/a/45646357/2441026
new_dirs = []
for d in dirs:
try:
new_dirs += [ os.path.join(d, f) for f in os.listdir(d) ]
except OSError:
if os.path.splitext(d)[1].lower() in extensions:
files.append(d)
if new_dirs:
find_files(files, new_dirs, extensions )
else:
return
def run_fast_scandir(dir, ext): # dir: str, ext: list
# https://stackoverflow.com/a/59803793/2441026
subfolders, files = [], []
for f in os.scandir(dir):
if f.is_dir():
subfolders.append(f.path)
if f.is_file():
if os.path.splitext(f.name)[1].lower() in ext:
files.append(f.path)
for dir in list(subfolders):
sf, f = run_fast_scandir(dir, ext)
subfolders.extend(sf)
files.extend(f)
return subfolders, files
if __name__ == '__main__':
run_os_walk()
run_os_walk_glob()
run_glob()
run_iglob()
run_pathlib_rglob()
a = time.time_ns()
for i in range(RUNS):
files = []
find_files(files, dirs=[directory], extensions=[".jpg"])
print(f"find_files\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(files)}")
a = time.time_ns()
for i in range(RUNS):
subf, files = run_fast_scandir(directory, [".jpg"])
print(f"fast_scandir\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(files)}. Found subfolders: {len(subf)}")
This seems to be the fastest solution I could come up with, and is faster than os.walk
and a lot faster than any glob
solution.
- It will also give you a list of all nested subfolders at basically no cost.
- You can search for several different extensions.
- You can also choose to return either full paths or just the names for the files by changing
f.path
to f.name
(do not change it for subfolders!).
Args: dir: str, ext: list
.
Function returns two lists: subfolders, files
.
See below for a detailed speed anaylsis.
def run_fast_scandir(dir, ext): # dir: str, ext: list
subfolders, files = [], []
for f in os.scandir(dir):
if f.is_dir():
subfolders.append(f.path)
if f.is_file():
if os.path.splitext(f.name)[1].lower() in ext:
files.append(f.path)
for dir in list(subfolders):
sf, f = run_fast_scandir(dir, ext)
subfolders.extend(sf)
files.extend(f)
return subfolders, files
subfolders, files = run_fast_scandir(folder, [".jpg"])
Speed analysis
for various methods to get all files with a specific file extension inside all subfolders and the main folder.
tl;dr:
– fast_scandir
clearly wins and is twice as fast as all other solutions, except os.walk.
– os.walk
is second place slighly slower.
– using glob
will greatly slow down the process.
– None of the results use natural sorting. This means results will be sorted like this: 1, 10, 2. To get natural sorting (1, 2, 10), please have a look at https://stackoverflow.com/a/48030307/2441026
Results:
fast_scandir took 499 ms. Found files: 16596. Found subfolders: 439
os.walk took 589 ms. Found files: 16596
find_files took 919 ms. Found files: 16596
glob.iglob took 998 ms. Found files: 16596
glob.glob took 1002 ms. Found files: 16596
pathlib.rglob took 1041 ms. Found files: 16596
os.walk-glob took 1043 ms. Found files: 16596
Tests were done with W7x64, Python 3.8.1, 20 runs. 16596 files in 439 (partially nested) subfolders.
find_files
is from https://stackoverflow.com/a/45646357/2441026 and lets you search for several extensions.
fast_scandir
was written by myself and will also return a list of subfolders. You can give it a list of extensions to search for (I tested a list with one entry to a simple if ... == ".jpg"
and there was no significant difference).
# -*- coding: utf-8 -*-
# Python 3
import time
import os
from glob import glob, iglob
from pathlib import Path
directory = r"<folder>"
RUNS = 20
def run_os_walk():
a = time.time_ns()
for i in range(RUNS):
fu = [os.path.join(dp, f) for dp, dn, filenames in os.walk(directory) for f in filenames if
os.path.splitext(f)[1].lower() == '.jpg']
print(f"os.walk\t\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")
def run_os_walk_glob():
a = time.time_ns()
for i in range(RUNS):
fu = [y for x in os.walk(directory) for y in glob(os.path.join(x[0], '*.jpg'))]
print(f"os.walk-glob\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")
def run_glob():
a = time.time_ns()
for i in range(RUNS):
fu = glob(os.path.join(directory, '**', '*.jpg'), recursive=True)
print(f"glob.glob\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")
def run_iglob():
a = time.time_ns()
for i in range(RUNS):
fu = list(iglob(os.path.join(directory, '**', '*.jpg'), recursive=True))
print(f"glob.iglob\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")
def run_pathlib_rglob():
a = time.time_ns()
for i in range(RUNS):
fu = list(Path(directory).rglob("*.jpg"))
print(f"pathlib.rglob\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")
def find_files(files, dirs=[], extensions=[]):
# https://stackoverflow.com/a/45646357/2441026
new_dirs = []
for d in dirs:
try:
new_dirs += [ os.path.join(d, f) for f in os.listdir(d) ]
except OSError:
if os.path.splitext(d)[1].lower() in extensions:
files.append(d)
if new_dirs:
find_files(files, new_dirs, extensions )
else:
return
def run_fast_scandir(dir, ext): # dir: str, ext: list
# https://stackoverflow.com/a/59803793/2441026
subfolders, files = [], []
for f in os.scandir(dir):
if f.is_dir():
subfolders.append(f.path)
if f.is_file():
if os.path.splitext(f.name)[1].lower() in ext:
files.append(f.path)
for dir in list(subfolders):
sf, f = run_fast_scandir(dir, ext)
subfolders.extend(sf)
files.extend(f)
return subfolders, files
if __name__ == '__main__':
run_os_walk()
run_os_walk_glob()
run_glob()
run_iglob()
run_pathlib_rglob()
a = time.time_ns()
for i in range(RUNS):
files = []
find_files(files, dirs=[directory], extensions=[".jpg"])
print(f"find_files\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(files)}")
a = time.time_ns()
for i in range(RUNS):
subf, files = run_fast_scandir(directory, [".jpg"])
print(f"fast_scandir\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(files)}. Found subfolders: {len(subf)}")
回答 4
新的pathlib
库将其简化为一行:
from pathlib import Path
result = list(Path(PATH).glob('**/*.txt'))
您还可以使用生成器版本:
from pathlib import Path
for file in Path(PATH).glob('**/*.txt'):
pass
这将返回Path
对象,您几乎可以将其用于任何对象,或者通过获取字符串形式的文件名file.name
。
The new pathlib
library simplifies this to one line:
from pathlib import Path
result = list(Path(PATH).glob('**/*.txt'))
You can also use the generator version:
from pathlib import Path
for file in Path(PATH).glob('**/*.txt'):
pass
This returns Path
objects, which you can use for pretty much anything, or get the file name as a string by file.name
.
回答 5
它不是最Python的答案,但我会把它放在这里很有趣,因为这是递归中的一个很好的类
def find_files( files, dirs=[], extensions=[]):
new_dirs = []
for d in dirs:
try:
new_dirs += [ os.path.join(d, f) for f in os.listdir(d) ]
except OSError:
if os.path.splitext(d)[1] in extensions:
files.append(d)
if new_dirs:
find_files(files, new_dirs, extensions )
else:
return
在我的机器上,我有两个文件夹,root
并且root2
mender@multivax ]ls -R root root2
root:
temp1 temp2
root/temp1:
temp1.1 temp1.2
root/temp1/temp1.1:
f1.mid
root/temp1/temp1.2:
f.mi f.mid
root/temp2:
tmp.mid
root2:
dummie.txt temp3
root2/temp3:
song.mid
比方说,我想找到所有.txt
和所有.mid
文件在任何这些目录中,然后我可以做
files = []
find_files( files, dirs=['root','root2'], extensions=['.mid','.txt'] )
print(files)
#['root2/dummie.txt',
# 'root/temp2/tmp.mid',
# 'root2/temp3/song.mid',
# 'root/temp1/temp1.1/f1.mid',
# 'root/temp1/temp1.2/f.mid']
Its not the most pythonic answer, but I’ll put it here for fun because it’s a neat lesson in recursion
def find_files( files, dirs=[], extensions=[]):
new_dirs = []
for d in dirs:
try:
new_dirs += [ os.path.join(d, f) for f in os.listdir(d) ]
except OSError:
if os.path.splitext(d)[1] in extensions:
files.append(d)
if new_dirs:
find_files(files, new_dirs, extensions )
else:
return
On my machine I have two folders, root
and root2
mender@multivax ]ls -R root root2
root:
temp1 temp2
root/temp1:
temp1.1 temp1.2
root/temp1/temp1.1:
f1.mid
root/temp1/temp1.2:
f.mi f.mid
root/temp2:
tmp.mid
root2:
dummie.txt temp3
root2/temp3:
song.mid
Lets say I want to find all .txt
and all .mid
files in either of these directories, then I can just do
files = []
find_files( files, dirs=['root','root2'], extensions=['.mid','.txt'] )
print(files)
#['root2/dummie.txt',
# 'root/temp2/tmp.mid',
# 'root2/temp3/song.mid',
# 'root/temp1/temp1.1/f1.mid',
# 'root/temp1/temp1.2/f.mid']
回答 6
递归是Python 3.5中的新增功能,因此它不适用于Python 2.7。这是使用r
字符串的示例,因此您只需按Win,Lin,…上的路径提供路径即可。
import glob
mypath=r"C:\Users\dj\Desktop\nba"
files = glob.glob(mypath + r'\**\*.py', recursive=True)
# print(files) # as list
for f in files:
print(f) # nice looking single line per file
注意:它将列出所有文件,无论文件应多深。
Recursive is new in Python 3.5, so it won’t work on Python 2.7. Here is the example that uses r
strings so you just need to provide the path as is on either Win, Lin, …
import glob
mypath=r"C:\Users\dj\Desktop\nba"
files = glob.glob(mypath + r'\**\*.py', recursive=True)
# print(files) # as list
for f in files:
print(f) # nice looking single line per file
Note: It will list all files, no matter how deep it should go.
回答 7
您可以通过这种方式返回绝对路径文件列表。
def list_files_recursive(path):
"""
Function that receives as a parameter a directory path
:return list_: File List and Its Absolute Paths
"""
import os
files = []
# r = root, d = directories, f = files
for r, d, f in os.walk(path):
for file in f:
files.append(os.path.join(r, file))
lst = [file for file in files]
return lst
if __name__ == '__main__':
result = list_files_recursive('/tmp')
print(result)
You can do it this way to return you a list of absolute path files.
def list_files_recursive(path):
"""
Function that receives as a parameter a directory path
:return list_: File List and Its Absolute Paths
"""
import os
files = []
# r = root, d = directories, f = files
for r, d, f in os.walk(path):
for file in f:
files.append(os.path.join(r, file))
lst = [file for file in files]
return lst
if __name__ == '__main__':
result = list_files_recursive('/tmp')
print(result)
回答 8
如果您不介意安装其他灯库,则可以执行以下操作:
pip install plazy
用法:
import plazy
txt_filter = lambda x : True if x.endswith('.txt') else False
files = plazy.list_files(root='data', filter_func=txt_filter, is_include_root=True)
结果应如下所示:
['data/a.txt', 'data/b.txt', 'data/sub_dir/c.txt']
它同时适用于Python 2.7和Python 3。
GitHub:https : //github.com/kyzas/plazy#list-files
免责声明:我是的作者plazy
。
If you don’t mind installing an additional light library, you can do this:
pip install plazy
Usage:
import plazy
txt_filter = lambda x : True if x.endswith('.txt') else False
files = plazy.list_files(root='data', filter_func=txt_filter, is_include_root=True)
The result should look something like this:
['data/a.txt', 'data/b.txt', 'data/sub_dir/c.txt']
It works on both Python 2.7 and Python 3.
Github: https://github.com/kyzas/plazy#list-files
Disclaimer: I’m an author of plazy
.
回答 9
此功能将递归地仅将文件放入列表中。希望你能。
import os
def ls_files(dir):
files = list()
for item in os.listdir(dir):
abspath = os.path.join(dir, item)
try:
if os.path.isdir(abspath):
files = files + ls_files(abspath)
else:
files.append(abspath)
except FileNotFoundError as err:
print('invalid directory\n', 'Error: ', err)
return files
This function will recursively put only files into a list.
import os
def ls_files(dir):
files = list()
for item in os.listdir(dir):
abspath = os.path.join(dir, item)
try:
if os.path.isdir(abspath):
files = files + ls_files(abspath)
else:
files.append(abspath)
except FileNotFoundError as err:
print('invalid directory\n', 'Error: ', err)
return files
回答 10
您最初的解决方案几乎是正确的,但是变量“ root”是在递归路径周围动态更新的。os.walk()是一个递归生成器。每个元组集(根,subFolder,文件)都是按照设置方式针对特定根的。
即
root = 'C:\\'
subFolder = ['Users', 'ProgramFiles', 'ProgramFiles (x86)', 'Windows', ...]
files = ['foo1.txt', 'foo2.txt', 'foo3.txt', ...]
root = 'C:\\Users\\'
subFolder = ['UserAccount1', 'UserAccount2', ...]
files = ['bar1.txt', 'bar2.txt', 'bar3.txt', ...]
...
我对您的代码做了些微调整,以打印完整列表。
import os
for root, subFolder, files in os.walk(PATH):
for item in files:
if item.endswith(".txt") :
fileNamePath = str(os.path.join(root,item))
print(fileNamePath)
希望这可以帮助!
Your original solution was very nearly correct, but the variable “root” is dynamically updated as it recursively paths around. os.walk() is a recursive generator. Each tuple set of (root, subFolder, files) is for a specific root the way you have it setup.
i.e.
root = 'C:\\'
subFolder = ['Users', 'ProgramFiles', 'ProgramFiles (x86)', 'Windows', ...]
files = ['foo1.txt', 'foo2.txt', 'foo3.txt', ...]
root = 'C:\\Users\\'
subFolder = ['UserAccount1', 'UserAccount2', ...]
files = ['bar1.txt', 'bar2.txt', 'bar3.txt', ...]
...
I made a slight tweak to your code to print a full list.
import os
for root, subFolder, files in os.walk(PATH):
for item in files:
if item.endswith(".txt") :
fileNamePath = str(os.path.join(root,item))
print(fileNamePath)
Hope this helps!