Question: Python recursive folder read
I have a C++/Obj-C background and I am just discovering Python (I have been writing it for about an hour).
I am writing a script to recursively read the contents of text files in a folder structure.
The problem is that the code I have written only works one folder deep. I can see why in the code (see # hardcoded path), I just don’t know how to move forward with Python, since I am brand new to it.
Python code:
import os
import sys

rootdir = sys.argv[1]

for root, subFolders, files in os.walk(rootdir):

    for folder in subFolders:
        outfileName = rootdir + "/" + folder + "/py-outfile.txt" # hardcoded path
        folderOut = open( outfileName, 'w' )
        print "outfileName is " + outfileName

        for file in files:
            filePath = rootdir + '/' + file
            f = open( filePath, 'r' )
            toWrite = f.read()
            print "Writing '" + toWrite + "' to" + filePath
            folderOut.write( toWrite )
            f.close()

        folderOut.close()
Answer 0
Make sure you understand the three return values of os.walk:
for root, subdirs, files in os.walk(rootdir):
has the following meaning:
root: Current path which is “walked through”
subdirs: Files in root of type directory
files: Files in root (not in subdirs) of type other than directory
And please use os.path.join instead of concatenating with a slash! Your problem is filePath = rootdir + '/' + file: you must concatenate the currently “walked” folder instead of the topmost folder. So that must be filePath = os.path.join(root, file). BTW, “file” is a builtin in Python 2, so you don’t normally use it as a variable name.
Another problem is your loops, which should look like this, for example:
import os
import sys

walk_dir = sys.argv[1]

print('walk_dir = ' + walk_dir)

# If your current working directory may change during script execution, it's recommended to
# immediately convert program arguments to an absolute path. Then the variable root below will
# be an absolute path as well. Example:
# walk_dir = os.path.abspath(walk_dir)
print('walk_dir (absolute) = ' + os.path.abspath(walk_dir))

for root, subdirs, files in os.walk(walk_dir):
    print('--\nroot = ' + root)
    list_file_path = os.path.join(root, 'my-directory-list.txt')
    print('list_file_path = ' + list_file_path)

    with open(list_file_path, 'wb') as list_file:
        for subdir in subdirs:
            print('\t- subdirectory ' + subdir)

        for filename in files:
            file_path = os.path.join(root, filename)

            print('\t- file %s (full path: %s)' % (filename, file_path))

            with open(file_path, 'rb') as f:
                f_content = f.read()
                list_file.write(('The file %s contains:\n' % filename).encode('utf-8'))
                list_file.write(f_content)
                list_file.write(b'\n')
If you didn’t know, the with statement for files is a shorthand:
with open('filename', 'rb') as f:
    dosomething()

# is effectively the same as

f = open('filename', 'rb')
try:
    dosomething()
finally:
    f.close()
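To see that equivalence in practice, here is a minimal sketch showing that the file is closed even when the body of the with block raises. The helper name read_then_fail, the temporary file and the deliberately raised ValueError are all illustrative assumptions and not part of the answer above.

```python
import os
import tempfile

def read_then_fail(path):
    """Open path in a with-statement, then raise inside the block (hypothetical helper)."""
    handle = None
    try:
        with open(path, 'rb') as f:
            handle = f  # keep a reference so the file object can be inspected afterwards
            f.read()
            raise ValueError('simulated failure inside the with block')
    except ValueError:
        pass  # the exception is only there to leave the block abnormally
    return handle

# Create a throwaway file to read from.
fd, tmp_path = tempfile.mkstemp()
os.close(fd)

closed_handle = read_then_fail(tmp_path)
print(closed_handle.closed)  # the with-statement closed the file despite the exception
os.remove(tmp_path)
```

This is the same guarantee the try/finally version gives, without having to write the close call yourself.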
Answer 1
If you are using Python 3.5 or above, you can get this done in 1 line.
import glob

# root_dir needs a trailing slash (e.g. /root/dir/)
for filename in glob.iglob(root_dir + '**/*.txt', recursive=True):
    print(filename)
As mentioned in the documentation:
If recursive is true, the pattern ‘**’ will match any files and zero or more directories and subdirectories.
If you want every file, you can use
import glob

for filename in glob.iglob(root_dir + '**/**', recursive=True):
    print(filename)
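If you would rather not depend on root_dir ending in a slash, one option is to build the pattern with os.path.join. The following is a minimal sketch under that assumption; the function name find_txt_files and the throwaway directory tree are made up for the demonstration.

```python
import glob
import os
import tempfile

def find_txt_files(root_dir):
    """Return all .txt paths under root_dir, at any depth (needs Python 3.5+)."""
    # os.path.join inserts the separator, so root_dir needs no trailing slash.
    pattern = os.path.join(root_dir, '**', '*.txt')
    return sorted(glob.glob(pattern, recursive=True))

# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'a', 'b'))
    for parts in [('top.txt',), ('a', 'mid.txt'), ('a', 'b', 'deep.txt'), ('a', 'skip.md')]:
        open(os.path.join(tmp, *parts), 'w').close()
    found = [os.path.relpath(p, tmp) for p in find_txt_files(tmp)]

print(found)
```

Because ** also matches zero directories, the .txt file sitting directly in the root folder is found along with the nested ones.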
Answer 2
Agree with Dave Webb, os.walk will yield an item for each directory in the tree. Fact is, you just don’t have to care about subFolders.
Code like this should work:
import os
import sys

rootdir = sys.argv[1]

for folder, subs, files in os.walk(rootdir):
    with open(os.path.join(folder, 'python-outfile.txt'), 'w') as dest:
        for filename in files:
            with open(os.path.join(folder, filename), 'r') as src:
                dest.write(src.read())
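One thing to watch when running such a script a second time: the output file from the first run is now listed in files, so it would be opened for reading while it is being rewritten. A minimal sketch of one way to guard against that follows. The function name concatenate_per_folder, the sorted() call (added only to make the order predictable) and the demo tree are assumptions for illustration, and it assumes every file in the tree is readable as text.

```python
import os
import tempfile

OUTFILE_NAME = 'python-outfile.txt'  # same output name as in the answer above

def concatenate_per_folder(rootdir):
    """In every folder under rootdir, write that folder's files into OUTFILE_NAME."""
    for folder, subs, files in os.walk(rootdir):
        with open(os.path.join(folder, OUTFILE_NAME), 'w') as dest:
            for filename in sorted(files):
                if filename == OUTFILE_NAME:
                    continue  # left over from an earlier run; don't read our own output
                with open(os.path.join(folder, filename), 'r') as src:
                    dest.write(src.read())

# Demonstration on a throwaway tree, run twice on purpose.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'sub'))
    for parts, text in [(('one.txt',), '1'), (('two.txt',), '2'), (('sub', 'three.txt'), '3')]:
        with open(os.path.join(tmp, *parts), 'w') as f:
            f.write(text)
    concatenate_per_folder(tmp)
    concatenate_per_folder(tmp)  # second run: the output files already exist now
    with open(os.path.join(tmp, OUTFILE_NAME)) as f:
        top_result = f.read()
    with open(os.path.join(tmp, 'sub', OUTFILE_NAME)) as f:
        sub_result = f.read()

print(repr(top_result), repr(sub_result))
```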
Answer 3
TL;DR: This is the equivalent of find -type f, to go over all files in all folders below and including the current one:
for currentpath, folders, files in os.walk('.'):
    for file in files:
        print(os.path.join(currentpath, file))
As already mentioned in other answers, os.walk() is the answer, but it could be explained better. It’s quite simple! Let’s walk through this tree:
docs/
└── doc1.odt
pics/
todo.txt
With this code:
for currentpath, folders, files in os.walk('.'):
    print(currentpath)
The currentpath is the current folder it is looking at. This will output:
.
./docs
./pics
So it loops three times, because there are three folders: the current one, docs, and pics. In every loop, it fills the variables folders and files with the folders and files found directly inside currentpath. Let’s show them:
for currentpath, folders, files in os.walk('.'):
    print(currentpath, folders, files)
This shows us:
# currentpath  folders           files
.              ['pics', 'docs']  ['todo.txt']
./pics         []                []
./docs         []                ['doc1.odt']
So in the first line, we see that we are in folder ., that it contains two folders, namely pics and docs, and that there is one file, namely todo.txt. You don’t have to do anything to recurse into those folders, because as you see, it recurses automatically and gives you the files in each subfolder, and in any subfolders of those (though we don’t have any in the example).
If you just want to loop through all files, the equivalent of find -type f, you can do this:
for currentpath, folders, files in os.walk('.'):
    for file in files:
        print(os.path.join(currentpath, file))
This outputs:
./todo.txt
./docs/doc1.odt
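The folders list is also how you can steer the walk. With the default top-down order, os.walk only descends into the names still present in that list, so removing a name in place skips that subtree. The sketch below assumes you want to skip a folder called pics; the function name walk_files, the skip_dirs parameter and the extra file pics/photo.jpg are hypothetical additions to the example tree.

```python
import os
import tempfile

def walk_files(top, skip_dirs=()):
    """Yield the path of every file under top, without descending into skip_dirs."""
    for currentpath, folders, files in os.walk(top):
        # Slice assignment changes the list in place, which is what os.walk looks at.
        folders[:] = [d for d in folders if d not in skip_dirs]
        for name in files:
            yield os.path.join(currentpath, name)

# Rebuild the example tree in a throwaway directory (plus one file inside pics/).
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'docs'))
    os.makedirs(os.path.join(tmp, 'pics'))
    for parts in [('todo.txt',), ('docs', 'doc1.odt'), ('pics', 'photo.jpg')]:
        open(os.path.join(tmp, *parts), 'w').close()
    kept = sorted(os.path.relpath(p, tmp) for p in walk_files(tmp, skip_dirs=('pics',)))

print(kept)
```

Rebinding the name (folders = [...]) would not have this effect, because os.walk keeps its own reference to the original list.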
Answer 4
The pathlib library is really great for working with files. You can do a recursive glob on a Path object like so.
from pathlib import Path

for elem in Path('/path/to/my/files').rglob('*.*'):
    print(elem)
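Two details are worth knowing here: rglob also yields directories, and the pattern '*.*' only matches names that contain a dot. A minimal sketch illustrating both follows; the function name read_all_text, the UTF-8 assumption and the sample files are invented for the demonstration.

```python
from pathlib import Path
import tempfile

def read_all_text(top, pattern='*'):
    """Map each file under top that matches pattern to its text content."""
    contents = {}
    for elem in Path(top).rglob(pattern):
        if elem.is_file():  # rglob yields directories too, so filter them out
            key = elem.relative_to(top).as_posix()
            contents[key] = elem.read_text(encoding='utf-8')  # assumes UTF-8 text files
    return contents

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / 'docs').mkdir()
    (base / 'docs' / 'notes.txt').write_text('hello', encoding='utf-8')
    (base / 'README').write_text('no dot in this name', encoding='utf-8')
    everything = read_all_text(tmp)          # '*' also finds README
    dotted_only = read_all_text(tmp, '*.*')  # '*.*' requires a dot in the name

print(sorted(everything))
print(sorted(dotted_only))
```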
Answer 5
If you want a flat list of all paths under a given dir (like find . in the shell):
import os

files = [
    os.path.join(parent, name)
    for (parent, subdirs, files) in os.walk(YOUR_DIRECTORY)
    for name in files + subdirs
]
To only include full paths to files under the base dir, leave out + subdirs.
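For large trees you may prefer not to build the whole list at once. Here is a minimal sketch of the same idea as a generator; the name iter_paths and the include_dirs flag are illustrative assumptions. Note that, like the list comprehension above and unlike find ., it does not yield the starting directory itself.

```python
import os
import tempfile

def iter_paths(directory, include_dirs=True):
    """Lazily yield the paths under directory; directories are optional."""
    for parent, subdirs, files in os.walk(directory):
        names = files + subdirs if include_dirs else files
        for name in names:
            yield os.path.join(parent, name)

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'sub'))
    open(os.path.join(tmp, 'a.txt'), 'w').close()
    open(os.path.join(tmp, 'sub', 'b.txt'), 'w').close()
    all_paths = sorted(os.path.relpath(p, tmp) for p in iter_paths(tmp))
    files_only = sorted(os.path.relpath(p, tmp) for p in iter_paths(tmp, include_dirs=False))

print(all_paths)
print(files_only)
```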
Answer 6
import glob
import os

root_dir = <root_dir_here>  # placeholder: your root directory, with a trailing slash

for filename in glob.iglob(root_dir + '**/**', recursive=True):
    if os.path.isfile(filename):
        with open(filename,'r') as file:
            print(file.read())
**/** is used to get all paths recursively, including directories.
if os.path.isfile(filename) checks whether filename is a file or a directory; if it is a file, we can read it. Here I am printing the file’s contents.
Answer 7
I’ve found the following to be the easiest:
from glob import glob
import os

files = [f for f in glob('rootdir/**', recursive=True) if os.path.isfile(f)]
Using glob('some/path/**', recursive=True) gets all files, but also includes directory names. Adding the if os.path.isfile(f) condition filters this list to existing files only.
Answer 8
Use os.path.join() to construct your paths; it’s neater:
import os
import sys

rootdir = sys.argv[1]

for root, subFolders, files in os.walk(rootdir):
    for folder in subFolders:
        outfileName = os.path.join(root, folder, "py-outfile.txt")
        folderOut = open(outfileName, 'w')
        print("outfileName is " + outfileName)
        for file in files:
            filePath = os.path.join(root, file)
            # Read via a with block so the input file is closed again.
            with open(filePath) as f:
                toWrite = f.read()
            print("Writing '" + toWrite + "' to " + filePath)
            folderOut.write(toWrite)
        folderOut.close()
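As a small illustration of why os.path.join is neater than string concatenation, the sketch below joins several components in one call and shows that a trailing separator on the first part does not end up doubled. The folder names top and sub are made up for the example.

```python
import os

# Several components can be joined in one call; the separator matches the current OS.
outfile = os.path.join('top', 'sub', 'py-outfile.txt')

# A trailing separator on the first part does not produce a doubled separator.
same_outfile = os.path.join('top' + os.sep, 'sub', 'py-outfile.txt')

print(outfile)
print(outfile == same_outfile)
```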
Answer 9
os.walk does a recursive walk by default. For each dir, starting from the root, it yields a 3-tuple (dirpath, dirnames, filenames).
from os import walk
from os.path import splitext, join

def select_files(root, files):
    """
    simple logic here to filter out interesting files
    .py files in this example
    """

    selected_files = []

    for file in files:
        #do concatenation here to get full path
        full_path = join(root, file)
        ext = splitext(file)[1]

        if ext == ".py":
            selected_files.append(full_path)

    return selected_files

def build_recursive_dir_tree(path):
    """
    path    -    where to begin folder scan
    """
    selected_files = []

    for root, dirs, files in walk(path):
        selected_files += select_files(root, files)

    return selected_files
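If you need to select more than one kind of file, the standard fnmatch module can replace the extension comparison. This is a minimal sketch of that variation, not the answer's own method; the function name select_by_patterns, its patterns parameter and the sample tree are assumptions for illustration.

```python
import fnmatch
import os
import tempfile

def select_by_patterns(path, patterns=('*.py',)):
    """Return full paths of files under path whose name matches any of the patterns."""
    selected = []
    for root, dirs, files in os.walk(path):
        for name in files:
            # any() stops at the first matching pattern, so no file is added twice.
            if any(fnmatch.fnmatch(name, pattern) for pattern in patterns):
                selected.append(os.path.join(root, name))
    return selected

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'pkg'))
    for parts in [('main.py',), ('notes.txt',), ('pkg', 'util.py'), ('pkg', 'data.csv')]:
        open(os.path.join(tmp, *parts), 'w').close()
    py_files = sorted(os.path.relpath(p, tmp) for p in select_by_patterns(tmp))
    py_and_txt = sorted(
        os.path.relpath(p, tmp) for p in select_by_patterns(tmp, ('*.py', '*.txt'))
    )

print(py_files)
print(py_and_txt)
```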
Answer 10
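Try this:
import os
import sys

path = sys.argv[1]  # assumption: the start directory is passed as the first argument

for root, subdirs, files in os.walk(path):
    for file in os.listdir(root):
        filePath = os.path.join(root, file)
        if os.path.isdir(filePath):
            pass
        else:
            # The with block closes the file again after use.
            with open(filePath, 'r') as f:
                pass  # Do Stuff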
Answer 11
I think the problem is that you’re not processing the output of os.walk correctly.
Firstly, change:
filePath = rootdir + '/' + file
to:
filePath = root + '/' + file
rootdir is your fixed starting directory; root is a directory returned by os.walk.
Secondly, you don’t need to nest your file-processing loop inside the loop over subfolders, as it makes no sense to run it once for each subdirectory. root will be set to each subdirectory in turn. You don’t need to process the subdirectories by hand unless you want to do something with the directories themselves.
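The difference between rootdir and root can be checked directly. The sketch below builds both versions of the path for every file and tests whether each one exists. The function name compare_paths and the two-level sample tree are hypothetical and only serve the demonstration.

```python
import os
import tempfile

def compare_paths(rootdir):
    """For every file, report whether the rootdir-based and root-based paths exist."""
    pairs = []
    for root, subdirs, files in os.walk(rootdir):
        for name in files:
            from_rootdir = rootdir + '/' + name     # always points into the top folder
            from_root = os.path.join(root, name)    # points into the folder being walked
            pairs.append((os.path.exists(from_rootdir), os.path.exists(from_root)))
    return pairs

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'level1', 'level2'))
    open(os.path.join(tmp, 'level1', 'level2', 'deep.txt'), 'w').close()
    results = compare_paths(tmp)

print(results)
```

For a file two levels down, the path built from rootdir does not exist, while the one built from root does, which is why the original script only worked one folder deep.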