在python中查找文件

问题:在python中查找文件

我有一个文件,该文件可能位于每个用户计算机上的不同位置。有没有一种方法可以实现文件搜索?我可以传递文件名和目录树进行搜索的方式吗?

I have a file that may be in a different place on each user’s machine. Is there a way to implement a search for the file? A way that I can pass the file’s name and the directory tree to search in?


回答 0

os.walk是答案,它将找到第一个匹配项:

import os

def find(name, path):
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)

这将找到所有匹配项:

def find_all(name, path):
    result = []
    for root, dirs, files in os.walk(path):
        if name in files:
            result.append(os.path.join(root, name))
    return result

这将匹配一个模式:

import os, fnmatch
def find(pattern, path):
    result = []
    for root, dirs, files in os.walk(path):
        for name in files:
            if fnmatch.fnmatch(name, pattern):
                result.append(os.path.join(root, name))
    return result

find('*.txt', '/path/to/dir')

os.walk is the answer, this will find the first match:

import os

def find(name, path):
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)

And this will find all matches:

def find_all(name, path):
    result = []
    for root, dirs, files in os.walk(path):
        if name in files:
            result.append(os.path.join(root, name))
    return result

And this will match a pattern:

import os, fnmatch
def find(pattern, path):
    result = []
    for root, dirs, files in os.walk(path):
        for name in files:
            if fnmatch.fnmatch(name, pattern):
                result.append(os.path.join(root, name))
    return result

find('*.txt', '/path/to/dir')

回答 1

我使用的版本,os.walk并且在较大的目录上获取时间约为3.5秒。我尝试了两个随机解决方案,但并没有太大的改进,然后就做了:

paths = [line[2:] for line in subprocess.check_output("find . -iname '*.txt'", shell=True).splitlines()]

虽然它仅适用于POSIX,但我只有0.25秒。

因此,我相信完全有可能以与平台无关的方式对整个搜索进行优化,但这是我停止研究的地方。

I used a version of os.walk and on a larger directory got times around 3.5 sec. I tried two random solutions with no great improvement, then just did:

paths = [line[2:] for line in subprocess.check_output("find . -iname '*.txt'", shell=True).splitlines()]

While it’s POSIX-only, I got 0.25 sec.

From this, I believe it’s entirely possible to optimise whole searching a lot in a platform-independent way, but this is where I stopped the research.


回答 2

如果您在Ubuntu上使用Python,而只希望它在Ubuntu上运行,则使用这样的终端locate程序的方法要快得多。

import subprocess

def find_files(file_name):
    command = ['locate', file_name]

    output = subprocess.Popen(command, stdout=subprocess.PIPE).communicate()[0]
    output = output.decode()

    search_results = output.split('\n')

    return search_results

search_resultslist绝对文件路径中的一个。这比上述方法快10,000倍,而我做过一次搜索,快了约72,000倍。

If you are using Python on Ubuntu and you only want it to work on Ubuntu a substantially faster way is the use the terminal’s locate program like this.

import subprocess

def find_files(file_name):
    command = ['locate', file_name]

    output = subprocess.Popen(command, stdout=subprocess.PIPE).communicate()[0]
    output = output.decode()

    search_results = output.split('\n')

    return search_results

search_results is a list of the absolute file paths. This is 10,000’s of times faster than the methods above and for one search I’ve done it was ~72,000 times faster.


回答 3

在Python 3.4或更高版本中,您可以使用pathlib进行递归遍历:

>>> import pathlib
>>> sorted(pathlib.Path('.').glob('**/*.py'))
[PosixPath('build/lib/pathlib.py'),
 PosixPath('docs/conf.py'),
 PosixPath('pathlib.py'),
 PosixPath('setup.py'),
 PosixPath('test_pathlib.py')]

参考:https : //docs.python.org/3/library/pathlib.html#pathlib.Path.glob

在Python 3.5或更高版本中,您还可以像这样进行递归glob:

>>> import glob
>>> glob.glob('**/*.txt', recursive=True)
['2.txt', 'sub/3.txt']

参考:https : //docs.python.org/3/library/glob.html#glob.glob

In Python 3.4 or newer you can use pathlib to do recursive globbing:

>>> import pathlib
>>> sorted(pathlib.Path('.').glob('**/*.py'))
[PosixPath('build/lib/pathlib.py'),
 PosixPath('docs/conf.py'),
 PosixPath('pathlib.py'),
 PosixPath('setup.py'),
 PosixPath('test_pathlib.py')]

Reference: https://docs.python.org/3/library/pathlib.html#pathlib.Path.glob

In Python 3.5 or newer you can also do recursive globbing like this:

>>> import glob
>>> glob.glob('**/*.txt', recursive=True)
['2.txt', 'sub/3.txt']

Reference: https://docs.python.org/3/library/glob.html#glob.glob


回答 4

为了快速进行与操作系统无关的搜索,请使用 scandir

https://github.com/benhoyt/scandir/#readme

阅读http://bugs.python.org/issue11406了解详细信息。

For fast, OS-independent search, use scandir

https://github.com/benhoyt/scandir/#readme

Read http://bugs.python.org/issue11406 for details why.


回答 5

如果您使用的是Python 2,则由于自引用符号链接会导致Windows上的无限递归问题。

该脚本将避免遵循这些规则。请注意,这是特定于Windows的

import os
from scandir import scandir
import ctypes

def is_sym_link(path):
    # http://stackoverflow.com/a/35915819
    FILE_ATTRIBUTE_REPARSE_POINT = 0x0400
    return os.path.isdir(path) and (ctypes.windll.kernel32.GetFileAttributesW(unicode(path)) & FILE_ATTRIBUTE_REPARSE_POINT)

def find(base, filenames):
    hits = []

    def find_in_dir_subdir(direc):
        content = scandir(direc)
        for entry in content:
            if entry.name in filenames:
                hits.append(os.path.join(direc, entry.name))

            elif entry.is_dir() and not is_sym_link(os.path.join(direc, entry.name)):
                try:
                    find_in_dir_subdir(os.path.join(direc, entry.name))
                except UnicodeDecodeError:
                    print "Could not resolve " + os.path.join(direc, entry.name)
                    continue

    if not os.path.exists(base):
        return
    else:
        find_in_dir_subdir(base)

    return hits

它返回一个列表,其中包含指向文件名列表中文件的所有路径。用法:

find("C:\\", ["file1.abc", "file2.abc", "file3.abc", "file4.abc", "file5.abc"])

If you are working with Python 2 you have a problem with infinite recursion on windows caused by self-referring symlinks.

This script will avoid following those. Note that this is windows-specific!

import os
from scandir import scandir
import ctypes

def is_sym_link(path):
    # http://stackoverflow.com/a/35915819
    FILE_ATTRIBUTE_REPARSE_POINT = 0x0400
    return os.path.isdir(path) and (ctypes.windll.kernel32.GetFileAttributesW(unicode(path)) & FILE_ATTRIBUTE_REPARSE_POINT)

def find(base, filenames):
    hits = []

    def find_in_dir_subdir(direc):
        content = scandir(direc)
        for entry in content:
            if entry.name in filenames:
                hits.append(os.path.join(direc, entry.name))

            elif entry.is_dir() and not is_sym_link(os.path.join(direc, entry.name)):
                try:
                    find_in_dir_subdir(os.path.join(direc, entry.name))
                except UnicodeDecodeError:
                    print "Could not resolve " + os.path.join(direc, entry.name)
                    continue

    if not os.path.exists(base):
        return
    else:
        find_in_dir_subdir(base)

    return hits

It returns a list with all paths that point to files in the filenames list. Usage:

find("C:\\", ["file1.abc", "file2.abc", "file3.abc", "file4.abc", "file5.abc"])

回答 6

下面,我们使用布尔值“ first”在第一匹配项和所有匹配项之间切换(默认值等效于“ find。-name文件”):

import  os

def find(root, file, first=False):
    for d, subD, f in os.walk(root):
        if file in f:
            print("{0} : {1}".format(file, d))
            if first == True:
                break 

Below we use a boolean “first” argument to switch between first match and all matches (a default which is equivalent to “find . -name file”):

import  os

def find(root, file, first=False):
    for d, subD, f in os.walk(root):
        if file in f:
            print("{0} : {1}".format(file, d))
            if first == True:
                break 

回答 7

答案与现有答案非常相似,但略有优化。

因此,您可以按模式查找任何文件或文件夹:

def iter_all(pattern, path):
    return (
        os.path.join(root, entry)
        for root, dirs, files in os.walk(path)
        for entry in dirs + files
        if pattern.match(entry)
    )

通过子字符串:

def iter_all(substring, path):
    return (
        os.path.join(root, entry)
        for root, dirs, files in os.walk(path)
        for entry in dirs + files
        if substring in entry
    )

或使用谓词:

def iter_all(predicate, path):
    return (
        os.path.join(root, entry)
        for root, dirs, files in os.walk(path)
        for entry in dirs + files
        if predicate(entry)
    )

仅搜索文件或仅搜索文件夹-例如,根据需要,将“ dirs + files”替换为“ dirs”或“ files”。

问候。

The answer is very similar to existing ones, but slightly optimized.

So you can find any files or folders by pattern:

def iter_all(pattern, path):
    return (
        os.path.join(root, entry)
        for root, dirs, files in os.walk(path)
        for entry in dirs + files
        if pattern.match(entry)
    )

either by substring:

def iter_all(substring, path):
    return (
        os.path.join(root, entry)
        for root, dirs, files in os.walk(path)
        for entry in dirs + files
        if substring in entry
    )

or using a predicate:

def iter_all(predicate, path):
    return (
        os.path.join(root, entry)
        for root, dirs, files in os.walk(path)
        for entry in dirs + files
        if predicate(entry)
    )

to search only files or only folders – replace “dirs + files”, for example, with only “dirs” or only “files”, depending on what you need.

Regards.


回答 8

SARose的答案对我一直有效,直到我从Ubuntu 20.04 LTS更新为止。我对他的代码进行的微小更改使其可以在最新的Ubuntu版本上使用。

import subprocess

def find_files(file_name):
    file_name = 'chromedriver'
    command = ['locate'+ ' ' + file_name]
    output = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True).communicate()[0]
    output = output.decode()
    search_results = output.split('\n')
    return search_results

SARose’s answer worked for me until I updated from Ubuntu 20.04 LTS. The slight change I made to his code makes it work on the latest Ubuntu release.

import subprocess

def find_files(file_name):
    command = ['locate'+ ' ' + file_name]
    output = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True).communicate()[0]
    output = output.decode()
    search_results = output.split('\n')
    return search_results