Question: How do I read a (static) file from inside a Python package?

Could you tell me how I can read a file that is inside my Python package?

My situation

A package that I load has a number of templates (text files used as strings) that I want to load from within the program. But how do I specify the path to such a file?

Imagine I want to read a file from:

package\templates\temp_file

Some kind of path manipulation? Package base path tracking?


Answer 0

[Added 2016-06-15: apparently this doesn’t work in all situations. Please refer to the other answers.]


import os, mypackage
template = os.path.join(mypackage.__path__[0], 'templates', 'temp_file')

Answer 1

TL;DR: Use the standard library’s importlib.resources module, as explained in method 2 below.

The traditional pkg_resources from setuptools is no longer recommended, because the new method:

  • is significantly more performant;
  • is safer, since using packages (instead of path strings) surfaces mistakes early, at import time;
  • is more intuitive, because you don’t have to “join” paths;
  • is faster when developing, since you don’t need an extra dependency (setuptools) and rely on Python’s standard library alone.

I kept the traditional method listed first, to explain the differences from the new method when porting existing code (porting is also explained here).



Let’s assume your templates are located in a folder nested inside your module’s package:

  <your-package>
    +--<module-asking-the-file>
    +--templates/
          +--temp_file                         <-- We want this file.

Note 1: For sure, we should NOT fiddle with the __file__ attribute (e.g. code will break when served from a zip).

Note 2: If you are building this package, remember to declare your data files as package_data or data_files in your setup.py.

1) Using pkg_resources from setuptools (slow)

You may use the pkg_resources package from the setuptools distribution, but that comes with a cost, performance-wise:

import pkg_resources

# Could be any dot-separated package/module name or a "Requirement"
resource_package = __name__
resource_path = '/'.join(('templates', 'temp_file'))  # Do not use os.path.join()
template = pkg_resources.resource_string(resource_package, resource_path)
# or for a file-like stream:
template = pkg_resources.resource_stream(resource_package, resource_path)

Tips:

  • This will read data even if your distribution is zipped, so you may set zip_safe=True in your setup.py, and/or use the long-awaited zipapp packer from python-3.5 to create self-contained distributions.

  • Remember to add setuptools to your run-time requirements (e.g. in install_requires).

… and notice that according to the Setuptools/pkg_resources docs, you should not use os.path.join:

Basic Resource Access

Note that resource names must be /-separated paths and cannot be absolute (i.e. no leading /) or contain relative names like “..”. Do not use os.path routines to manipulate resource paths, as they are not filesystem paths.

2) Python >= 3.7, or using the backported importlib_resources library

Use the standard library’s importlib.resources module, which is more efficient than setuptools, above:

try:
    import importlib.resources as pkg_resources
except ImportError:
    # Fall back to the `importlib_resources` backport on Python < 3.7.
    import importlib_resources as pkg_resources

from . import templates  # relative-import the *package* containing the templates

template = pkg_resources.read_text(templates, 'temp_file')
# or for a file-like stream:
template = pkg_resources.open_text(templates, 'temp_file')

Attention:

Regarding the function read_text(package, resource):

  • The package can be either a string or a module.
  • The resource is NOT a path anymore, but just the filename of the resource to open, within an existing package; it may not contain path separators and it may not have sub-resources (i.e. it cannot be a directory).

For the example asked in the question, we must now:

  • make the <your_package>/templates/ into a proper package, by creating an empty __init__.py file in it,
  • so now we can use a simple (possibly relative) import statement (no more parsing package/module names),
  • and simply ask for resource_name = "temp_file" (no path).
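The three steps above can be sketched end to end. This is only an illustration under stated assumptions: the package name subpkg_demo and the helper read_template_text are hypothetical, and the package is created in a temporary directory purely so the snippet runs on its own. Some Python versions flag read_text as deprecated, so the call is wrapped to silence that warning.

```python
import importlib
import importlib.resources as pkg_resources
import sys
import tempfile
import warnings
from pathlib import Path


def read_template_text():
    """Build a throwaway package on disk and read templates/temp_file from it."""
    with tempfile.TemporaryDirectory() as tmp:
        templates_dir = Path(tmp) / "subpkg_demo" / "templates"  # hypothetical names
        templates_dir.mkdir(parents=True)
        # Step 1: an empty __init__.py turns templates/ into a proper sub-package.
        (templates_dir.parent / "__init__.py").write_text("")
        (templates_dir / "__init__.py").write_text("")
        (templates_dir / "temp_file").write_text("Dear {name},\n", encoding="utf-8")

        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            # Step 2: a plain import gives us the package object.
            templates = importlib.import_module("subpkg_demo.templates")
            with warnings.catch_warnings():
                warnings.simplefilter("ignore", DeprecationWarning)
                # Step 3: ask for the bare resource name, with no path.
                return pkg_resources.read_text(templates, "temp_file")
        finally:
            sys.path.remove(tmp)
            for name in ("subpkg_demo.templates", "subpkg_demo"):
                sys.modules.pop(name, None)


if __name__ == "__main__":
    print(read_template_text().format(name="reader"))
```

In real code the import would simply be `from . import templates`, as in the snippet above this list; the temporary-directory scaffolding exists only to keep the sketch self-contained.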

Tips:

  • To access a file inside the current module, set the package argument to __package__, e.g. pkg_resources.read_text(__package__, 'temp_file') (thanks to @ben-mares).
  • Things become interesting when an actual filename is asked with path(), since now context-managers are used for temporarily-created files (read this).
  • Add the backported library, conditionally for older Pythons, with install_requires=["importlib_resources ; python_version<'3.7'"] (check this if you package your project with setuptools<36.2.1).
  • Remember to remove the setuptools library from your runtime requirements if you migrated from the traditional method.
  • Remember to customize setup.py or MANIFEST to include any static files.
  • You may also set zip_safe=True in your setup.py.
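The path() tip is easier to follow with a concrete case. The sketch below assumes a hypothetical package pathdemo_pkg that lives inside a zip archive, so no real file exists for the resource; path() then hands out a filesystem path that should only be relied on inside the with block. The helper name read_via_path is made up for this illustration, and the deprecation warning that some Python versions emit for path() is silenced.

```python
import importlib
import importlib.resources as pkg_resources
import sys
import tempfile
import warnings
import zipfile
from pathlib import Path


def read_via_path():
    """Read a zipped resource through the context manager returned by path()."""
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = Path(tmp) / "bundle.zip"
        with zipfile.ZipFile(zip_path, "w") as archive:
            archive.writestr("pathdemo_pkg/__init__.py", "")
            archive.writestr("pathdemo_pkg/templates/__init__.py", "")
            archive.writestr("pathdemo_pkg/templates/temp_file", "zipped template\n")

        sys.path.insert(0, str(zip_path))
        try:
            importlib.invalidate_caches()
            with warnings.catch_warnings():
                warnings.simplefilter("ignore", DeprecationWarning)
                # real_path is a pathlib.Path usable by APIs that need a filename.
                with pkg_resources.path("pathdemo_pkg.templates", "temp_file") as real_path:
                    return real_path.read_text(encoding="utf-8")
        finally:
            sys.path.remove(str(zip_path))
            for name in ("pathdemo_pkg.templates", "pathdemo_pkg"):
                sys.modules.pop(name, None)


if __name__ == "__main__":
    print(read_via_path())
```

If the content alone is enough, read_text() or open_text() avoids the temporary path entirely; path() is mainly useful when a third-party API insists on a filename.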

Answer 2

A packaging prelude:

Before you can even worry about reading resource files, the first step is to make sure that the data files are getting packaged into your distribution in the first place – it is easy to read them directly from the source tree, but the important part is making sure these resource files are accessible from code within an installed package.

Structure your project like this, putting data files into a subdirectory within the package:

.
├── package
│   ├── __init__.py
│   ├── templates
│   │   └── temp_file
│   ├── mymodule1.py
│   └── mymodule2.py
├── README.rst
├── MANIFEST.in
└── setup.py

You should pass include_package_data=True in the setup() call. The manifest file is only needed if you want to use setuptools/distutils and build source distributions. To make sure the templates/temp_file gets packaged for this example project structure, add a line like this into the manifest file:

recursive-include package *
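For reference, here is a rough sketch of the keyword arguments such a setup.py might carry. The distribution name and version are placeholders, and the package_data entry is shown only as an explicit alternative to relying on the manifest; treat it as an assumption about one reasonable layout rather than the required configuration.

```python
# Hypothetical keyword arguments for setuptools.setup() in this project's setup.py.
SETUP_KWARGS = dict(
    name="example-project",      # placeholder distribution name
    version="0.1",               # placeholder version
    packages=["package"],
    # Pick up data files matched by MANIFEST.in when building distributions.
    include_package_data=True,
    # Explicit alternative: list the data files per package.
    package_data={"package": ["templates/*"]},
)

# In a real setup.py this would be followed by:
#     from setuptools import setup
#     setup(**SETUP_KWARGS)
```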

Historical cruft note: A manifest file is not needed with modern build backends such as flit and poetry, which include the package data files by default. So, if you’re using pyproject.toml and you don’t have a setup.py file, you can ignore all the stuff about MANIFEST.in.

Now, with packaging out of the way, onto the reading part…

Recommendation:

Use standard library pkgutil APIs. It’s going to look like this in library code:

# within package/mymodule1.py, for example
import pkgutil

data = pkgutil.get_data(__name__, "templates/temp_file")

It works in zips. It works on Python 2 and Python 3. It doesn’t require third-party dependencies. I’m not really aware of any downsides (if you are, then please comment on the answer).
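To see the zip claim in action, the following sketch packs a hypothetical package named zipdemo_pkg into an archive, puts the archive on sys.path, and lets a module inside it load its own template with pkgutil.get_data. All names here (zipdemo_pkg, load_template, read_template_from_zip) are invented for the illustration.

```python
import importlib
import sys
import tempfile
import zipfile
from pathlib import Path

# Source of a hypothetical module that reads a resource next to itself.
MODULE_SOURCE = '''
import pkgutil

def load_template():
    # The resource path is /-separated and relative to the package directory.
    return pkgutil.get_data(__name__, "templates/temp_file")
'''


def read_template_from_zip():
    """Import the package straight from a zip file and return the raw bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = Path(tmp) / "bundle.zip"
        with zipfile.ZipFile(zip_path, "w") as archive:
            archive.writestr("zipdemo_pkg/__init__.py", "")
            archive.writestr("zipdemo_pkg/mymodule1.py", MODULE_SOURCE)
            archive.writestr("zipdemo_pkg/templates/temp_file", "Hello, {name}!\n")

        sys.path.insert(0, str(zip_path))
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("zipdemo_pkg.mymodule1")
            return module.load_template()
        finally:
            sys.path.remove(str(zip_path))
            for name in ("zipdemo_pkg.mymodule1", "zipdemo_pkg"):
                sys.modules.pop(name, None)


if __name__ == "__main__":
    data = read_template_from_zip()          # bytes
    print(data.decode("utf-8").format(name="world"))
```

Note that get_data always returns bytes, so text templates need an explicit decode() with the encoding you wrote them in.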

Bad ways to avoid:

Bad way #1: using relative paths from a source file

This is currently the accepted answer. At best, it looks something like this:

from pathlib import Path

resource_path = Path(__file__).parent / "templates"
data = resource_path.joinpath("temp_file").read_bytes()

What’s wrong with that? The assumption that you have files and subdirectories available is not correct. This approach doesn’t work if executing code which is packed in a zip or a wheel, and it may be entirely out of the user’s control whether or not your package gets extracted to a filesystem at all.
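A small experiment makes the failure concrete. In this sketch a hypothetical package zipfail_pkg is imported directly from a zip archive; its module builds a path from __file__ and tries to read it, which raises an OSError because the "directory" is really an archive. The names zipfail_pkg and file_based_read_fails_in_zip are assumptions made for the demonstration.

```python
import importlib
import sys
import tempfile
import zipfile
from pathlib import Path

# Source of a hypothetical module that locates its data through __file__.
MODULE_SOURCE = '''
from pathlib import Path

def load_template():
    resource_path = Path(__file__).parent / "templates"
    return resource_path.joinpath("temp_file").read_bytes()
'''


def file_based_read_fails_in_zip():
    """Return True if the __file__-relative read raises OSError inside a zip."""
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = Path(tmp) / "bundle.zip"
        with zipfile.ZipFile(zip_path, "w") as archive:
            archive.writestr("zipfail_pkg/__init__.py", "")
            archive.writestr("zipfail_pkg/mymodule1.py", MODULE_SOURCE)
            archive.writestr("zipfail_pkg/templates/temp_file", "data")

        sys.path.insert(0, str(zip_path))
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("zipfail_pkg.mymodule1")
            try:
                module.load_template()
            except OSError:
                # The path points inside the archive, which the OS cannot open.
                return True
            return False
        finally:
            sys.path.remove(str(zip_path))
            for name in ("zipfail_pkg.mymodule1", "zipfail_pkg"):
                sys.modules.pop(name, None)


if __name__ == "__main__":
    print("read via __file__ failed:", file_based_read_fails_in_zip())
```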

Bad way #2: using pkg_resources APIs

This is described in the top-voted answer. It looks something like this:

from pkg_resources import resource_string

data = resource_string(__name__, "templates/temp_file")

What’s wrong with that? It adds a runtime dependency on setuptools, which should preferably be an install time dependency only. Importing and using pkg_resources can become really slow, as the code builds up a working set of all installed packages, even though you were only interested in your own package resources. That’s not a big deal at install time (since installation is once-off), but it’s ugly at runtime.

Bad way #3: using importlib.resources APIs

This is currently the recommendation in the top-voted answer. It’s a recent standard library addition (new in Python 3.7). It looks like this:

from importlib.resources import read_binary

data = read_binary("package.templates", "temp_file")

What’s wrong with that? Well, unfortunately, it doesn’t work…yet. This is still an incomplete API: using importlib.resources will require you to add an empty file templates/__init__.py so that the data files reside within a sub-package rather than in a subdirectory. It will also expose the package/templates subdirectory as an importable package.templates sub-package in its own right. If that’s not a big deal and it doesn’t bother you, then you can go ahead and add the __init__.py file there and use the import system to access resources. However, while you’re at it you may as well make it into a my_resources.py file instead, and just define some bytes or string variables in the module, then import them in Python code. It’s the import system doing the heavy lifting here either way.

Honorable mention: using newer importlib_resources APIs

This has not been mentioned in any other answers yet, but importlib_resources is more than a simple backport of the Python 3.7+ importlib.resources code. It has traversable APIs which you can use like this:

import importlib_resources

my_resources = importlib_resources.files("package")
data = (my_resources / "templates" / "temp_file").read_bytes()

This works on Python 2 and 3, it works in zips, and it doesn’t require spurious __init__.py files to be added in resource subdirectories. The only downside vs pkgutil that I can see is that these new APIs haven’t yet arrived in stdlib, so there is still a third-party dependency. Newer APIs from importlib_resources should arrive to stdlib importlib.resources in Python 3.9.
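On Python 3.9 and later the same files() call is available from the standard library, so a reasonable pattern is to try the stdlib import first and fall back to the backport. The sketch below assumes a hypothetical package files_demo_pkg created in a temporary directory, with a plain templates/ subdirectory that deliberately has no __init__.py; the helper name read_with_files is also invented here.

```python
import importlib
import sys
import tempfile
from pathlib import Path

try:
    from importlib.resources import files   # standard library, Python 3.9+
except ImportError:
    from importlib_resources import files    # third-party backport


def read_with_files():
    """Read templates/temp_file through the traversable files() API."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = Path(tmp) / "files_demo_pkg"    # hypothetical package name
        (pkg_dir / "templates").mkdir(parents=True)
        (pkg_dir / "__init__.py").write_text("")
        # templates/ is an ordinary subdirectory: no __init__.py is needed.
        (pkg_dir / "templates" / "temp_file").write_bytes(b"raw template bytes")

        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            my_resources = files("files_demo_pkg")
            return (my_resources / "templates" / "temp_file").read_bytes()
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("files_demo_pkg", None)


if __name__ == "__main__":
    print(read_with_files())
```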

Example project:

I’ve created an example project on GitHub and uploaded it to PyPI, which demonstrates all five approaches discussed above. Try it out with:

$ pip install resources-example
$ resources-example

See https://github.com/wimglenn/resources-example for more info.


Answer 3

In case you have this structure:

lidtk
├── bin
│   └── lidtk
├── lidtk
│   ├── analysis
│   │   ├── char_distribution.py
│   │   └── create_cm.py
│   ├── classifiers
│   │   ├── char_dist_metric_train_test.py
│   │   ├── char_features.py
│   │   ├── cld2
│   │   │   ├── cld2_preds.txt
│   │   │   └── cld2wili.py
│   │   ├── get_cld2.py
│   │   ├── text_cat
│   │   │   ├── __init__.py
│   │   │   ├── README.md   <---------- say you want to get this
│   │   │   └── textcat_ngram.py
│   │   └── tfidf_features.py
│   ├── data
│   │   ├── __init__.py
│   │   ├── create_ml_dataset.py
│   │   ├── download_documents.py
│   │   ├── language_utils.py
│   │   ├── pickle_to_txt.py
│   │   └── wili.py
│   ├── __init__.py
│   ├── get_predictions.py
│   ├── languages.csv
│   └── utils.py
├── README.md
├── setup.cfg
└── setup.py

you need this code:

import pkg_resources

# __name__ in case you're within the package
# - otherwise it would be 'lidtk' in this example as it is the package name
path = 'classifiers/text_cat/README.md'  # always use slash
filepath = pkg_resources.resource_filename(__name__, path)

The strange “always use slash” part comes from the setuptools APIs:

Also notice that if you use paths, you must use a forward slash (/) as the path separator, even if you are on Windows. Setuptools automatically converts slashes to appropriate platform-specific separators at build time.

In case you wonder where the documentation is:


Answer 4

Section 10.8, “Reading Datafiles Within a Package”, of Python Cookbook, Third Edition by David Beazley and Brian K. Jones gives the answer.

I’ll just quote it here:

Suppose you have a package with files organized as follows:

mypackage/
    __init__.py
    somedata.dat
    spam.py

Now suppose the file spam.py wants to read the contents of the file somedata.dat. To do it, use the following code:

import pkgutil
data = pkgutil.get_data(__package__, 'somedata.dat')

The resulting variable data will be a byte string containing the raw contents of the file.

The first argument to get_data() is a string containing the package name. You can either supply it directly or use a special variable, such as __package__. The second argument is the relative name of the file within the package. If necessary, you can navigate into different directories using standard Unix filename conventions as long as the final directory is still located within the package.

In this way, the package can be installed as a directory, a .zip, or an .egg.


Answer 5

Every Python module in your package has a __file__ attribute.

You can use it as:

import os
import mypackage

templates_dir = os.path.join(os.path.dirname(mypackage.__file__), 'templates')
template_file = os.path.join(templates_dir, 'template.txt')

For egg resources see: http://peak.telecommunity.com/DevCenter/PythonEggs#accessing-package-resources


Answer 6

Assuming you are using an egg file that is not extracted:

I “solved” this in a recent project by using a postinstall script that extracts my templates from the egg (zip file) to the proper directory in the filesystem. It was the quickest, most reliable solution I found, since working with __path__[0] can sometimes go wrong (I don’t recall the name, but I came across at least one library that added something to the front of that list!).

Also, egg files are usually extracted on the fly to a temporary location called the “egg cache”. You can change that location using an environment variable, either before starting your script or even later, e.g.:

os.environ['PYTHON_EGG_CACHE'] = path

However, there is pkg_resources, which might do the job properly.

