Is it pythonic to import inside functions?

Question: Is it pythonic to import inside functions?


PEP 8 says:

  • Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.

On occasion, I violate PEP 8. Sometimes I import stuff inside functions. As a general rule, I do this if there is an import that is only used within a single function.

Any opinions?

EDIT (the reason I feel importing in functions can be a good idea):

Main reason: It can make the code clearer.

  • When looking at the code of a function I might ask myself: “What is function/class xxx?” (xxx being used inside the function). If I have all my imports at the top of the module, I have to go look there to determine what xxx is. This is more of an issue when using from m import xxx. Seeing m.xxx in the function probably tells me more. Depending on what m is: Is it a well-known top-level module/package (import m)? Or is it a sub-module/package (from a.b.c import m)?
  • In some cases having that extra information (“What is xxx?”) close to where xxx is used can make the function easier to understand.

Answer 0


In the long run I think you’ll appreciate having most of your imports at the top of the file, that way you can tell at a glance how complicated your module is by what it needs to import.

If I’m adding new code to an existing file I’ll usually do the import where it’s needed and then if the code stays I’ll make things more permanent by moving the import line to the top of the file.

One other point, I prefer to get an ImportError exception before any code is run — as a sanity check, so that’s another reason to import at the top.

I use pyChecker to check for unused modules.


Answer 1


There are two occasions where I violate PEP 8 in this regard:

  • Circular imports: module A imports module B, but something in module B needs module A (though this is often a sign that I need to refactor the modules to eliminate the circular dependency)
  • Inserting a pdb breakpoint: import pdb; pdb.set_trace() This is handy because I don’t want to put import pdb at the top of every module I might want to debug, and it’s easy to remember to remove the import when I remove the breakpoint.

Outside of these two cases, it’s a good idea to put everything at the top. It makes the dependencies clearer.
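For the circular-import case, deferring one of the imports into the function that needs it is usually enough to break the cycle. A minimal sketch with hypothetical module names:

# module_a.py
import module_b          # module_b does not need module_a at import time

# module_b.py
def render_report(data):
    # deferred import: module_a is only needed when this function runs,
    # so there is no circular dependency at import time
    from module_a import format_rows
    return format_rows(data)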


Answer 2


Here are the four import use cases that we use

  1. import (and from x import y and import x as y) at the top

  2. Choices for Import. At the top.

    import settings
    if settings.something:
        import this as foo
    else:
        import that as foo
    
  3. Conditional Import. Used with JSON, XML libraries and the like. At the top.

    try:
        import this as foo
    except ImportError:
        import that as foo
    
  4. Dynamic Import. So far, we only have one example of this.

    import settings
    module_stuff = {}
    module = __import__(settings.some_module, module_stuff)
    x = module_stuff['x']
    

    Note that this dynamic import doesn’t bring in code, but brings in complex data structures written in Python. It’s kind of like a pickled piece of data except we pickled it by hand.

    This is also, more-or-less, at the top of a module


Here’s what we do to make the code clearer:

  • Keep the modules short.

  • If I have all my imports at the top of the module, I have to go look there to determine what a name is. If the module is short, that’s easy to do.

  • In some cases having that extra information close to where a name is used can make the function easier to understand. If the module is short, that’s easy to do.


Answer 3


One thing to bear in mind: needless imports can cause performance problems. So if this is a function that will be called frequently, you’re better off just putting the import at the top. Of course this is an optimization, so if there’s a valid case to be made that importing inside a function is more clear than importing at the top of a file, that trumps performance in most cases.

If you’re doing IronPython, I’m told that it’s better to import inside functions (since compiling code in IronPython can be slow). Thus, you may be able to get away with importing inside functions there. But other than that, I’d argue that it’s just not worth it to fight convention.

As a general rule, I do this if there is an import that is only used within a single function.

Another point I’d like to make is that this may be a potential maintenance problem. What happens if you add a function that uses a module that was previously used by only one function? Are you going to remember to add the import to the top of the file? Or are you going to scan each and every function for imports?

FWIW, there are cases where it makes sense to import inside a function. For example, if you want to set the language in cx_Oracle, you need to set an NLS_LANG environment variable before it is imported. Thus, you may see code like this:

import os

oracle = None

def InitializeOracle(lang):
    global oracle
    os.environ['NLS_LANG'] = lang
    import cx_Oracle
    oracle = cx_Oracle

Answer 4


I’ve broken this rule before for modules that are self-testing. That is, they are normally just used for support, but I define a main for them so that if you run them by themselves you can test their functionality. In that case I sometimes import getopt and cmd just in main, because I want it to be clear to someone reading the code that these modules have nothing to do with the normal operation of the module and are only being included for testing.


Answer 5


Coming from the question about loading the module twice – Why not both?

An import at the top of the script will indicate the dependencies, and another import in the function will make this function more atomic, while seemingly not causing any performance disadvantage, since a repeated import is cheap.


Answer 6


As long as it’s import and not from x import *, you should put them at the top. It adds just one name to the global namespace, and you stick to PEP 8. Plus, if you later need it somewhere else, you don’t have to move anything around.

It’s no big deal, but since there’s almost no difference I’d suggest doing what PEP 8 says.
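To illustrate the namespace point, a small sketch using the standard math module:

import math              # adds exactly one name, "math", to this namespace
print(math.sqrt(2))

from math import *       # dumps every public name from math into this namespace
print(sqrt(2))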


Answer 7


Have a look at the alternative approach that’s used in sqlalchemy: dependency injection:

@util.dependencies("sqlalchemy.orm.query")
def merge_result(query, *args):
    #...
    query.Query(...)

Notice how the imported library is declared in a decorator, and passed as an argument to the function!

This approach makes the code cleaner, and also works 4.5 times faster than an import statement!

Benchmark: https://gist.github.com/kolypto/589e84fbcfb6312532658df2fabdb796
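The decorator shown is sqlalchemy's own internal utility. As a rough sketch of the same idea (a hypothetical helper, not sqlalchemy's actual util.dependencies), a lazy-import decorator could look like this:

import importlib
from functools import wraps

def dependencies(*module_names):
    # import the named modules when the function is called and pass them
    # as leading arguments, so the import cost is deferred until first use
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            modules = [importlib.import_module(name) for name in module_names]
            return func(*modules, *args, **kwargs)
        return wrapper
    return decorator

@dependencies("json")
def dump(json, data):
    return json.dumps(data)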


Answer 8


In modules that are both ‘normal’ modules and can be executed (i.e. have an if __name__ == '__main__': section), I usually import modules that are only used when executing the module inside the main section.

Example:

def really_useful_function(data):
    ...


def main():
    from pathlib import Path
    from argparse import ArgumentParser
    from dataloader import load_data_from_directory

    parser = ArgumentParser()
    parser.add_argument('directory')
    args = parser.parse_args()
    data = load_data_from_directory(Path(args.directory))
    print(really_useful_function(data))


if __name__ == '__main__':
    main()

Answer 9


There’s another (probably “corner”) case where it may be beneficial to import inside rarely used functions: shorten startup time.

I hit that wall once with a rather complex program running on a small IoT server accepting commands from a serial line and performing operations, possibly very complex operations.

Placing import statements at the top of files meant having all imports processed before server start; since the import list included jinja2, lxml, signxml and other “heavy weights” (and the SoC was not very powerful) this meant minutes before the first instruction was actually executed.

OTOH placing most imports in functions I was able to have the server “alive” on the serial line in seconds. Of course when the modules were actually needed I had to pay the price (Note: also this can be mitigated by spawning a background task doing imports in idle time).
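A minimal sketch of that pattern (assuming jinja2 and a templates/ directory; the first call pays the import cost, later calls reuse the cached module):

_jinja2 = None

def render_template(name, **context):
    global _jinja2
    if _jinja2 is None:
        import jinja2          # heavy import deferred until first use
        _jinja2 = jinja2
    env = _jinja2.Environment(loader=_jinja2.FileSystemLoader("templates"))
    return env.get_template(name).render(**context)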


type object ‘datetime.datetime’ has no attribute ‘datetime’

Question: type object ‘datetime.datetime’ has no attribute ‘datetime’


I have gotten the following error:

type object ‘datetime.datetime’ has no attribute ‘datetime’

On the following line:

date = datetime.datetime(int(year), int(month), 1)

Does anybody know the reason for the error?

I imported datetime with from datetime import datetime if that helps

Thanks


Answer 0


Datetime is a module that allows for handling of dates, times and datetimes (all of which are datatypes). This means that datetime is both a top-level module as well as being a type within that module. This is confusing.

Your error is probably based on the confusing naming of the module, and what either you or a module you’re using has already imported.

>>> import datetime
>>> datetime
<module 'datetime' from '/usr/lib/python2.6/lib-dynload/datetime.so'>
>>> datetime.datetime(2001,5,1)
datetime.datetime(2001, 5, 1, 0, 0)

But, if you import datetime.datetime:

>>> from datetime import datetime
>>> datetime
<type 'datetime.datetime'>
>>> datetime.datetime(2001,5,1) # You shouldn't expect this to work 
                                # as you imported the type, not the module
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'datetime.datetime' has no attribute 'datetime'
>>> datetime(2001,5,1)
datetime.datetime(2001, 5, 1, 0, 0)

I suspect you or one of the modules you’re using has imported like this: from datetime import datetime.


Answer 1


For python 3.3

from datetime import datetime, timedelta
futuredate = datetime.now() + timedelta(days=10)

Answer 2


You should use

date = datetime(int(year), int(month), 1)

Or change

from datetime import datetime

to

import datetime

Answer 3


You should really import the module into its own alias.

import datetime as dt
my_datetime = dt.datetime(year, month, day)

The above has the following benefits over the other solutions:

  • Calling the variable my_datetime instead of date reduces confusion since there is already a date in the datetime module (datetime.date).
  • The module and the class (both called datetime) do not shadow each other.

Answer 4


If you have used:

from datetime import datetime

Then simply write the code as:

date = datetime(int(year), int(month), 1)

But if you have used:

import datetime

then you have to write:

date = datetime.datetime(int(2005), int(5), 1)

Answer 5


I found this to be a lot easier

from dateutil import relativedelta
relativedelta.relativedelta(end_time,start_time).seconds
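One caveat worth flagging: relativedelta breaks the difference into components, so .seconds is only the seconds part, not the total number of seconds. A small sketch with made-up times:

from datetime import datetime
from dateutil import relativedelta

start_time = datetime(2021, 1, 1, 10, 0, 0)
end_time = datetime(2021, 1, 1, 10, 2, 30)

diff = relativedelta.relativedelta(end_time, start_time)
print(diff.minutes, diff.seconds)               # 2 30
print((end_time - start_time).total_seconds())  # 150.0 if you want the total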

Answer 6


I ran into the same error. Maybe you have already imported the module somewhere using only import datetime, so change from datetime import datetime to just import datetime. It worked for me after I changed it back.


Answer 7

from datetime import datetime
import time
from calendar import timegm
d = datetime.utcnow()
d = d.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
utc_time = time.strptime(d,"%Y-%m-%dT%H:%M:%S.%fZ")
epoch_time = timegm(utc_time)

How do I convert a Python .py to an .exe?

Question: How do I convert a Python .py to an .exe?


I’m trying to convert a fairly simple Python program to an executable and couldn’t find what I was looking for, so I have a few questions (I’m running Python 3.6):

The methods of doing this that I have found so far are as follows

  1. downloading an old version of Python and using pyinstaller/py2exe
  2. setting up a virtual environment in Python 3.6 that will allow me to do 1.
  3. downloading a Python to C++ converter and using that.

Here is what I’ve tried/what problems I’ve run into.

  • I installed pyinstaller before the prerequisite it needs (pypi-something), so it did not work. Even after downloading the prerequisite file, pyinstaller still does not recognize it.
  • If I’m setting up a virtualenv in Python 2.7, do I actually need to have Python 2.7 installed?
  • similarly, the only python to C++ converters I see work only up until Python 3.5 – do I need to download and use this version if attempting this?

Answer 0


Steps to convert .py to .exe in Python 3.6

  1. Install Python 3.6.
  2. Install cx_Freeze (open your command prompt and type pip install cx_Freeze).
  3. Install idna (open your command prompt and type pip install idna).
  4. Write a .py program named myfirstprog.py.
  5. Create a new python file named setup.py on the current directory of your script.
  6. In the setup.py file, copy the code below and save it.
  7. With shift pressed right click on the same directory, so you are able to open a command prompt window.
  8. In the prompt, type python setup.py build
  9. If your script is error free, then there will be no problem on creating application.
  10. Check the newly created folder build. It has another folder in it. Within that folder you can find your application. Run it. Make yourself happy.

See the original script in my blog.

setup.py:

from cx_Freeze import setup, Executable

base = None    

executables = [Executable("myfirstprog.py", base=base)]

packages = ["idna"]
options = {
    'build_exe': {    
        'packages':packages,
    },    
}

setup(
    name = "<any name>",
    options = options,
    version = "<any number>",
    description = '<any description>',
    executables = executables
)

EDIT:

  • be sure that instead of myfirstprog.py you put your own .py file name as created in step 4;
  • you should include each imported package in your .py into packages list (ex: packages = ["idna", "os","sys"])
  • any name, any number, any description in setup.py file should not remain the same, you should change it accordingly (ex:name = "<first_ever>", version = "0.11", description = '' )
  • the imported packages must be installed before you start step 8.

Answer 1


Python 3.6 is supported by PyInstaller.

Open a cmd window in your Python folder (open a command window and use cd or while holding shift, right click it on Windows Explorer and choose ‘Open command window here’). Then just enter

pip install pyinstaller

And that’s it.

The simplest way to use it is by entering on your command prompt

pyinstaller file_name.py

For more details on how to use it, take a look at this question.


Answer 2


There is an open source project called auto-py-to-exe on GitHub. It actually just uses PyInstaller internally, but since it has a simple GUI that controls PyInstaller it may be a comfortable alternative. It can also output a standalone file in contrast to other solutions. They also provide a video showing how to set it up.



Answer 3


I can’t tell you what’s best, but a tool I have used with success in the past was cx_Freeze. They recently updated (on Jan. 7, ’17) to version 5.0.1 and it supports Python 3.6.

Here’s the pypi https://pypi.python.org/pypi/cx_Freeze

The documentation shows that there is more than one way to do it, depending on your needs. http://cx-freeze.readthedocs.io/en/latest/overview.html

I have not tried it out yet, so I’m going to point to a post where the simple way of doing it was discussed. Some things may or may not have changed though.

How do I use cx_freeze?


Answer 4


I’ve been using Nuitka and PyInstaller with my package, PySimpleGUI.

Nuitka There were issues getting tkinter to compile with Nuitka. One of the project contributors developed a script that fixed the problem.

If you’re not using tkinter it may “just work” for you. If you are using tkinter say so and I’ll try to get the script and instructions published.

PyInstaller I’m running 3.6 and PyInstaller is working great! The command I use to create my exe file is:

pyinstaller -wF myfile.py

The -wF will create a single EXE file. Because all of my programs have a GUI and I do not want the command window to show, the -w option will hide the command window.

This is as close to getting what looks like a Winforms program to run that was written in Python.

[Update 20-Jul-2019]

There is a PySimpleGUI-based GUI solution that uses PyInstaller. It’s called pysimplegui-exemaker and can be pip installed.

pip install PySimpleGUI-exemaker

To run it after installing:

python -m pysimplegui-exemaker.pysimplegui-exemaker


Answer 5


Now you can convert it by using PyInstaller. It works with even Python 3.

Steps:

  1. Fire up your PC
  2. Open command prompt
  3. Enter command pip install pyinstaller
  4. When it is installed, use the command ‘cd’ to go to the working directory.
  5. Run the command pyinstaller <filename>

python pip: force install ignoring dependencies

Question: python pip: force install ignoring dependencies


Is there any way to force install a pip python package ignoring all of its dependencies that cannot be satisfied?

(I don’t care how “wrong” it is to do so, I just need to do it, any logic and reasoning aside…)


Answer 0


pip has a --no-dependencies switch. You should use that.

For more information, run pip install -h, where you’ll see this line:

--no-deps, --no-dependencies
                        Ignore package dependencies
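For example, with a hypothetical package name, the invocation is simply:

pip install --no-deps SomePackage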

Answer 1


When I was trying to install the librosa package with pip (pip install librosa), this error appeared:

ERROR: Cannot uninstall 'llvmlite'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

I tried to remove llvmlite, but pip uninstall could not remove it. So, I used pip’s ignore capability with this command:

pip install librosa --ignore-installed llvmlite

Indeed, you can use this rule for ignoring a package you don’t want to consider:

pip install {package you want to install} --ignore-installed {installed package you don't want to consider}

How to keep the index when using merge in pandas

Question: How to keep the index when using merge in pandas


I would like to merge two DataFrames, and keep the index from the first frame as the index on the merged dataset. However, when I do the merge, the resulting DataFrame has integer index. How can I specify that I want to keep the index from the left data frame?

In [4]: a = pd.DataFrame({'col1': {'a': 1, 'b': 2, 'c': 3}, 
                          'to_merge_on': {'a': 1, 'b': 3, 'c': 4}})

In [5]: b = pd.DataFrame({'col2': {0: 1, 1: 2, 2: 3}, 
                          'to_merge_on': {0: 1, 1: 3, 2: 5}})

In [6]: a
Out[6]:
   col1  to_merge_on
a     1            1
b     2            3
c     3            4

In [7]: b
Out[7]:
   col2  to_merge_on
0     1            1
1     2            3
2     3            5

In [8]: a.merge(b, how='left')
Out[8]:
   col1  to_merge_on  col2
0     1            1   1.0
1     2            3   2.0
2     3            4   NaN

In [9]: _.index
Out[9]: Int64Index([0, 1, 2], dtype='int64')

EDIT: Switched to example code that can be easily reproduced


Answer 0


In [5]: a.reset_index().merge(b, how="left").set_index('index')
Out[5]:
       col1  to_merge_on  col2
index
a         1            1     1
b         2            3     2
c         3            4   NaN

Note that for some left merge operations, you may end up with more rows than in a when there are multiple matches between a and b. In this case, you may need to drop duplicates.
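If that happens and you want exactly one row per original index, one possible follow-up (a sketch that keeps the first match for each index value) is:

merged = a.reset_index().merge(b, how="left").set_index('index')
merged = merged[~merged.index.duplicated(keep='first')]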


Answer 1


You can make a copy of the index on the left dataframe and do the merge.

a['copy_index'] = a.index
a.merge(b, how='left')

I found this simple method very useful while working with large dataframe and using pd.merge_asof() (or dd.merge_asof()).

This approach would be superior when resetting index is expensive (large dataframe).


Answer 2


There is a non-pd.merge solution using Series.map and DataFrame.set_index.

In: a['col2'] = a['to_merge_on'].map(b.set_index('to_merge_on')['col2'])
In: a
Out:
   col1  to_merge_on  col2
a     1            1   1.0
b     2            3   2.0
c     3            4   NaN

This doesn’t introduce a dummy index name for the index.

Note however that there is no DataFrame.map method, and so this approach is not for multiple columns.


Answer 3


df1 = df1.merge(df2, how="inner", left_index=True, right_index=True)

This allows to preserve the index of df1


Answer 4


Think I’ve come up with a different solution. I was joining the left table on index value and the right table on a column value based off index of left table. What I did was a normal merge:

First10ReviewsJoined = pd.merge(First10Reviews, df, left_index=True, right_on='Line Number')

Then I retrieved the new index numbers from the merged table and put them in a new column named Sentiment Line Number:

First10ReviewsJoined['Sentiment Line Number']= First10ReviewsJoined.index.tolist()

Then I manually set the index back to the original, left table index based off pre-existing column called Line Number (the column value I joined on from left table index):

First10ReviewsJoined.set_index('Line Number', inplace=True)

Then removed the index name of Line Number so that it remains blank:

First10ReviewsJoined.index.name = None

Maybe a bit of a hack but seems to work well and relatively simple. Also, guess it reduces risk of duplicates/messing up your data. Hopefully that all makes sense.


Answer 5


Another simple option is to set the index back to what it was before:

a.merge(b, how="left").set_axis(a.index)

merge preserves the row order of dataframe ‘a’, but just resets the index, so it’s safe to use set_axis.


How to count unique values in a list

Question: How to count unique values in a list


So I’m trying to make this program that will ask the user for input and store the values in an array / list.
Then when a blank line is entered it will tell the user how many of those values are unique.
I’m building this for real life reasons and not as a problem set.

enter: happy
enter: rofl
enter: happy
enter: mpg8
enter: Cpp
enter: Cpp
enter:
There are 4 unique words!

My code is as follows:

# ask for input
ipta = raw_input("Word: ")

# create list 
uniquewords = [] 
counter = 0
uniquewords.append(ipta)

a = 0   # loop thingy
# while loop to ask for input and append in list
while ipta: 
  ipta = raw_input("Word: ")
  new_words.append(input1)
  counter = counter + 1

for p in uniquewords:

..and that’s about all I’ve gotten so far.
I’m not sure how to count the unique number of words in a list?
If someone can post the solution so I can learn from it, or at least show me how it would be great, thanks!


Answer 0


In addition, use collections.Counter to refactor your code:

from collections import Counter

words = ['a', 'b', 'c', 'a']

Counter(words).keys() # equals to list(set(words))
Counter(words).values() # counts the elements' frequency

Output:

['a', 'c', 'b']
[2, 1, 1]
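If all you need is the number of unique values, the length of the Counter (its number of distinct keys) gives it directly:

from collections import Counter

words = ['a', 'b', 'c', 'a']
print(len(Counter(words)))  # 3 unique words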

Answer 1


You can use a set to remove duplicates, and then the len function to count the elements in the set:

len(set(new_words))

Answer 2

import numpy as np
values, counts = np.unique(words, return_counts=True)



Answer 3


Use a set:

words = ['a', 'b', 'c', 'a']
unique_words = set(words)             # == set(['a', 'b', 'c'])
unique_word_count = len(unique_words) # == 3

Armed with this, your solution could be as simple as:

words = []
ipta = raw_input("Word: ")

while ipta:
  words.append(ipta)
  ipta = raw_input("Word: ")

unique_word_count = len(set(words))

print "There are %d unique words!" % unique_word_count

Answer 4

aa="XXYYYSBAA"
bb=dict(zip(list(aa),[list(aa).count(i) for i in list(aa)]))
print(bb)
# output:
# {'X': 2, 'Y': 3, 'S': 1, 'B': 1, 'A': 2}
aa="XXYYYSBAA"
bb=dict(zip(list(aa),[list(aa).count(i) for i in list(aa)]))
print(bb)
# output:
# {'X': 2, 'Y': 3, 'S': 1, 'B': 1, 'A': 2}

Answer 5


For ndarray there is a numpy method called unique:

np.unique(array_name)

Examples:

>>> np.unique([1, 1, 2, 2, 3, 3])
array([1, 2, 3])
>>> a = np.array([[1, 1], [2, 3]])
>>> np.unique(a)
array([1, 2, 3])

For a Series there is a function called value_counts():

Series_name.value_counts()

Answer 6

ipta = raw_input("Word: ") ## asks for input
words = [] ## creates list
unique_words = set(words)
ipta = raw_input("Word: ") ## asks for input
words = [] ## creates list
unique_words = set(words)

Answer 7


Although a set is the easiest way, you could also use a dict and use some_dict.has_key(key) to populate a dictionary with only unique keys and values.

Assuming you have already populated words[] with input from the user, create a dict mapping the unique words in the list to a number:

word_map = {}
i = 1
for j in range(len(words)):
    if not word_map.has_key(words[j]):
        word_map[words[j]] = i
        i += 1                                                             
num_unique_words = len(word_map) # or num_unique_words = i - 1, however you prefer
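dict.has_key() only exists in Python 2; a rough Python 3 sketch of the same idea uses the in operator instead:

word_map = {}
for word in words:
    if word not in word_map:
        word_map[word] = len(word_map) + 1
num_unique_words = len(word_map)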

Answer 8


Other method by using pandas

import pandas as pd

LIST = ["a","a","c","a","a","v","d"]
counts,values = pd.Series(LIST).value_counts().values, pd.Series(LIST).value_counts().index
df_results = pd.DataFrame(list(zip(values,counts)),columns=["value","count"])

You can then export results in any format you want


Answer 9


How about:

import pandas as pd
#List with all words
words=[]

#Code for adding words
words.append('test')


#When Input equals blank:
pd.Series(words).nunique()

It returns how many unique values are in a list


Answer 10


The following should work. The lambda function filters out the duplicated words.

inputs=[]
input = raw_input("Word: ").strip()
while input:
    inputs.append(input)
    input = raw_input("Word: ").strip()
uniques=reduce(lambda x,y: ((y in x) and x) or x+[y], inputs, [])
print 'There are', len(uniques), 'unique words'

Answer 11


I’d use a set myself, but here’s yet another way:

uniquewords = []
while True:
    ipta = raw_input("Word: ")
    if ipta == "":
        break
    if not ipta in uniquewords:
        uniquewords.append(ipta)
print "There are", len(uniquewords), "unique words!"

Answer 12

ipta = raw_input("Word: ") ## asks for input
words = [] ## creates list

while ipta: ## while loop to ask for input and append in list
  words.append(ipta)
  ipta = raw_input("Word: ")
#Create a set, sets do not have repeats
unique_words = set(words)

print "There are " +  str(len(unique_words)) + " unique words!"
ipta = raw_input("Word: ") ## asks for input
words = [] ## creates list

while ipta: ## while loop to ask for input and append in list
  words.append(ipta)
  ipta = raw_input("Word: ")
  words.append(ipta)
#Create a set, sets do not have repeats
unique_words = set(words)

print "There are " +  str(len(unique_words)) + " unique words!"

How to get the latest file in a folder using Python

Question: How to get the latest file in a folder using Python


I need to get the latest file of a folder using python. While using the code:

max(files, key = os.path.getctime)

I am getting the below error:

FileNotFoundError: [WinError 2] The system cannot find the file specified: 'a'


Answer 0


Whatever is assigned to the files variable is incorrect. Use the following code.

import glob
import os

list_of_files = glob.glob('/path/to/folder/*') # * means all if need specific format then *.csv
latest_file = max(list_of_files, key=os.path.getctime)
print latest_file

Answer 1

max(files, key = os.path.getctime)

is quite incomplete code. What is files? It probably is a list of file names, coming out of os.listdir().

But this list lists only the filename parts (a. k. a. “basenames”), because their path is common. In order to use it correctly, you have to combine it with the path leading to it (and used to obtain it).

Such as (untested):

def newest(path):
    files = os.listdir(path)
    paths = [os.path.join(path, basename) for basename in files]
    return max(paths, key=os.path.getctime)

Answer 2


I would suggest using glob.iglob() instead of the glob.glob(), as it is more efficient.

glob.iglob() Return an iterator which yields the same values as glob() without actually storing them all simultaneously.

Which means glob.iglob() will be more efficient.

I mostly use the code below to find the latest file matching my pattern:

LatestFile = max(glob.iglob(fileNamePattern),key=os.path.getctime)


NOTE: There are variants of max function, In case of finding the latest file we will be using below variant: max(iterable, *[, key, default])

which needs an iterable, so your first parameter should be iterable. In case of finding the max of numbers we can use the below variant: max(num1, num2, num3, *args[, key])


Answer 3


Try to sort items by creation time. The example below sorts the files in a folder and gets the first element, which is the latest.

import glob
import os

files_path = os.path.join(folder, '*')
files = sorted(
    glob.iglob(files_path), key=os.path.getctime, reverse=True) 
print files[0]

Answer 4


I lack the reputation to comment, but ctime from Marlon Abeykoon’s response did not give the correct result for me. Using mtime does the trick though (key=os.path.getmtime).

import glob
import os

list_of_files = glob.glob('/path/to/folder/*') # * means all if need specific format then *.csv
latest_file = max(list_of_files, key=os.path.getmtime)
print latest_file

I found two answers for that problem:

  • python os.path.getctime max does not return latest
  • Difference between python – getmtime() and getctime() in unix system


Answer 5


(Edited to improve answer)

First define a function get_latest_file

def get_latest_file(path, *paths):
    fullpath = os.path.join(path, *paths)
    ...
get_latest_file('example', 'files','randomtext011.*.txt')

You may also use a docstring !

def get_latest_file(path, *paths):
    """Returns the name of the latest (most recent) file 
    of the joined path(s)"""
    fullpath = os.path.join(path, *paths)

If you use Python 3, you can use iglob instead.

Complete code to return the name of latest file:

def get_latest_file(path, *paths):
    """Returns the name of the latest (most recent) file 
    of the joined path(s)"""
    fullpath = os.path.join(path, *paths)
    files = glob.glob(fullpath)  # You may use iglob in Python3
    if not files:                # I prefer using the negation
        return None                      # because it behaves like a shortcut
    latest_file = max(files, key=os.path.getctime)
    _, filename = os.path.split(latest_file)
    return filename

Answer 6


I tried to use the above suggestions and my program crashed; then I figured out that the file I was trying to identify was in use, and trying to use ‘os.path.getctime’ on it crashed. What finally worked for me was:

    files_before = glob.glob(os.path.join(my_path,'*'))
    **code where new file is created**
    new_file = set(files_before).symmetric_difference(set(glob.glob(os.path.join(my_path,'*'))))

This code gets the uncommon objects between the two sets of file lists. It’s not the most elegant, and if multiple files are created at the same time it probably won’t be stable.


Answer 7


A much faster method on windows (0.05s), call a bat script that does this:

get_latest.bat

@echo off
for /f %%i in ('dir \\directory\in\question /b/a-d/od/t:c') do set LAST=%%i
%LAST%

where \\directory\in\question is the directory you want to investigate.

get_latest.py

from subprocess import Popen, PIPE
p = Popen("get_latest.bat", shell=True, stdout=PIPE,)
stdout, stderr = p.communicate()
print(stdout, stderr)

If it finds a file, stdout is the path and stderr is None.

Use stdout.decode("utf-8").rstrip() to get the usable string representation of the file name.


Answer 8


I’ve been using this in Python 3, including pattern matching on the filename.

from pathlib import Path

def latest_file(path: Path, pattern: str = "*"):
    files = path.glob(pattern)
    return max(files, key=lambda x: x.stat().st_ctime)
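For example, to get the newest CSV in a folder (note that max() raises ValueError if nothing matches the pattern):

print(latest_file(Path("/path/to/folder"), "*.csv"))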

datetime dtypes in pandas read_csv

Question: datetime dtypes in pandas read_csv


I’m reading in a csv file with multiple datetime columns. I’d need to set the data types upon reading in the file, but datetimes appear to be a problem. For instance:

headers = ['col1', 'col2', 'col3', 'col4']
dtypes = ['datetime', 'datetime', 'str', 'float']
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

When run gives a error:

TypeError: data type “datetime” not understood

Converting columns after the fact, via pandas.to_datetime(), isn’t an option because I can’t know which columns will be datetime objects. That information can change and comes from whatever informs my dtypes list.

Alternatively, I’ve tried to load the csv file with numpy.genfromtxt, set the dtypes in that function, and then convert to a pandas.dataframe but it garbles the data. Any help is greatly appreciated!


Answer 0


Why it does not work

There is no datetime dtype to be set for read_csv as csv files can only contain strings, integers and floats.

Setting a dtype to datetime will make pandas interpret the datetime as an object, meaning you will end up with a string.

Pandas way of solving this

The pandas.read_csv() function has a keyword argument called parse_dates

Using this you can on the fly convert strings, floats or integers into datetimes using the default date_parser (dateutil.parser.parser)

headers = ['col1', 'col2', 'col3', 'col4']
dtypes = {'col1': 'str', 'col2': 'str', 'col3': 'str', 'col4': 'float'}
parse_dates = ['col1', 'col2']
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes, parse_dates=parse_dates)

This will cause pandas to read col1 and col2 as strings, which they most likely are (“2016-05-05” etc.) and after having read the string, the date_parser for each column will act upon that string and give back whatever that function returns.

Defining your own date parsing function:

The pandas.read_csv() function also has a keyword argument called date_parser

Setting this to a lambda function will make that particular function be used for the parsing of the dates.

GOTCHA WARNING

You have to give it the function, not the execution of the function, thus this is Correct

date_parser = pd.datetools.to_datetime

This is incorrect:

date_parser = pd.datetools.to_datetime()

Pandas 0.22 Update

pd.datetools.to_datetime has been relocated; use date_parser = pd.to_datetime instead.

Thanks @stackoverYC
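Since the question says the column/dtype lists are what drive things, one possible adaptation (a sketch that reuses the question's own headers/dtypes lists) is to split the spec into datetime columns and everything else:

headers = ['col1', 'col2', 'col3', 'col4']
dtypes = ['datetime', 'datetime', 'str', 'float']

parse_dates = [h for h, d in zip(headers, dtypes) if d == 'datetime']
dtype = {h: d for h, d in zip(headers, dtypes) if d != 'datetime'}

pd.read_csv(file, sep='\t', header=None, names=headers,
            dtype=dtype, parse_dates=parse_dates)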


Answer 1


There is a parse_dates parameter for read_csv which allows you to define the names of the columns you want treated as dates or datetimes:

date_cols = ['col1', 'col2']
pd.read_csv(file, sep='\t', header=None, names=headers, parse_dates=date_cols)

Answer 2


You might try passing actual types instead of strings.

import pandas as pd
from datetime import datetime
headers = ['col1', 'col2', 'col3', 'col4'] 
dtypes = [datetime, datetime, str, float] 
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

But it’s going to be really hard to diagnose this without any of your data to tinker with.

And really, you probably want pandas to parse the dates into Timestamps, so that might be:

pd.read_csv(file, sep='\t', header=None, names=headers, parse_dates=True)

Answer 3


I tried using the dtypes=[datetime, …] option, but

import pandas as pd
from datetime import datetime
headers = ['col1', 'col2', 'col3', 'col4'] 
dtypes = [datetime, datetime, str, float] 
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

I encountered the following error:

TypeError: data type not understood

The only change I had to make is to replace datetime with datetime.datetime

import pandas as pd
import datetime
headers = ['col1', 'col2', 'col3', 'col4'] 
dtypes = [datetime.datetime, datetime.datetime, str, float] 
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

Merge PDF files

Question: Merge PDF files


Is it possible, using Python, to merge separate PDF files?

Assuming so, I need to extend this a little further. I am hoping to loop through folders in a directory and repeat this procedure.

And I may be pushing my luck, but is it possible to exclude a page that is contained in one of the PDFs (my report generation always creates an extra blank page)?


Answer 0


Use Pypdf or its successor PyPDF2:

A Pure-Python library built as a PDF toolkit. It is capable of:
* splitting documents page by page,
* merging documents page by page,

(and much more)

Here’s a sample program that works with both versions.

#!/usr/bin/env python
import sys
try:
    from PyPDF2 import PdfFileReader, PdfFileWriter
except ImportError:
    from pyPdf import PdfFileReader, PdfFileWriter

def pdf_cat(input_files, output_stream):
    input_streams = []
    try:
        # First open all the files, then produce the output file, and
        # finally close the input files. This is necessary because
        # the data isn't read from the input files until the write
        # operation. Thanks to
        # https://stackoverflow.com/questions/6773631/problem-with-closing-python-pypdf-writing-getting-a-valueerror-i-o-operation/6773733#6773733
        for input_file in input_files:
            input_streams.append(open(input_file, 'rb'))
        writer = PdfFileWriter()
        for reader in map(PdfFileReader, input_streams):
            for n in range(reader.getNumPages()):
                writer.addPage(reader.getPage(n))
        writer.write(output_stream)
    finally:
        for f in input_streams:
            f.close()

if __name__ == '__main__':
    if sys.platform == "win32":
        import os, msvcrt
        msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
    pdf_cat(sys.argv[1:], sys.stdout)

回答 1

您可以使用PyPDF2的PdfFileMerger类。

文件串联

您可以使用append方法简单地串联多个文件:

from PyPDF2 import PdfFileMerger

pdfs = ['file1.pdf', 'file2.pdf', 'file3.pdf', 'file4.pdf']

merger = PdfFileMerger()

for pdf in pdfs:
    merger.append(pdf)

merger.write("result.pdf")
merger.close()

您可以根据需要传递文件句柄而不是文件路径。

文件合并

如果要更精细地控制合并,可以使用PdfFileMerger的merge方法,它允许您在输出文件中指定插入点,也就是说可以把页面插入到输出文件的任何位置。append方法可以看作是插入点位于文件末尾的merge。

例如

merger.merge(2, pdf)

在这里,我们把整个pdf插入到输出文件中,但插入位置是第2页。

页面范围

如果要控制从特定文件追加哪些页面,可以使用append和merge的pages关键字参数,以(start, stop[, step])的形式传入一个元组(类似于内置的range函数)。

例如

merger.append(pdf, pages=(0, 3))    # first 3 pages
merger.append(pdf, pages=(0, 6, 2)) # pages 1,3, 5

如果指定的范围无效,则会抛出IndexError。

注意:为避免文件保持打开状态,应在合并文件写出之后调用PdfFileMerger的close方法,这样可以确保所有文件(输入和输出)都被及时关闭。遗憾的是PdfFileMerger没有实现为上下文管理器,否则我们就可以使用with关键字,省去显式的close调用,并获得一些简单的异常安全保障。

您可能还想看看PyPDF2附带的pdfcat脚本,这样也许完全不需要自己编写代码。

PyPDF2的GitHub仓库中还包含一些演示合并的示例代码。

You can use PyPDF2's PdfFileMerger class.

File Concatenation

You can simply concatenate files by using the append method.

from PyPDF2 import PdfFileMerger

pdfs = ['file1.pdf', 'file2.pdf', 'file3.pdf', 'file4.pdf']

merger = PdfFileMerger()

for pdf in pdfs:
    merger.append(pdf)

merger.write("result.pdf")
merger.close()

You can pass file handles instead of file paths if you want.

File Merging

If you want more fine-grained control over merging, there is a merge method on the PdfFileMerger, which allows you to specify an insertion point in the output file, meaning you can insert the pages anywhere in the file. The append method can be thought of as a merge where the insertion point is the end of the file.

e.g.

merger.merge(2, pdf)

Here we insert the whole pdf into the output but at page 2.

Page Ranges

If you wish to control which pages are appended from a particular file, you can use the pages keyword argument of append and merge, passing a tuple in the form (start, stop[, step]) (like the regular range function).

e.g.

merger.append(pdf, pages=(0, 3))    # first 3 pages
merger.append(pdf, pages=(0, 6, 2)) # pages 1,3, 5

If you specify an invalid range you will get an IndexError.

Note also that, to avoid files being left open, PdfFileMerger's close method should be called once the merged file has been written. This ensures all files (input and output) are closed in a timely manner. It's a shame that PdfFileMerger isn't implemented as a context manager, so that we could use the with keyword, avoid the explicit close call, and get some easy exception safety.
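If you do want with-style cleanup anyway, one workaround (my own sketch, not something the PyPDF2 docs prescribe) is to wrap the merger in contextlib.closing, which simply calls close() when the block exits:

from contextlib import closing
from PyPDF2 import PdfFileMerger

pdfs = ['file1.pdf', 'file2.pdf', 'file3.pdf']

# closing() turns any object with a close() method into a context manager
with closing(PdfFileMerger()) as merger:
    for pdf in pdfs:
        merger.append(pdf)
    merger.write("result.pdf")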

You might also want to look at the pdfcat script provided as part of pypdf2. You can potentially avoid the need to write code altogether.

The PyPdf2 github also includes some example code demonstrating merging.


回答 2

合并目录中存在的所有pdf文件

将pdf文件放在目录中。启动程序。您将合并所有pdf文件,得到一个pdf文件。

import os
from PyPDF2 import PdfFileMerger

x = [a for a in os.listdir() if a.endswith(".pdf")]

merger = PdfFileMerger()

for pdf in x:
    merger.append(open(pdf, 'rb'))

with open("result.pdf", "wb") as fout:
    merger.write(fout)

Merge all pdf files that are present in a dir

Put the pdf files in a dir. Launch the program. You get one pdf with all the pdfs merged.

import os
from PyPDF2 import PdfFileMerger

x = [a for a in os.listdir() if a.endswith(".pdf")]

merger = PdfFileMerger()

for pdf in x:
    merger.append(open(pdf, 'rb'))

with open("result.pdf", "wb") as fout:
    merger.write(fout)

回答 3

假设您不需要保留书签和注释,并且您的PDF没有加密,那么pdfrw库可以非常轻松地做到这一点。cat.py是一个示例串联脚本,subset.py是一个示例页面子集提取脚本。

串联脚本的相关部分-假设inputs是输入文件名列表,并且outfn是输出文件名:

from pdfrw import PdfReader, PdfWriter

writer = PdfWriter()
for inpfn in inputs:
    writer.addpages(PdfReader(inpfn).pages)
writer.write(outfn)

从中可以看出,省去最后一页非常容易,例如:

    writer.addpages(PdfReader(inpfn).pages[:-1])

免责声明:我是pdfrw的主要作者。

The pdfrw library can do this quite easily, assuming you don’t need to preserve bookmarks and annotations, and your PDFs aren’t encrypted. cat.py is an example concatenation script, and subset.py is an example page subsetting script.

The relevant part of the concatenation script — assumes inputs is a list of input filenames, and outfn is an output file name:

from pdfrw import PdfReader, PdfWriter

writer = PdfWriter()
for inpfn in inputs:
    writer.addpages(PdfReader(inpfn).pages)
writer.write(outfn)

As you can see from this, it would be pretty easy to leave out the last page, e.g. something like:

    writer.addpages(PdfReader(inpfn).pages[:-1])
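Tying this back to the original question, a rough sketch of my own (assuming every report really does end with a blank page, and using a hypothetical "reports" folder) that walks the folder and drops each file's last page:

import glob
import os

from pdfrw import PdfReader, PdfWriter

writer = PdfWriter()
for inpfn in sorted(glob.glob(os.path.join("reports", "*.pdf"))):  # "reports" is a placeholder folder
    pages = PdfReader(inpfn).pages
    writer.addpages(pages[:-1])  # skip the trailing blank page of each file
writer.write("combined.pdf")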

Disclaimer: I am the primary pdfrw author.


回答 4

是否可以使用Python合并单独的PDF文件?

是。

以下示例将一个文件夹中的所有文件合并为一个新的PDF文件:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from argparse import ArgumentParser
from glob import glob
from pyPdf import PdfFileReader, PdfFileWriter
import os

def merge(path, output_filename):
    output = PdfFileWriter()

    for pdffile in glob(path + os.sep + '*.pdf'):
        if pdffile == output_filename:
            continue
        print("Parse '%s'" % pdffile)
        document = PdfFileReader(open(pdffile, 'rb'))
        for i in range(document.getNumPages()):
            output.addPage(document.getPage(i))

    print("Start writing '%s'" % output_filename)
    with open(output_filename, "wb") as f:
        output.write(f)

if __name__ == "__main__":
    parser = ArgumentParser()

    # Add more options if you like
    parser.add_argument("-o", "--output",
                        dest="output_filename",
                        default="merged.pdf",
                        help="write merged PDF to FILE",
                        metavar="FILE")
    parser.add_argument("-p", "--path",
                        dest="path",
                        default=".",
                        help="path of source PDF files")

    args = parser.parse_args()
    merge(args.path, args.output_filename)

Is it possible, using Python, to merge separate PDF files?

Yes.

The following example merges all files in one folder to a single new PDF file:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from argparse import ArgumentParser
from glob import glob
from pyPdf import PdfFileReader, PdfFileWriter
import os

def merge(path, output_filename):
    output = PdfFileWriter()

    for pdffile in glob(path + os.sep + '*.pdf'):
        if pdffile == output_filename:
            continue
        print("Parse '%s'" % pdffile)
        document = PdfFileReader(open(pdffile, 'rb'))
        for i in range(document.getNumPages()):
            output.addPage(document.getPage(i))

    print("Start writing '%s'" % output_filename)
    with open(output_filename, "wb") as f:
        output.write(f)

if __name__ == "__main__":
    parser = ArgumentParser()

    # Add more options if you like
    parser.add_argument("-o", "--output",
                        dest="output_filename",
                        default="merged.pdf",
                        help="write merged PDF to FILE",
                        metavar="FILE")
    parser.add_argument("-p", "--path",
                        dest="path",
                        default=".",
                        help="path of source PDF files")

    args = parser.parse_args()
    merge(args.path, args.output_filename)

回答 5

from PyPDF2 import PdfFileMerger
import webbrowser
import os
dir_path = os.path.dirname(os.path.realpath(__file__))

def list_files(directory, extension):
    return (f for f in os.listdir(directory) if f.endswith('.' + extension))

pdfs = list_files(dir_path, "pdf")

merger = PdfFileMerger()

for pdf in pdfs:
    merger.append(open(pdf, 'rb'))

with open('result.pdf', 'wb') as fout:
    merger.write(fout)

webbrowser.open_new('file://'+ dir_path + '/result.pdf')

Git仓库:https://github.com/mahaguru24/Python_Merge_PDF.git

from PyPDF2 import PdfFileMerger
import webbrowser
import os
dir_path = os.path.dirname(os.path.realpath(__file__))

def list_files(directory, extension):
    return (f for f in os.listdir(directory) if f.endswith('.' + extension))

pdfs = list_files(dir_path, "pdf")

merger = PdfFileMerger()

for pdf in pdfs:
    merger.append(open(pdf, 'rb'))

with open('result.pdf', 'wb') as fout:
    merger.write(fout)

webbrowser.open_new('file://'+ dir_path + '/result.pdf')

Git Repo: https://github.com/mahaguru24/Python_Merge_PDF.git


回答 6

在这里,http://pieceofpy.com/2009/03/05/concatenating-pdf-with-python/提供了解决方案。

类似地:

from pyPdf import PdfFileWriter, PdfFileReader

def append_pdf(input, output):
    # copy every page from the reader into the writer
    for page_num in range(input.numPages):
        output.addPage(input.getPage(page_num))

output = PdfFileWriter()

append_pdf(PdfFileReader(open("C:\\sample.pdf", "rb")), output)
append_pdf(PdfFileReader(open("c:\\sample1.pdf", "rb")), output)
append_pdf(PdfFileReader(open("c:\\sample2.pdf", "rb")), output)
append_pdf(PdfFileReader(open("c:\\sample3.pdf", "rb")), output)

with open("c:\\combined.pdf", "wb") as f:
    output.write(f)

here, http://pieceofpy.com/2009/03/05/concatenating-pdf-with-python/ gives a solution.

similarly:

from pyPdf import PdfFileWriter, PdfFileReader

def append_pdf(input, output):
    # copy every page from the reader into the writer
    for page_num in range(input.numPages):
        output.addPage(input.getPage(page_num))

output = PdfFileWriter()

append_pdf(PdfFileReader(open("C:\\sample.pdf", "rb")), output)
append_pdf(PdfFileReader(open("c:\\sample1.pdf", "rb")), output)
append_pdf(PdfFileReader(open("c:\\sample2.pdf", "rb")), output)
append_pdf(PdfFileReader(open("c:\\sample3.pdf", "rb")), output)

with open("c:\\combined.pdf", "wb") as f:
    output.write(f)

回答 7

使用字典进行一些细微的改动以获得更大的灵活性(例如,sort,dedup):

import os
from PyPDF2 import PdfFileMerger
# use dict to sort by filepath or filename
file_dict = {}
for subdir, dirs, files in os.walk("<dir>"):
    for file in files:
        filepath = subdir + os.sep + file
        # you can have multiple endswith
        if filepath.endswith((".pdf", ".PDF")):
            file_dict[file] = filepath
# use strict = False to ignore PdfReadError: Illegal character error
merger = PdfFileMerger(strict=False)

for k, v in file_dict.items():
    print(k, v)
    merger.append(v)

merger.write("combined_result.pdf")

A slight variation using a dictionary for greater flexibility (e.g. sort, dedup):

import os
from PyPDF2 import PdfFileMerger
# use dict to sort by filepath or filename
file_dict = {}
for subdir, dirs, files in os.walk("<dir>"):
    for file in files:
        filepath = subdir + os.sep + file
        # you can have multiple endswith
        if filepath.endswith((".pdf", ".PDF")):
            file_dict[file] = filepath
# use strict = False to ignore PdfReadError: Illegal character error
merger = PdfFileMerger(strict=False)

for k, v in file_dict.items():
    print(k, v)
    merger.append(v)

merger.write("combined_result.pdf")

回答 8

我通过subprocess在Linux终端上调用pdfunite(假设当前目录中已存在one.pdf和two.pdf),目的是将它们合并为three.pdf:

import subprocess

# pass the command as an argument list; pdfunite's last argument is the output file
subprocess.call(['pdfunite', 'one.pdf', 'two.pdf', 'three.pdf'])

I used pdfunite on the Linux terminal by calling it through subprocess (this assumes one.pdf and two.pdf exist in the current directory), with the aim of merging them into three.pdf:

import subprocess

# pass the command as an argument list; pdfunite's last argument is the output file
subprocess.call(['pdfunite', 'one.pdf', 'two.pdf', 'three.pdf'])
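To cover the loop-over-a-directory part of the original question as well, a hedged sketch of my own (assuming pdfunite from poppler-utils is installed; merged.pdf is just a placeholder output name):

import glob
import subprocess

# pdfunite usage: pdfunite input1.pdf input2.pdf ... output.pdf
inputs = [p for p in sorted(glob.glob('*.pdf')) if p != 'merged.pdf']
subprocess.check_call(['pdfunite'] + inputs + ['merged.pdf'])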

python numpy ValueError:操作数不能与形状一起广播

问题:python numpy ValueError:操作数不能与形状一起广播

在numpy中,我有两个“数组”:X是(m,n)的,y是一个(n,1)的向量。

使用

X*y

我收到错误

ValueError: operands could not be broadcast together with shapes (97,2) (2,1) 

而(97,2)x(2,1)显然是合法的矩阵运算,应该得到一个(97,1)的向量。

编辑:

我已经用X.dot(y)解决了这个报错,但最初的疑问仍然存在。

In numpy, I have two “arrays”, X is (m,n) and y is a vector (n,1)

using

X*y

I am getting the error

ValueError: operands could not be broadcast together with shapes (97,2) (2,1) 

When (97,2)x(2,1) is clearly a legal matrix operation and should give me a (97,1) vector

EDIT:

I have corrected this using X.dot(y) but the original question still remains.


回答 0

dot是矩阵乘法,而*做的是别的事情。

我们有两个数组:

  • X,形状(97,2)
  • y,形状(2,1)

使用Numpy数组时,该操作

X * y

是按元素逐个进行的,但其中一个或两个数组可以在一个或多个维度上被扩展,以使它们的形状兼容。这种操作称为广播。大小为1的维度或缺失的维度可以参与广播。

在上面的示例中,尺寸不兼容,因为:

97   2
 2   1

此处第一维中的数字是冲突的(97和2),这就是上面的ValueError所抱怨的。第二个维度没有问题,因为数字1不会与任何东西冲突。

有关广播规则的更多信息,请访问:http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html

(请注意,如果X和y的类型是numpy.matrix,那么星号就可以用作矩阵乘法。我的建议是远离numpy.matrix,它往往会让事情更复杂而不是更简单。)

您的数组用numpy.dot应该没有问题;如果您在numpy.dot上遇到错误,那一定是别的bug。如果形状对numpy.dot来说不合适,您会得到另一个异常:

ValueError: matrices are not aligned

如果仍然出现此错误,请贴出一个能重现问题的最小示例。用与您的数组形状相同的数组做这样的乘法是可以成功的:

In [1]: import numpy

In [2]: numpy.dot(numpy.ones([97, 2]), numpy.ones([2, 1])).shape
Out[2]: (97, 1)

dot is matrix multiplication, but * does something else.

We have two arrays:

  • X, shape (97,2)
  • y, shape (2,1)

With Numpy arrays, the operation

X * y

is done element-wise, but one or both of the values can be expanded in one or more dimensions to make them compatible. This operation is called broadcasting. Dimensions where the size is 1, or which are missing, can be used in broadcasting.

In the example above the dimensions are incompatible, because:

97   2
 2   1

Here there are conflicting numbers in the first dimension (97 and 2). That is what the ValueError above is complaining about. The second dimension would be ok, as number 1 does not conflict with anything.

For more information on broadcasting rules: http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html

(Please note that if X and y are of type numpy.matrix, then asterisk can be used as matrix multiplication. My recommendation is to keep away from numpy.matrix, it tends to complicate more than simplify things.)

Your arrays should be fine with numpy.dot; if you get an error on numpy.dot, you must have some other bug. If the shapes are wrong for numpy.dot, you get a different exception:

ValueError: matrices are not aligned

If you still get this error, please post a minimal example of the problem. An example multiplication with arrays shaped like yours succeeds:

In [1]: import numpy

In [2]: numpy.dot(numpy.ones([97, 2]), numpy.ones([2, 1])).shape
Out[2]: (97, 1)

回答 1

根据numpy文档:

对两个数组进行运算时,NumPy会逐元素地比较它们的形状,从最后一个维度开始,依次向前。满足以下任一条件时,两个维度是兼容的:

  • 它们相等,或者
  • 其中之一是1

换句话说,如果您想做(线性代数意义上的)矩阵乘法,那么您需要X.dot(y);但如果您想把矩阵y中的标量广播到X上,则需要执行X * y.T。

例:

>>> import numpy as np
>>>
>>> X = np.arange(8).reshape(4, 2)
>>> y = np.arange(2).reshape(1, 2)  # create a 1x2 matrix
>>> X * y
array([[0,1],
       [0,3],
       [0,5],
       [0,7]])

Per numpy docs:

When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when:

  • they are equal, or
  • one of them is 1

In other words, if you are trying to multiply two matrices (in the linear algebra sense) then you want X.dot(y) but if you are trying to broadcast scalars from matrix y onto X then you need to perform X * y.T.

Example:

>>> import numpy as np
>>>
>>> X = np.arange(8).reshape(4, 2)
>>> y = np.arange(2).reshape(1, 2)  # create a 1x2 matrix
>>> X * y
array([[0,1],
       [0,3],
       [0,5],
       [0,7]])
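Continuing the interactive session above, here is the same comparison with the (2,1)-shaped y from the question (my own addition), so the dot product and the transposed broadcast sit side by side:

>>> y = np.arange(2).reshape(2, 1)   # shape (2, 1), as in the question
>>> X.dot(y)                         # matrix product, shape (4, 1)
array([[1],
       [3],
       [5],
       [7]])
>>> X * y.T                          # the (1, 2) row is broadcast over each row of X
array([[0, 1],
       [0, 3],
       [0, 5],
       [0, 7]])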

回答 2

该错误可能不是在点积中发生的,而是在此之后发生的。例如试试这个

a = np.random.randn(12,1)
b = np.random.randn(1,5)
c = np.random.randn(5,12)
d = np.dot(a,b) * c

np.dot(a,b)没有问题;但np.dot(a,b) * c显然是错的(12×1乘1×5得到12×5,它无法与5×12逐元素相乘),而numpy会给出

ValueError: operands could not be broadcast together with shapes (12,1) (1,5)

这个错误信息有误导性,但那一行确实存在问题。

It’s possible that the error didn’t occur in the dot product, but after. For example try this

a = np.random.randn(12,1)
b = np.random.randn(1,5)
c = np.random.randn(5,12)
d = np.dot(a,b) * c

np.dot(a,b) will be fine; however np.dot(a, b) * c is clearly wrong (12×1 times 1×5 gives 12×5, which cannot be element-wise multiplied with 5×12), but numpy will give you

ValueError: operands could not be broadcast together with shapes (12,1) (1,5)

The error is misleading; however there is an issue on that line.


回答 3

使用np.mat(x) * np.mat(y),它将起作用。

Use np.mat(x) * np.mat(y), that’ll work.
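A quick interactive check of that, using arrays shaped like the ones in the question (my own example; note that an earlier answer above recommends keeping away from numpy.matrix):

>>> import numpy as np
>>> X = np.ones((97, 2))
>>> y = np.ones((2, 1))
>>> (np.mat(X) * np.mat(y)).shape   # * is matrix multiplication for np.matrix objects
(97, 1)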


回答 4

您正在寻找np.matmul(X, y)。在Python 3.5+中,您可以使用X @ y

You are looking for np.matmul(X, y). In Python 3.5+ you can use X @ y.
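A minimal sketch of both spellings (my own example), again with arrays shaped like the ones in the question:

>>> import numpy as np
>>> X = np.ones((97, 2))
>>> y = np.ones((2, 1))
>>> np.matmul(X, y).shape
(97, 1)
>>> (X @ y).shape                   # the @ operator, available in Python 3.5+
(97, 1)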


回答 5

我们可能会误以为a * b是点积。

但实际上,它是广播。

点积: a.dot(b)

广播:

广播一词指的是numpy在算术运算中如何处理维度不同的数组:在满足一定约束的前提下,较小的数组会在较大的数组上进行广播,使它们具有兼容的形状。

(m,n) +-/* (1,n) → (m,n):该操作将应用于m行

We might mistakenly assume that a * b is a dot product.

But in fact, it is broadcast.

Dot Product : a.dot(b)

Broadcast:

The term broadcasting refers to how numpy treats arrays with different dimensions during arithmetic operations: subject to certain constraints, the smaller array is broadcast across the larger array so that they have compatible shapes.

(m,n) +-/* (1,n) → (m,n) : the operation will be applied to m rows
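A tiny demonstration of that rule (my own example), where the single row is applied to each of the m rows:

>>> import numpy as np
>>> a = np.arange(6).reshape(3, 2)   # shape (3, 2)
>>> b = np.array([[10, 100]])        # shape (1, 2)
>>> a * b                            # the (1, 2) row is broadcast over all 3 rows
array([[  0, 100],
       [ 20, 300],
       [ 40, 500]])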

