如何在python中使用Selenium Webdriver滚动网页?

问题:如何在python中使用Selenium Webdriver滚动网页?

我目前正在使用Selenium Webdriver通过Facebook用户朋友页面进行解析,并从AJAX脚本中提取所有ID。但是我需要向下滚动才能得到所有的朋友。如何在Selenium中向下滚动。我正在使用python。

I am currently using selenium webdriver to parse through facebook user friends page and extract all ids from the AJAX script. But I need to scroll down to get all the friends. How can I scroll down in Selenium. I am using python.


回答 0

您可以使用

driver.execute_script("window.scrollTo(0, Y)") 

其中Y是高度(在全高清显示器上为1080)。(感谢@lukeis)

您也可以使用

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

滚动到页面底部。

如果您想滚动到无限加载的页面,例如社交网络页面,facebook等(感谢@Cuong Tran)

SCROLL_PAUSE_TIME = 0.5

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

另一种方法(感谢Juanse)是,选择一个对象,然后

label.sendKeys(Keys.PAGE_DOWN);

You can use

driver.execute_script("window.scrollTo(0, Y)") 

where Y is the height (on a fullhd monitor it’s 1080). (Thanks to @lukeis)

You can also use

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

to scroll to the bottom of the page.

If you want to scroll to a page with infinite loading, like social network ones, facebook etc. (thanks to @Cuong Tran)

SCROLL_PAUSE_TIME = 0.5

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

another method (thanks to Juanse) is, select an object and

label.sendKeys(Keys.PAGE_DOWN);

回答 1

如果要向下滚动到无限页面的底部(例如linkedin.com),可以使用以下代码:

SCROLL_PAUSE_TIME = 0.5

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

参考:https : //stackoverflow.com/a/28928684/1316860

If you want to scroll down to bottom of infinite page (like linkedin.com), you can use this code:

SCROLL_PAUSE_TIME = 0.5

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

Reference: https://stackoverflow.com/a/28928684/1316860


回答 2

您可以send_keys用来模拟END(或PAGE_DOWN)按键(通常会滚动页面):

from selenium.webdriver.common.keys import Keys
html = driver.find_element_by_tag_name('html')
html.send_keys(Keys.END)

You can use send_keys to simulate an END (or PAGE_DOWN) key press (which normally scroll the page):

from selenium.webdriver.common.keys import Keys
html = driver.find_element_by_tag_name('html')
html.send_keys(Keys.END)

回答 3

如图相同的方法在这里

在python中,您可以使用

driver.execute_script("window.scrollTo(0, Y)")

(Y是您要滚动到的垂直位置)

same method as shown here:

in python you can just use

driver.execute_script("window.scrollTo(0, Y)")

(Y is the vertical position you want to scroll to)


回答 4

element=find_element_by_xpath("xpath of the li you are trying to access")

element.location_once_scrolled_into_view

当我尝试访问不可见的“ li”时,这很有帮助。

element=find_element_by_xpath("xpath of the li you are trying to access")

element.location_once_scrolled_into_view

this helped when I was trying to access a ‘li’ that was not visible.


回答 5

出于我的目的,我想向下滚动更多,同时牢记窗口的位置。我的解决方案是相似的,并使用window.scrollY

driver.execute_script("window.scrollTo(0, window.scrollY + 200)")

它将转到当前的y滚动位置+ 200

For my purpose, I wanted to scroll down more, keeping the windows position in mind. My solution was similar and used window.scrollY

driver.execute_script("window.scrollTo(0, window.scrollY + 200)")

which will go to the current y scroll position + 200


回答 6

这是您向下滚动网页的方式:

driver.execute_script("window.scrollTo(0, 1000);")

This is how you scroll down the webpage:

driver.execute_script("window.scrollTo(0, 1000);")

回答 7

我发现解决该问题的最简单方法是选择一个标签,然后发送:

label.sendKeys(Keys.PAGE_DOWN);

希望它能起作用!

The easiest way i found to solve that problem was to select a label and then send:

label.sendKeys(Keys.PAGE_DOWN);

Hope it works!


回答 8

这些答案都不适合我,至少不是向下滚动Facebook搜索结果页面有效,但经过大量测试,我发现此解决方案:

while driver.find_element_by_tag_name('div'):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    Divs=driver.find_element_by_tag_name('div').text
    if 'End of Results' in Divs:
        print 'end'
        break
    else:
        continue

None of these answers worked for me, at least not for scrolling down a facebook search result page, but I found after a lot of testing this solution:

while driver.find_element_by_tag_name('div'):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    Divs=driver.find_element_by_tag_name('div').text
    if 'End of Results' in Divs:
        print 'end'
        break
    else:
        continue

回答 9

使用youtube时,浮动元素的滚动高度为“ 0”,因此请不要使用“ return document.body.scrollHeight”,而是尝试使用此“ return document.documentElement.scrollHeight” ,根据您的互联网调整滚动暂停时间速度,否则它将只运行一次,然后在此之后中断。

SCROLL_PAUSE_TIME = 1

# Get scroll height
"""last_height = driver.execute_script("return document.body.scrollHeight")

this dowsnt work due to floating web elements on youtube
"""

last_height = driver.execute_script("return document.documentElement.scrollHeight")
while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0,document.documentElement.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.documentElement.scrollHeight")
    if new_height == last_height:
       print("break")
       break
    last_height = new_height

When working with youtube the floating elements give the value “0” as the scroll height so rather than using “return document.body.scrollHeight” try using this one “return document.documentElement.scrollHeight” adjust the scroll pause time as per your internet speed else it will run for only one time and then breaks after that.

SCROLL_PAUSE_TIME = 1

# Get scroll height
"""last_height = driver.execute_script("return document.body.scrollHeight")

this dowsnt work due to floating web elements on youtube
"""

last_height = driver.execute_script("return document.documentElement.scrollHeight")
while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0,document.documentElement.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.documentElement.scrollHeight")
    if new_height == last_height:
       print("break")
       break
    last_height = new_height

回答 10

我正在寻找一种滚动浏览动态网页的方法,并在到达页面末尾并发现该线程时自动停止。

@Cuong Tran的帖子进行了主要修改,是我正在寻找的答案。我认为其他人可能会发现此修改很有用(它对代码的工作方式有明显影响),因此,本文发布了。

修改是移动捕获循环最后一页高度的语句(以便使每项检查都与上一页高度进行比较)。

因此,下面的代码:

连续向下滚动动态网页(.scrollTo()),仅在一次迭代中页面高度保持不变时停止。

(还有另一种修改,其中break语句位于另一个可以删除的条件内(如果页面为“ sticks”)。

    SCROLL_PAUSE_TIME = 0.5


    while True:

        # Get scroll height
        ### This is the difference. Moving this *inside* the loop
        ### means that it checks if scrollTo is still scrolling 
        last_height = driver.execute_script("return document.body.scrollHeight")

        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:

            # try again (can be removed)
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

            # Wait to load page
            time.sleep(SCROLL_PAUSE_TIME)

            # Calculate new scroll height and compare with last scroll height
            new_height = driver.execute_script("return document.body.scrollHeight")

            # check if the page height has remained the same
            if new_height == last_height:
                # if so, you are done
                break
            # if not, move on to the next loop
            else:
                last_height = new_height
                continue

I was looking for a way of scrolling through a dynamic webpage, and automatically stopping once the end of the page is reached, and found this thread.

The post by @Cuong Tran, with one main modification, was the answer that I was looking for. I thought that others might find the modification helpful (it has a pronounced effect on how the code works), hence this post.

The modification is to move the statement that captures the last page height inside the loop (so that each check is comparing to the previous page height).

So, the code below:

Continuously scrolls down a dynamic webpage (.scrollTo()), only stopping when, for one iteration, the page height stays the same.

(There is another modification, where the break statement is inside another condition (in case the page ‘sticks’) which can be removed).

    SCROLL_PAUSE_TIME = 0.5


    while True:

        # Get scroll height
        ### This is the difference. Moving this *inside* the loop
        ### means that it checks if scrollTo is still scrolling 
        last_height = driver.execute_script("return document.body.scrollHeight")

        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:

            # try again (can be removed)
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

            # Wait to load page
            time.sleep(SCROLL_PAUSE_TIME)

            # Calculate new scroll height and compare with last scroll height
            new_height = driver.execute_script("return document.body.scrollHeight")

            # check if the page height has remained the same
            if new_height == last_height:
                # if so, you are done
                break
            # if not, move on to the next loop
            else:
                last_height = new_height
                continue

回答 11

该代码滚动到底部,但不需要您每次都等待。它会不断滚动,然后在底部停止(或超时)

from selenium import webdriver
import time

driver = webdriver.Chrome(executable_path='chromedriver.exe')
driver.get('https://example.com')

pre_scroll_height = driver.execute_script('return document.body.scrollHeight;')
run_time, max_run_time = 0, 1
while True:
    iteration_start = time.time()
    # Scroll webpage, the 100 allows for a more 'aggressive' scroll
    driver.execute_script('window.scrollTo(0, 100*document.body.scrollHeight);')

    post_scroll_height = driver.execute_script('return document.body.scrollHeight;')

    scrolled = post_scroll_height != pre_scroll_height
    timed_out = run_time >= max_run_time

    if scrolled:
        run_time = 0
        pre_scroll_height = post_scroll_height
    elif not scrolled and not timed_out:
        run_time += time.time() - iteration_start
    elif not scrolled and timed_out:
        break

# closing the driver is optional 
driver.close()

这比每次等待0.5-3秒等待响应要快得多,因为该响应可能需要0.1秒

This code scrolls to the bottom but doesn’t require that you wait each time. It’ll continually scroll, and then stop at the bottom (or timeout)

from selenium import webdriver
import time

driver = webdriver.Chrome(executable_path='chromedriver.exe')
driver.get('https://example.com')

pre_scroll_height = driver.execute_script('return document.body.scrollHeight;')
run_time, max_run_time = 0, 1
while True:
    iteration_start = time.time()
    # Scroll webpage, the 100 allows for a more 'aggressive' scroll
    driver.execute_script('window.scrollTo(0, 100*document.body.scrollHeight);')

    post_scroll_height = driver.execute_script('return document.body.scrollHeight;')

    scrolled = post_scroll_height != pre_scroll_height
    timed_out = run_time >= max_run_time

    if scrolled:
        run_time = 0
        pre_scroll_height = post_scroll_height
    elif not scrolled and not timed_out:
        run_time += time.time() - iteration_start
    elif not scrolled and timed_out:
        break

# closing the driver is optional 
driver.close()

This is much faster than waiting 0.5-3 seconds each time for a response, when that response could take 0.1 seconds


回答 12

滚动加载页面。示例:中,定额等

last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight-1000);")
        # Wait to load the page.
        driver.implicitly_wait(30) # seconds
        new_height = driver.execute_script("return document.body.scrollHeight")
    
        if new_height == last_height:
            break
        last_height = new_height
        # sleep for 30s
        driver.implicitly_wait(30) # seconds
    driver.quit()

scroll loading pages. Example: medium, quora,etc

last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight-1000);")
        # Wait to load the page.
        driver.implicitly_wait(30) # seconds
        new_height = driver.execute_script("return document.body.scrollHeight")
    
        if new_height == last_height:
            break
        last_height = new_height
        # sleep for 30s
        driver.implicitly_wait(30) # seconds
    driver.quit()

回答 13

如果要在特定视图/框架(WebElement)中滚动,则只需将“ body”替换为要在其中滚动的特定元素。我在下面的示例中通过“ getElementById”获得该元素:

self.driver.execute_script('window.scrollTo(0, document.getElementById("page-manager").scrollHeight);')

例如,在YouTube上就是这种情况。

if you want to scroll within a particular view/frame (WebElement), what you only need to do is to replace “body” with a particular element that you intend to scroll within. i get that element via “getElementById” in the example below:

self.driver.execute_script('window.scrollTo(0, document.getElementById("page-manager").scrollHeight);')

this is the case on YouTube, for example…


回答 14

ScrollTo()功能不再起作用。这是我使用的,效果很好。

driver.execute_script("document.getElementById('mydiv').scrollIntoView();")

The ScrollTo() function doesn’t work anymore. This is what I used and it worked fine.

driver.execute_script("document.getElementById('mydiv').scrollIntoView();")

回答 15

driver.execute_script("document.getElementById('your ID Element').scrollIntoView();")

它适合我的情况。

driver.execute_script("document.getElementById('your ID Element').scrollIntoView();")

it’s working for my case.


我可以使用__init__.py定义全局变量吗?

问题:我可以使用__init__.py定义全局变量吗?

我想定义一个常量,该常量应在包的所有子模块中可用。我以为最好的地方__init__.py在根包的文件中。但是我不知道该怎么做。假设我有几个子包,每个子包都有几个模块。如何从这些模块访问该变量?

当然,如果这是完全错误的,并且有更好的选择,我想知道。

I want to define a constant that should be available in all of the submodules of a package. I’ve thought that the best place would be in in the __init__.py file of the root package. But I don’t know how to do this. Suppose I have a few subpackages and each with several modules. How can I access that variable from these modules?

Of course, if this is totally wrong, and there is a better alternative, I’d like to know it.


回答 0

您应该能够将它们放入__init__.py。这一直都在做。

mypackage/__init__.py

MY_CONSTANT = 42

mypackage/mymodule.py

from mypackage import MY_CONSTANT
print "my constant is", MY_CONSTANT

然后,导入mymodule:

>>> from mypackage import mymodule
my constant is 42

不过,如果您确实有常量,将它们放在单独的模块(constants.py,config.py,…)中,然后将其放入包命名空间中是合理的(可能是最佳做法),然后导入他们。

mypackage/__init__.py

from mypackage.constants import *

尽管如此,这并不会自动在包模块的命名空间中包含常量。包中的每个模块仍然必须从mypackage或从中显式导入常量mypackage.constants

You should be able to put them in __init__.py. This is done all the time.

mypackage/__init__.py:

MY_CONSTANT = 42

mypackage/mymodule.py:

from mypackage import MY_CONSTANT
print "my constant is", MY_CONSTANT

Then, import mymodule:

>>> from mypackage import mymodule
my constant is 42

Still, if you do have constants, it would be reasonable (best practices, probably) to put them in a separate module (constants.py, config.py, …) and then if you want them in the package namespace, import them.

mypackage/__init__.py:

from mypackage.constants import *

Still, this doesn’t automatically include the constants in the namespaces of the package modules. Each of the modules in the package will still have to import constants explicitly either from mypackage or from mypackage.constants.


回答 1

你不能这样做。您必须将常量明确地导入每个模块的命名空间中。实现此目的的最佳方法是在“ config”模块中定义常量,并将其导入所需的任何位置:

# mypackage/config.py
MY_CONST = 17

# mypackage/main.py
from mypackage.config import *

You cannot do that. You will have to explicitely import your constants into each individual module’s namespace. The best way to achieve this is to define your constants in a “config” module and import it everywhere you require it:

# mypackage/config.py
MY_CONST = 17

# mypackage/main.py
from mypackage.config import *

回答 2

您可以在任何地方定义全局变量,但这是一个非常糟糕的主意。导入__builtin__模块并修改或向该模块添加属性,突然间您有了新的内置常量或函数。实际上,当我的应用程序安装gettext时,我在所有模块中都获得了_()函数,而无需导入任何内容。因此这是可能的,但当然仅适用于应用程序类型的项目,而不适用于可重用的包或模块。

而且我想没人会推荐这种做法。命名空间有什么问题?说应用程序版本模块,这样我可以有像“全局”变量version.VERSIONversion.PACKAGE_NAME等等。

You can define global variables from anywhere, but it is a really bad idea. import the __builtin__ module and modify or add attributes to this modules, and suddenly you have new builtin constants or functions. In fact, when my application installs gettext, I get the _() function in all my modules, without importing anything. So this is possible, but of course only for Application-type projects, not for reusable packages or modules.

And I guess no one would recommend this practice anyway. What’s wrong with a namespace? Said application has the version module, so that I have “global” variables available like version.VERSION, version.PACKAGE_NAME etc.


回答 3

只是想补充一点,可以使用config.ini文件使用常量,并使用configparser库在脚本中对其进行解析。这样,您可以在多种情况下使用常量。例如,如果您有两个单独的url请求的参数常量,只需将它们标记为:

mymodule/config.ini
[request0]
conn = 'admin@localhost'
pass = 'admin'
...

[request1]
conn = 'barney@localhost'
pass = 'dinosaur'
...

我发现Python网站上的文档非常有帮助。我不确定Python 2和3之间是否有任何区别,因此这是两者的链接:

对于Python 3:https//docs.python.org/3/library/configparser.html#module-configparser

对于Python 2:https//docs.python.org/2/library/configparser.html#module-configparser

Just wanted to add that constants can be employed using a config.ini file and parsed in the script using the configparser library. This way you could have constants for multiple circumstances. For instance if you had parameter constants for two separate url requests just label them like so:

mymodule/config.ini
[request0]
conn = 'admin@localhost'
pass = 'admin'
...

[request1]
conn = 'barney@localhost'
pass = 'dinosaur'
...

I found the documentation on the Python website very helpful. I am not sure if there are any differences between Python 2 and 3 so here are the links to both:

For Python 3: https://docs.python.org/3/library/configparser.html#module-configparser

For Python 2: https://docs.python.org/2/library/configparser.html#module-configparser


检查环境变量是否存在的良好实践是什么?

问题:检查环境变量是否存在的良好实践是什么?

我想检查我的环境中是否存在"FOO"Python 中的变量。为此,我正在使用os标准库。阅读图书馆的文档后,我想出了两种实现目标的方法:

方法1:

if "FOO" in os.environ:
    pass

方法2:

if os.getenv("FOO") is not None:
    pass

我想知道哪种方法是好的/首选条件,以及为什么。

I want to check my environment for the existence of a variable, say "FOO", in Python. For this purpose, I am using the os standard library. After reading the library’s documentation, I have figured out 2 ways to achieve my goal:

Method 1:

if "FOO" in os.environ:
    pass

Method 2:

if os.getenv("FOO") is not None:
    pass

I would like to know which method, if either, is a good/preferred conditional and why.


回答 0

使用第一个;它直接尝试检查是否在中定义了某些内容environ。尽管第二种形式同样可以很好地工作,但是它在语义上是不足的,因为如果存在,您会得到一个返回的值,并且将其用于比较。

你想看看是否有存在 environ,为什么你会得到只是为了进行比较,然后折腾它扔掉

那正是这样getenv做的:

获取一个环境变量None如果不存在则返回。可选的第二个参数可以指定备用默认值。

(这也意味着您的支票可能只是if getenv("FOO")

你不想得到它,你想检查它的存在。

无论哪种方式,getenv都只是一个包装,environ.get但是您看不到有人通过以下方式检查映射中的成员身份:

from os import environ
if environ.get('Foo') is not None:

总结一下,使用:

if "FOO" in os.environ:
    pass

如果您只想检查是否存在,请使用,getenv("FOO")如果您确实想用可能获得的价值做某事。

Use the first; it directly tries to check if something is defined in environ. Though the second form works equally well, it’s lacking semantically since you get a value back if it exists and only use it for a comparison.

You’re trying to see if something is present in environ, why would you get just to compare it and then toss it away?

That’s exactly what getenv does:

Get an environment variable, return None if it doesn’t exist. The optional second argument can specify an alternate default.

(this also means your check could just be if getenv("FOO"))

you don’t want to get it, you want to check for it’s existence.

Either way, getenv is just a wrapper around environ.get but you don’t see people checking for membership in mappings with:

from os import environ
if environ.get('Foo') is not None:

To summarize, use:

if "FOO" in os.environ:
    pass

if you just want to check for existence, while, use getenv("FOO") if you actually want to do something with the value you might get.


回答 1

两种解决方案都有一种情况,这取决于您要根据环境变量的存在来执行什么操作。

情况1

如果您想纯粹基于环境变量的存在而采取不同的措施而又不关心其价值,那么第一个解决方案就是最佳实践。它简要描述了您要测试的内容:环境变量列表中的’FOO’。

if 'KITTEN_ALLERGY' in os.environ:
    buy_puppy()
else:
    buy_kitten()

情况二

如果您想在环境变量中未定义该值的情况下设置默认值,则第二个解决方案实际上很有用,尽管它不是您编写的形式:

server = os.getenv('MY_CAT_STREAMS', 'youtube.com')

也许

server = os.environ.get('MY_CAT_STREAMS', 'youtube.com')

请注意,如果您的应用程序有多个选项,则可能需要查看ChainMap,它允许根据键合并多个字典。ChainMap文档中有一个示例:

[...]
combined = ChainMap(command_line_args, os.environ, defaults)

There is a case for either solution, depending on what you want to do conditional on the existence of the environment variable.

Case 1

When you want to take different actions purely based on the existence of the environment variable, without caring for its value, the first solution is the best practice. It succinctly describes what you test for: is ‘FOO’ in the list of environment variables.

if 'KITTEN_ALLERGY' in os.environ:
    buy_puppy()
else:
    buy_kitten()

Case 2

When you want to set a default value if the value is not defined in the environment variables the second solution is actually useful, though not in the form you wrote it:

server = os.getenv('MY_CAT_STREAMS', 'youtube.com')

or perhaps

server = os.environ.get('MY_CAT_STREAMS', 'youtube.com')

Note that if you have several options for your application you might want to look into ChainMap, which allows to merge multiple dicts based on keys. There is an example of this in the ChainMap documentation:

[...]
combined = ChainMap(command_line_args, os.environ, defaults)

回答 2

为了安全起见

os.getenv('FOO') or 'bar'

上述答案的一个极端情况是设置了环境变量但为空

对于这种特殊情况,您会得到

print(os.getenv('FOO', 'bar'))
# prints new line - though you expected `bar`

要么

if "FOO" in os.environ:
    print("FOO is here")
# prints FOO is here - however its not

为了避免这种情况,只需使用 or

os.getenv('FOO') or 'bar'

然后你得到

print(os.getenv('FOO') or 'bar')
# bar

什么时候有空的环境变量?

您忘记在.env文件中设置值

# .env
FOO=

或导出为

$ export FOO=

或忘记设置它 settings.py

# settings.py
os.environ['FOO'] = ''

更新:如果有疑问,请查看这些单线

>>> import os; os.environ['FOO'] = ''; print(os.getenv('FOO', 'bar'))

$ FOO= python -c "import os; print(os.getenv('FOO', 'bar'))"

To be on the safe side use

os.getenv('FOO') or 'bar'

A corner case with the above answers is when the environment variable is set but is empty

For this special case you get

print(os.getenv('FOO', 'bar'))
# prints new line - though you expected `bar`

or

if "FOO" in os.environ:
    print("FOO is here")
# prints FOO is here - however its not

To avoid this just use or

os.getenv('FOO') or 'bar'

Then you get

print(os.getenv('FOO') or 'bar')
# bar

When do we have empty environment variables?

You forgot to set the value in the .env file

# .env
FOO=

or exported as

$ export FOO=

or forgot to set it in settings.py

# settings.py
os.environ['FOO'] = ''

Update: if in doubt, check out these one-liners

>>> import os; os.environ['FOO'] = ''; print(os.getenv('FOO', 'bar'))

$ FOO= python -c "import os; print(os.getenv('FOO', 'bar'))"

回答 3

如果您要检查是否未设置多个环境变量,可以执行以下操作:

import os

MANDATORY_ENV_VARS = ["FOO", "BAR"]

for var in MANDATORY_ENV_VARS:
    if var not in os.environ:
        raise EnvironmentError("Failed because {} is not set.".format(var))

In case you want to check if multiple env variables are not set, you can do the following:

import os

MANDATORY_ENV_VARS = ["FOO", "BAR"]

for var in MANDATORY_ENV_VARS:
    if var not in os.environ:
        raise EnvironmentError("Failed because {} is not set.".format(var))

回答 4

我的评论可能与给定的标签无关。但是,我是从搜索中转到此页面的。我一直在寻找R中的类似支票,并在@hugovdbeg帖子的帮助下提出了以下内容。我希望这对在R中寻求类似解决方案的人有所帮助

'USERNAME' %in% names(Sys.getenv())

My comment might not be relevant to the tags given. However, I was lead to this page from my search. I was looking for similar check in R and I came up the following with the help of @hugovdbeg post. I hope it would be helpful for someone who is looking for similar solution in R

'USERNAME' %in% names(Sys.getenv())

提取正则表达式匹配项的一部分

问题:提取正则表达式匹配项的一部分

我想要一个正则表达式从HTML页面提取标题。目前我有这个:

title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
    title = title.replace('<title>', '').replace('</title>', '') 

是否有一个正则表达式仅提取<title>的内容,所以我不必删除标签?

I want a regular expression to extract the title from a HTML page. Currently I have this:

title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
    title = title.replace('<title>', '').replace('</title>', '') 

Is there a regular expression to extract just the contents of <title> so I don’t have to remove the tags?


回答 0

( )在正则表达式和group(1)python中检索捕获的字符串(re.search将返回None如果没有找到结果,所以不要用group()直接):

title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)

if title_search:
    title = title_search.group(1)

Use ( ) in regexp and group(1) in python to retrieve the captured string (re.search will return None if it doesn’t find the result, so don’t use group() directly):

title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)

if title_search:
    title = title_search.group(1)

回答 1

请注意,通过开始Python 3.8并引入赋值表达式(PEP 572):=运算符),可以通过在if条件中直接将匹配结果捕获为变量并将其在条件体内重复使用,从而对KrzysztofKrasoń解决方案进行一些改进:

# pattern = '<title>(.*)</title>'
# text = '<title>hello</title>'
if match := re.search(pattern, text, re.IGNORECASE):
  title = match.group(1)
# hello

Note that starting Python 3.8, and the introduction of assignment expressions (PEP 572) (:= operator), it’s possible to improve a bit on Krzysztof Krasoń’s solution by capturing the match result directly within the if condition as a variable and re-use it in the condition’s body:

# pattern = '<title>(.*)</title>'
# text = '<title>hello</title>'
if match := re.search(pattern, text, re.IGNORECASE):
  title = match.group(1)
# hello

回答 2

尝试使用捕获组:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

Try using capturing groups:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

回答 3

re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)
re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)

回答 4

我可以推荐你去美丽汤。汤是一个很好的库,可以解析您的所有html文档。

soup = BeatifulSoup(html_doc)
titleName = soup.title.name

May I recommend you to Beautiful Soup. Soup is a very good lib to parse all of your html document.

soup = BeatifulSoup(html_doc)
titleName = soup.title.name

回答 5

尝试:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

Try:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

回答 6

提供的代码段不能应付Exceptions 我的建议

getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]

如果未找到模式或第一个匹配项,则默认情况下返回空字符串。

The provided pieces of code do not cope with Exceptions May I suggest

getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]

This returns an empty string by default if the pattern has not been found, or the first match.


回答 7

我认为这足够了:

#!python
import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
pattern.search(text)

…假设您的文本(HTML)位于名为“ text”的变量中。

这也假定没有其他HTML标记可以合法地嵌入HTML TITLE标记内部,并且没有办法合法地将任何其他<字符嵌入这样的容器/块中。

但是

不要在Python中使用正则表达式进行HTML解析。使用HTML解析器!(除非您要编写完整的解析器,否则当标准库中已经包含各种HTML,SGML和XML解析器时,这将是一项额外的工作。

如果您处理“真实世界” 标记汤 HTML(通常不符合任何SGML / XML验证器),请使用BeautifulSoup包。它尚未出现在标准库中,但为此目的广泛建议使用。

另一个选项是:lxml …,它是为结构正确(符合标准的HTML)编写的。但是它可以选择退回到使用BeautifulSoup作为解析器:ElementSoup

I’d think this should suffice:

#!python
import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
pattern.search(text)

… assuming that your text (HTML) is in a variable named “text.”

This also assumes that there are not other HTML tags which can be legally embedded inside of an HTML TITLE tag and no way to legally embed any other < character within such a container/block.

However

Don’t use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you’re going to write a full parser, which would be a of extra work when various HTML, SGML and XML parsers are already in the standard libraries.

If your handling “real world” tag soup HTML (which is frequently non-conforming to any SGML/XML validator) then use the BeautifulSoup package. It isn’t in the standard libraries (yet) but is wide recommended for this purpose.

Another option is: lxml … which is written for properly structured (standards conformant) HTML. But it has an option to fallback to using BeautifulSoup as a parser: ElementSoup.


Python列表目录,子目录和文件

问题:Python列表目录,子目录和文件

我试图制作一个脚本来列出给定目录中的所有目录,子目录和文件。
我尝试了这个:

import sys,os

root = "/home/patate/directory/"
path = os.path.join(root, "targetdirectory")

for r,d,f in os.walk(path):
    for file in f:
        print os.path.join(root,file)

不幸的是,它无法正常工作。
我得到所有文件,但没有完整的路径。

例如,如果dir结构为:

/home/patate/directory/targetdirectory/123/456/789/file.txt

它会打印:

/home/patate/directory/targetdirectory/file.txt

我需要的是第一个结果。任何帮助将不胜感激!谢谢。

I’m trying to make a script to list all directory, subdirectory, and files in a given directory.
I tried this:

import sys,os

root = "/home/patate/directory/"
path = os.path.join(root, "targetdirectory")

for r,d,f in os.walk(path):
    for file in f:
        print os.path.join(root,file)

Unfortunatly it doesn’t work properly.
I get all the files, but not their complete paths.

For example if the dir struct would be:

/home/patate/directory/targetdirectory/123/456/789/file.txt

It would print:

/home/patate/directory/targetdirectory/file.txt

What I need is the first result. Any help would be greatly appreciated! Thanks.


回答 0

使用os.path.join来连接的目录和文件

for path, subdirs, files in os.walk(root):
    for name in files:
        print os.path.join(path, name)

请注意,path并没有root在串联中使用,因为使用root会不正确。


在Python 3.4中,添加了pathlib模块以简化路径操作。因此,等效于os.path.join

pathlib.PurePath(path, name)

好处pathlib是您可以在路径上使用各种有用的方法。如果使用具体的Path变体,您还可以通过它们进行实际的OS调用,例如,进入目录,删除路径,打开其指向的文件等等。

Use os.path.join to concatenate the directory and file name:

for path, subdirs, files in os.walk(root):
    for name in files:
        print os.path.join(path, name)

Note the usage of path and not root in the concatenation, since using root would be incorrect.


In Python 3.4, the pathlib module was added for easier path manipulations. So the equivalent to os.path.join would be:

pathlib.PurePath(path, name)

The advantage of pathlib is that you can use a variety of useful methods on paths. If you use the concrete Path variant you can also do actual OS calls through them, like changing into a directory, deleting the path, opening the file it points to and much more.


回答 1

以防万一…获取目录和子目录中与某个模式匹配的所有文件(例如* .py):

import os
from fnmatch import fnmatch

root = '/some/directory'
pattern = "*.py"

for path, subdirs, files in os.walk(root):
    for name in files:
        if fnmatch(name, pattern):
            print os.path.join(path, name)

Just in case… Getting all files in the directory and subdirectories matching some pattern (*.py for example):

import os
from fnmatch import fnmatch

root = '/some/directory'
pattern = "*.py"

for path, subdirs, files in os.walk(root):
    for name in files:
        if fnmatch(name, pattern):
            print os.path.join(path, name)

回答 2

这里是单线:

import os

[val for sublist in [[os.path.join(i[0], j) for j in i[2]] for i in os.walk('./')] for val in sublist]
# Meta comment to ease selecting text

最外面的val for sublist in ...循环将列表平整为一维。该j循环收集每个文件名前缀的列表,并将其加入到当前的路径。最后,i循环遍历所有目录和子目录。

本示例./os.walk(...)调用中使用了硬编码的路径,您可以补充所需的任何路径字符串。

注意:os.path.expanduser和/或os.path.expandvars可用于路径字符串,例如~/

扩展此示例:

它易于添加文件基本名测试和目录名测试。

例如,测试*.jpg文件:

... for j in i[2] if j.endswith('.jpg')] ...

此外,排除.git目录:

... for i in os.walk('./') if '.git' not in i[0].split('/')]

Here is a one-liner:

import os

[val for sublist in [[os.path.join(i[0], j) for j in i[2]] for i in os.walk('./')] for val in sublist]
# Meta comment to ease selecting text

The outer most val for sublist in ... loop flattens the list to be one dimensional. The j loop collects a list of every file basename and joins it to the current path. Finally, the i loop iterates over all directories and sub directories.

This example uses the hard-coded path ./ in the os.walk(...) call, you can supplement any path string you like.

Note: os.path.expanduser and/or os.path.expandvars can be used for paths strings like ~/

Extending this example:

Its easy to add in file basename tests and directoryname tests.

For Example, testing for *.jpg files:

... for j in i[2] if j.endswith('.jpg')] ...

Additionally, excluding the .git directory:

... for i in os.walk('./') if '.git' not in i[0].split('/')]

回答 3

无法发表评论,请在此处写答案。这是我所看到的最清晰的一行:

import os
[os.path.join(path, name) for path, subdirs, files in os.walk(root) for name in files]

Couldn’t comment so writing answer here. This is the clearest one-line I have seen:

import os
[os.path.join(path, name) for path, subdirs, files in os.walk(root) for name in files]

回答 4

您可以看一下我制作的这个样本。它使用了os.path.walk函数,该函数已被提倡,使用列表存储所有文件路径

root = "Your root directory"
ex = ".txt"
where_to = "Wherever you wanna write your file to"
def fileWalker(ext,dirname,names):
    '''
    checks files in names'''
    pat = "*" + ext[0]
    for f in names:
        if fnmatch.fnmatch(f,pat):
            ext[1].append(os.path.join(dirname,f))


def writeTo(fList):

    with open(where_to,"w") as f:
        for di_r in fList:
            f.write(di_r + "\n")






if __name__ == '__main__':
    li = []
    os.path.walk(root,fileWalker,[ex,li])

    writeTo(li)

You can take a look at this sample I made. It uses the os.path.walk function which is deprecated beware.Uses a list to store all the filepaths

root = "Your root directory"
ex = ".txt"
where_to = "Wherever you wanna write your file to"
def fileWalker(ext,dirname,names):
    '''
    checks files in names'''
    pat = "*" + ext[0]
    for f in names:
        if fnmatch.fnmatch(f,pat):
            ext[1].append(os.path.join(dirname,f))


def writeTo(fList):

    with open(where_to,"w") as f:
        for di_r in fList:
            f.write(di_r + "\n")






if __name__ == '__main__':
    li = []
    os.path.walk(root,fileWalker,[ex,li])

    writeTo(li)

回答 5

一线简单一点:

import os
from itertools import product, chain

chain.from_iterable([[os.sep.join(w) for w in product([i[0]], i[2])] for i in os.walk(dir)])

A bit simpler one-liner:

import os
from itertools import product, chain

chain.from_iterable([[os.sep.join(w) for w in product([i[0]], i[2])] for i in os.walk(dir)])

回答 6

由于这里的每个示例都仅使用walk(with join),因此我想展示一个不错的示例并与进行比较listdir

import os, time

def listFiles1(root): # listdir
    allFiles = []; walk = [root]
    while walk:
        folder = walk.pop(0)+"/"; items = os.listdir(folder) # items = folders + files
        for i in items: i=folder+i; (walk if os.path.isdir(i) else allFiles).append(i)
    return allFiles

def listFiles2(root): # listdir/join (takes ~1.4x as long) (and uses '\\' instead)
    allFiles = []; walk = [root]
    while walk:
        folder = walk.pop(0); items = os.listdir(folder) # items = folders + files
        for i in items: i=os.path.join(folder,i); (walk if os.path.isdir(i) else allFiles).append(i)
    return allFiles

def listFiles3(root): # walk (takes ~1.5x as long)
    allFiles = []
    for folder, folders, files in os.walk(root):
        for file in files: allFiles+=[folder.replace("\\","/")+"/"+file] # folder+"\\"+file still ~1.5x
    return allFiles

def listFiles4(root): # walk/join (takes ~1.6x as long) (and uses '\\' instead)
    allFiles = []
    for folder, folders, files in os.walk(root):
        for file in files: allFiles+=[os.path.join(folder,file)]
    return allFiles


for i in range(100): files = listFiles1("src") # warm up

start = time.time()
for i in range(100): files = listFiles1("src") # listdir
print("Time taken: %.2fs"%(time.time()-start)) # 0.28s

start = time.time()
for i in range(100): files = listFiles2("src") # listdir and join
print("Time taken: %.2fs"%(time.time()-start)) # 0.38s

start = time.time()
for i in range(100): files = listFiles3("src") # walk
print("Time taken: %.2fs"%(time.time()-start)) # 0.42s

start = time.time()
for i in range(100): files = listFiles4("src") # walk and join
print("Time taken: %.2fs"%(time.time()-start)) # 0.47s

因此,如您所见,该listdir版本效率更高。(那join很慢)

Since every example here is just using walk (with join), i’d like to show a nice example and comparison with listdir:

import os, time

def listFiles1(root): # listdir
    allFiles = []; walk = [root]
    while walk:
        folder = walk.pop(0)+"/"; items = os.listdir(folder) # items = folders + files
        for i in items: i=folder+i; (walk if os.path.isdir(i) else allFiles).append(i)
    return allFiles

def listFiles2(root): # listdir/join (takes ~1.4x as long) (and uses '\\' instead)
    allFiles = []; walk = [root]
    while walk:
        folder = walk.pop(0); items = os.listdir(folder) # items = folders + files
        for i in items: i=os.path.join(folder,i); (walk if os.path.isdir(i) else allFiles).append(i)
    return allFiles

def listFiles3(root): # walk (takes ~1.5x as long)
    allFiles = []
    for folder, folders, files in os.walk(root):
        for file in files: allFiles+=[folder.replace("\\","/")+"/"+file] # folder+"\\"+file still ~1.5x
    return allFiles

def listFiles4(root): # walk/join (takes ~1.6x as long) (and uses '\\' instead)
    allFiles = []
    for folder, folders, files in os.walk(root):
        for file in files: allFiles+=[os.path.join(folder,file)]
    return allFiles


for i in range(100): files = listFiles1("src") # warm up

start = time.time()
for i in range(100): files = listFiles1("src") # listdir
print("Time taken: %.2fs"%(time.time()-start)) # 0.28s

start = time.time()
for i in range(100): files = listFiles2("src") # listdir and join
print("Time taken: %.2fs"%(time.time()-start)) # 0.38s

start = time.time()
for i in range(100): files = listFiles3("src") # walk
print("Time taken: %.2fs"%(time.time()-start)) # 0.42s

start = time.time()
for i in range(100): files = listFiles4("src") # walk and join
print("Time taken: %.2fs"%(time.time()-start)) # 0.47s

So as you can see for yourself, the listdir version is much more efficient. (and that join is slow)


Python的json模块,将int字典键转换为字符串

问题:Python的json模块,将int字典键转换为字符串

我发现运行以下命令时,python的json模块(自2.6起包含)将int字典键转换为字符串。

>>> import json
>>> releases = {1: "foo-v0.1"}
>>> json.dumps(releases)
'{"1": "foo-v0.1"}'

有什么简单的方法可以将键保留为int,而无需在转储和加载时解析字符串。我相信可以使用json模块提供的钩子,但这仍然需要解析。我可能会忽略一个论点吗?欢呼声,查兹

子问题:感谢您的回答。看到j​​son像我所担心的那样工作,是否有一种简单的方法可以通过解析转储的输出来传达密钥类型?我还要注意执行转储的代码以及从服务器下载json对象并加载它的代码均由我编写。

I have found that when the following is run, python’s json module (included since 2.6) converts int dictionary keys to strings.

>>> import json
>>> releases = {1: "foo-v0.1"}
>>> json.dumps(releases)
'{"1": "foo-v0.1"}'

Is there any easy way to preserve the key as an int, without needing to parse the string on dump and load. I believe it would be possible using the hooks provided by the json module, but again this still requires parsing. Is there possibly an argument I have overlooked? cheers, chaz

Sub-question: Thanks for the answers. Seeing as json works as I feared, is there an easy way to convey key type by maybe parsing the output of dumps? Also I should note the code doing the dumping and the code downloading the json object from a server and loading it, are both written by me.


回答 0

这是可能困扰您的各种映射集合之间的细微差别之一。JSON将键视为字符串;Python支持仅在类型上不同的独特键。

在Python中(显然在Lua中),映射的键(分别是字典或表)是对象引用。在Python中,它们必须是不可变的类型,或者它们必须是实现__hash__方法的对象。(Lua的文档建议即使对于可变对象,它也会自动将对象的ID用作哈希/键,并依赖于字符串插入以确保等效的字符串映射到相同的对象)。

在Perl,Javascript,awk和许多其他语言中,哈希,关联数组或给定语言所调用的名称的键是字符串(或Perl中的“标量”)。在Perl $foo{1}, $foo{1.0}, and $foo{"1"}是在相同的对应的所有引用%foo—关键是评估作为标!

JSON是从Javascript序列化技术开始的。(JSON代表Ĵ AVA 小号 CRIPT ö bject Ñ浮选。)当然它实现为它的映射符号的语义这与它的映射语义一致。

如果序列化的两端都将是Python,那么最好使用咸菜。如果您真的需要将这些从JSON转换回本机Python对象,我想您有两种选择。首先try: ... except: ...,如果字典查找失败,您可以尝试()将任何键转换为数字。或者,如果将代码添加到另一端(此JSON数据的序列化器或生成器),则可以让它对每个键值执行JSON序列化—将其作为键列表提供。(然后,您的Python代码将首先在键列表上进行迭代,将它们实例化/反序列化为本地Python对象…,然后使用那些键来访问映射中的值)。

This is one of those subtle differences among various mapping collections that can bite you. JSON treats keys as strings; Python supports distinct keys differing only in type.

In Python (and apparently in Lua) the keys to a mapping (dictionary or table, respectively) are object references. In Python they must be immutable types, or they must be objects which implement a __hash__ method. (The Lua docs suggest that it automatically uses the object’s ID as a hash/key even for mutable objects and relies on string interning to ensure that equivalent strings map to the same objects).

In Perl, Javascript, awk and many other languages the keys for hashes, associative arrays or whatever they’re called for the given language, are strings (or “scalars” in Perl). In perl $foo{1}, $foo{1.0}, and $foo{"1"} are all references to the same mapping in %foo — the key is evaluated as a scalar!

JSON started as a Javascript serialization technology. (JSON stands for JavaScript Object Notation.) Naturally it implements semantics for its mapping notation which are consistent with its mapping semantics.

If both ends of your serialization are going to be Python then you’d be better off using pickles. If you really need to convert these back from JSON into native Python objects I guess you have a couple of choices. First you could try (try: ... except: ...) to convert any key to a number in the event of a dictionary look-up failure. Alternatively, if you add code to the other end (the serializer or generator of this JSON data) then you could have it perform a JSON serialization on each of the key values — providing those as a list of keys. (Then your Python code would first iterate over the list of keys, instantiating/deserializing them into native Python objects … and then use those for access the values out of the mapping).


回答 1

不,JavaScript中没有数字键之类的东西。所有对象属性都将转换为String。

var a= {1: 'a'};
for (k in a)
    alert(typeof k); // 'string'

这可能会导致一些奇怪的行为:

a[999999999999999999999]= 'a'; // this even works on Array
alert(a[1000000000000000000000]); // 'a'
alert(a['999999999999999999999']); // fail
alert(a['1e+21']); // 'a'

JavaScript对象并不是真正正确的映射,因为您会在Python之类的语言中理解它,并且使用非String的键会导致怪异。这就是为什么JSON总是显式地将键写为字符串的原因,即使在不需要的地方也是如此。

No, there is no such thing as a Number key in JavaScript. All object properties are converted to String.

var a= {1: 'a'};
for (k in a)
    alert(typeof k); // 'string'

This can lead to some curious-seeming behaviours:

a[999999999999999999999]= 'a'; // this even works on Array
alert(a[1000000000000000000000]); // 'a'
alert(a['999999999999999999999']); // fail
alert(a['1e+21']); // 'a'

JavaScript Objects aren’t really proper mappings as you’d understand it in languages like Python, and using keys that aren’t String results in weirdness. This is why JSON always explicitly writes keys as strings, even where it doesn’t look necessary.


回答 2

或者,您也可以尝试在使用json进行编码的同时将字典转换为[(k1,v1),(k2,v2)]格式的列表,并在将其解码后将其转换回字典。


>>>> import json
>>>> json.dumps(releases.items())
    '[[1, "foo-v0.1"]]'
>>>> releases = {1: "foo-v0.1"}
>>>> releases == dict(json.loads(json.dumps(releases.items())))
     True
我相信这将需要更多的工作,例如具有某种标志,以识别从json解码回去后将要转换为字典的所有参数。

Alternatively you can also try converting dictionary to a list of [(k1,v1),(k2,v2)] format while encoding it using json, and converting it back to dictionary after decoding it back.


>>>> import json
>>>> json.dumps(releases.items())
    '[[1, "foo-v0.1"]]'
>>>> releases = {1: "foo-v0.1"}
>>>> releases == dict(json.loads(json.dumps(releases.items())))
     True
I believe this will need some more work like having some sort of flag to identify what all parameters to be converted to dictionary after decoding it back from json.

回答 3

回答您的子问题:

可以通过使用 json.loads(jsonDict, object_hook=jsonKeys2int)

def jsonKeys2int(x):
    if isinstance(x, dict):
            return {int(k):v for k,v in x.items()}
    return x

此功能也适用于嵌套词典,并使用词典理解。

如果您也想强制转换值,请使用:

def jsonKV2int(x):
    if isinstance(x, dict):
            return {int(k):(int(v) if isinstance(v, unicode) else v) for k,v in x.items()}
    return x

它测试值的实例并仅在它们是字符串对象(确切地说是unicode)时才将其强制转换。

这两个函数均假定键(和值)为整数。

谢谢:

如何在字典理解中使用if / else?

在字典中将字符串键转换为int

Answering your subquestion:

It can be accomplished by using json.loads(jsonDict, object_hook=jsonKeys2int)

def jsonKeys2int(x):
    if isinstance(x, dict):
            return {int(k):v for k,v in x.items()}
    return x

This function will also work for nested dicts and uses a dict comprehension.

If you want to to cast the values too, use:

def jsonKV2int(x):
    if isinstance(x, dict):
            return {int(k):(int(v) if isinstance(v, unicode) else v) for k,v in x.items()}
    return x

Which tests the instance of the values and casts them only if they are strings objects (unicode to be exact).

Both functions assumes keys (and values) to be integers.

Thanks to:

How to use if/else in a dictionary comprehension?

Convert a string key to int in a Dictionary


回答 4

我被同样的问题咬了。正如其他人指出的那样,在JSON中,映射键必须是字符串。您可以做两件事之一。您可以使用不太严格的JSON库,例如demjson,它允许整数字符串。如果没有其他程序(或其他语言的其他语言)无法读取它,那么您应该可以。或者,您可以使用其他序列化语言。我不建议泡菜。它很难阅读,并非旨在确保安全。相反,我建议使用YAML,它几乎是JSON的超集,并且确实允许整数键。(至少PyYAML这样做。)

I’ve gotten bitten by the same problem. As others have pointed out, in JSON, the mapping keys must be strings. You can do one of two things. You can use a less strict JSON library, like demjson, which allows integer strings. If no other programs (or no other in other languages) are going to read it, then you should be okay. Or you can use a different serialization language. I wouldn’t suggest pickle. It’s hard to read, and is not designed to be secure. Instead, I’d suggest YAML, which is (nearly) a superset of JSON, and does allow integer keys. (At least PyYAML does.)


回答 5

使用将字典转换为字符串str(dict),然后执行以下操作将其转换回dict:

import ast
ast.literal_eval(string)

Convert the dictionary to be string by using str(dict) and then convert it back to dict by doing this:

import ast
ast.literal_eval(string)

回答 6

这是我的解决方案!我用过object_hook,当您嵌套时很有用json

>>> import json
>>> json_data = '{"1": "one", "2": {"-3": "minus three", "4": "four"}}'
>>> py_dict = json.loads(json_data, object_hook=lambda d: {int(k) if k.lstrip('-').isdigit() else k: v for k, v in d.items()})

>>> py_dict
{1: 'one', 2: {-3: 'minus three', 4: 'four'}}

仅用于将json键解析为int的过滤器。您也可以将int(v) if v.lstrip('-').isdigit() else v过滤器用于json值。

Here is my solution! I used object_hook, it is useful when you have nested json

>>> import json
>>> json_data = '{"1": "one", "2": {"-3": "minus three", "4": "four"}}'
>>> py_dict = json.loads(json_data, object_hook=lambda d: {int(k) if k.lstrip('-').isdigit() else k: v for k, v in d.items()})

>>> py_dict
{1: 'one', 2: {-3: 'minus three', 4: 'four'}}

There is filter only for parsing json key to int. You can use int(v) if v.lstrip('-').isdigit() else v filter for json value too.


回答 7

我对Murmel的答案做了一个非常简单的扩展,我认为它可以在相当随意的字典(包括嵌套字典)上工作,前提是它首先可以被JSON转储。任何可以解释为整数的键都将转换为int。毫无疑问,这不是很有效,但是它可以实现我存储到json字符串和从json字符串加载的目的。

def convert_keys_to_int(d: dict):
    new_dict = {}
    for k, v in d.items():
        try:
            new_key = int(k)
        except ValueError:
            new_key = k
        if type(v) == dict:
            v = _convert_keys_to_int(v)
        new_dict[new_key] = v
    return new_dict

假设原始字典中的所有键都是整数(如果可以将它们强制转换为int),则在将其存储为json后将返回原始字典。例如

>>>d = {1: 3, 2: 'a', 3: {1: 'a', 2: 10}, 4: {'a': 2, 'b': 10}}
>>>convert_keys_to_int(json.loads(json.dumps(d)))  == d
True

I made a very simple extension of Murmel’s answer which I think will work on a pretty arbitrary dictionary (including nested) assuming it can be dumped by JSON in the first place. Any keys which can be interpreted as integers will be cast to int. No doubt this is not very efficient, but it works for my purposes of storing to and loading from json strings.

def convert_keys_to_int(d: dict):
    new_dict = {}
    for k, v in d.items():
        try:
            new_key = int(k)
        except ValueError:
            new_key = k
        if type(v) == dict:
            v = _convert_keys_to_int(v)
        new_dict[new_key] = v
    return new_dict

Assuming that all keys in the original dict are integers if they can be cast to int, then this will return the original dictionary after storing as a json. e.g.

>>>d = {1: 3, 2: 'a', 3: {1: 'a', 2: 10}, 4: {'a': 2, 'b': 10}}
>>>convert_keys_to_int(json.loads(json.dumps(d)))  == d
True

回答 8

你可以写你json.dumps自己,这里是从例如djsonencoder.py。您可以像这样使用它:

assert dumps({1: "abc"}) == '{1: "abc"}'

You can write your json.dumps by yourself, here is a example from djson: encoder.py. You can use it like this:

assert dumps({1: "abc"}) == '{1: "abc"}'

在Python中,如何读取图像的exif数据?

问题:在Python中,如何读取图像的exif数据?

我正在使用PIL。如何将EXIF数据转换为字典?

I’m using PIL. How do I turn the EXIF data of a picture into a dictionary?


回答 0

试试这个:

import PIL.Image
img = PIL.Image.open('img.jpg')
exif_data = img._getexif()

这应该给您一个由EXIF数字标签索引的字典。如果您希望字典由实际的EXIF标记名称字符串索引,请尝试以下操作:

import PIL.ExifTags
exif = {
    PIL.ExifTags.TAGS[k]: v
    for k, v in img._getexif().items()
    if k in PIL.ExifTags.TAGS
}

You can use the _getexif() protected method of a PIL Image.

import PIL.Image
img = PIL.Image.open('img.jpg')
exif_data = img._getexif()

This should give you a dictionary indexed by EXIF numeric tags. If you want the dictionary indexed by the actual EXIF tag name strings, try something like:

import PIL.ExifTags
exif = {
    PIL.ExifTags.TAGS[k]: v
    for k, v in img._getexif().items()
    if k in PIL.ExifTags.TAGS
}

回答 1

您还可以使用ExifRead模块:

import exifread
# Open image file for reading (binary mode)
f = open(path_name, 'rb')

# Return Exif tags
tags = exifread.process_file(f)

You can also use the ExifRead module:

import exifread
# Open image file for reading (binary mode)
f = open(path_name, 'rb')

# Return Exif tags
tags = exifread.process_file(f)

回答 2

我用这个:

import os,sys
from PIL import Image
from PIL.ExifTags import TAGS

for (k,v) in Image.open(sys.argv[1])._getexif().iteritems():
        print '%s = %s' % (TAGS.get(k), v)

或获取特定字段:

def get_field (exif,field) :
  for (k,v) in exif.iteritems():
     if TAGS.get(k) == field:
        return v

exif = image._getexif()
print get_field(exif,'ExposureTime')

I use this:

import os,sys
from PIL import Image
from PIL.ExifTags import TAGS

for (k,v) in Image.open(sys.argv[1])._getexif().items():
        print('%s = %s' % (TAGS.get(k), v))

or to get a specific field:

def get_field (exif,field) :
  for (k,v) in exif.items():
     if TAGS.get(k) == field:
        return v

exif = image._getexif()
print get_field(exif,'ExposureTime')

回答 3

对于Python3.x和starting Pillow==6.0.0Image对象现在提供了getexif()一种返回的方法,<class 'PIL.Image.Exif'>或者None该图像没有EXIF数据。

Pillow 6.0.0发行说明

getexif()已添加,它返回一个Exif实例。可以像字典一样检索和设置值。保存JPEG,PNG或WEBP时,可以将实例作为exif参数传递,以包括输出图像中的所有更改。

所述Exif输出可以被简单地浇铸到一个dict,从而使EXIF数据然后可以作为一个常规的键-值对被访问dict。键是16位整数,可以使用ExifTags.TAGS模块映射到其字符串名称。

from PIL import Image, ExifTags

img = Image.open("sample.jpg")
img_exif = img.getexif()
print(type(img_exif))
# <class 'PIL.Image.Exif'>

if img_exif is None:
    print("Sorry, image has no exif data.")
else:
    img_exif_dict = dict(img_exif)
    print(img_exif_dict)
    # { ... 42035: 'FUJIFILM', 42036: 'XF23mmF2 R WR', 42037: '75A14188' ... }
    for key, val in img_exif_dict.items():
        if key in ExifTags.TAGS:
            print(f"{ExifTags.TAGS[key]}:{repr(val)}")
            # ExifVersion:b'0230'
            # ...
            # FocalLength:(2300, 100)
            # ColorSpace:1
            # FocalLengthIn35mmFilm:35
            # ...
            # Model:'X-T2'
            # Make:'FUJIFILM'
            # ...
            # DateTime:'2019:12:01 21:30:07'
            # ...

使用Python 3.6.8和Pillow==6.0.0

For Python3.x and starting Pillow==6.0.0, Image objects now provide a getexif() method that returns <class 'PIL.Image.Exif'> or None if the image has no EXIF data.

From Pillow 6.0.0 release notes:

getexif() has been added, which returns an Exif instance. Values can be retrieved and set like a dictionary. When saving JPEG, PNG or WEBP, the instance can be passed as an exif argument to include any changes in the output image.

As stated, the Exif output can simply be casted to a dict with the EXIF data accessible as regular key-value pairs. The keys are 16-bit integers that can be mapped to their string names using the ExifTags.TAGS module.

from PIL import Image, ExifTags

img = Image.open("sample.jpg")
img_exif = img.getexif()
print(type(img_exif))
# <class 'PIL.Image.Exif'>

if img_exif is None:
    print("Sorry, image has no exif data.")
else:
    img_exif_dict = dict(img_exif)
    print(img_exif_dict)
    # { ... 42035: 'FUJIFILM', 42036: 'XF23mmF2 R WR', 42037: '75A14188' ... }
    for key, val in img_exif_dict.items():
        if key in ExifTags.TAGS:
            print(f"{ExifTags.TAGS[key]}:{repr(val)}")
            # ExifVersion:b'0230'
            # ...
            # FocalLength:(2300, 100)
            # ColorSpace:1
            # FocalLengthIn35mmFilm:35
            # ...
            # Model:'X-T2'
            # Make:'FUJIFILM'
            # ...
            # DateTime:'2019:12:01 21:30:07'
            # ...

Tested with Python 3.6.8 and Pillow==6.0.0.


回答 4

import sys
import PIL
import PIL.Image as PILimage
from PIL import ImageDraw, ImageFont, ImageEnhance
from PIL.ExifTags import TAGS, GPSTAGS



class Worker(object):
    def __init__(self, img):
        self.img = img
        self.exif_data = self.get_exif_data()
        self.lat = self.get_lat()
        self.lon = self.get_lon()
        self.date =self.get_date_time()
        super(Worker, self).__init__()

    @staticmethod
    def get_if_exist(data, key):
        if key in data:
            return data[key]
        return None

    @staticmethod
    def convert_to_degress(value):
        """Helper function to convert the GPS coordinates
        stored in the EXIF to degress in float format"""
        d0 = value[0][0]
        d1 = value[0][1]
        d = float(d0) / float(d1)
        m0 = value[1][0]
        m1 = value[1][1]
        m = float(m0) / float(m1)

        s0 = value[2][0]
        s1 = value[2][1]
        s = float(s0) / float(s1)

        return d + (m / 60.0) + (s / 3600.0)

    def get_exif_data(self):
        """Returns a dictionary from the exif data of an PIL Image item. Also
        converts the GPS Tags"""
        exif_data = {}
        info = self.img._getexif()
        if info:
            for tag, value in info.items():
                decoded = TAGS.get(tag, tag)
                if decoded == "GPSInfo":
                    gps_data = {}
                    for t in value:
                        sub_decoded = GPSTAGS.get(t, t)
                        gps_data[sub_decoded] = value[t]

                    exif_data[decoded] = gps_data
                else:
                    exif_data[decoded] = value
        return exif_data

    def get_lat(self):
        """Returns the latitude and longitude, if available, from the 
        provided exif_data (obtained through get_exif_data above)"""
        # print(exif_data)
        if 'GPSInfo' in self.exif_data:
            gps_info = self.exif_data["GPSInfo"]
            gps_latitude = self.get_if_exist(gps_info, "GPSLatitude")
            gps_latitude_ref = self.get_if_exist(gps_info, 'GPSLatitudeRef')
            if gps_latitude and gps_latitude_ref:
                lat = self.convert_to_degress(gps_latitude)
                if gps_latitude_ref != "N":
                    lat = 0 - lat
                lat = str(f"{lat:.{5}f}")
                return lat
        else:
            return None

    def get_lon(self):
        """Returns the latitude and longitude, if available, from the 
        provided exif_data (obtained through get_exif_data above)"""
        # print(exif_data)
        if 'GPSInfo' in self.exif_data:
            gps_info = self.exif_data["GPSInfo"]
            gps_longitude = self.get_if_exist(gps_info, 'GPSLongitude')
            gps_longitude_ref = self.get_if_exist(gps_info, 'GPSLongitudeRef')
            if gps_longitude and gps_longitude_ref:
                lon = self.convert_to_degress(gps_longitude)
                if gps_longitude_ref != "E":
                    lon = 0 - lon
                lon = str(f"{lon:.{5}f}")
                return lon
        else:
            return None

    def get_date_time(self):
        if 'DateTime' in self.exif_data:
            date_and_time = self.exif_data['DateTime']
            return date_and_time 

if __name__ == '__main__':
    try:
        img = PILimage.open(sys.argv[1])
        image = Worker(img)
        lat = image.lat
        lon = image.lon
        date = image.date
        print(date, lat, lon)

    except Exception as e:
        print(e)
import sys
import PIL
import PIL.Image as PILimage
from PIL import ImageDraw, ImageFont, ImageEnhance
from PIL.ExifTags import TAGS, GPSTAGS



class Worker(object):
    def __init__(self, img):
        self.img = img
        self.exif_data = self.get_exif_data()
        self.lat = self.get_lat()
        self.lon = self.get_lon()
        self.date =self.get_date_time()
        super(Worker, self).__init__()

    @staticmethod
    def get_if_exist(data, key):
        if key in data:
            return data[key]
        return None

    @staticmethod
    def convert_to_degress(value):
        """Helper function to convert the GPS coordinates
        stored in the EXIF to degress in float format"""
        d0 = value[0][0]
        d1 = value[0][1]
        d = float(d0) / float(d1)
        m0 = value[1][0]
        m1 = value[1][1]
        m = float(m0) / float(m1)

        s0 = value[2][0]
        s1 = value[2][1]
        s = float(s0) / float(s1)

        return d + (m / 60.0) + (s / 3600.0)

    def get_exif_data(self):
        """Returns a dictionary from the exif data of an PIL Image item. Also
        converts the GPS Tags"""
        exif_data = {}
        info = self.img._getexif()
        if info:
            for tag, value in info.items():
                decoded = TAGS.get(tag, tag)
                if decoded == "GPSInfo":
                    gps_data = {}
                    for t in value:
                        sub_decoded = GPSTAGS.get(t, t)
                        gps_data[sub_decoded] = value[t]

                    exif_data[decoded] = gps_data
                else:
                    exif_data[decoded] = value
        return exif_data

    def get_lat(self):
        """Returns the latitude and longitude, if available, from the 
        provided exif_data (obtained through get_exif_data above)"""
        # print(exif_data)
        if 'GPSInfo' in self.exif_data:
            gps_info = self.exif_data["GPSInfo"]
            gps_latitude = self.get_if_exist(gps_info, "GPSLatitude")
            gps_latitude_ref = self.get_if_exist(gps_info, 'GPSLatitudeRef')
            if gps_latitude and gps_latitude_ref:
                lat = self.convert_to_degress(gps_latitude)
                if gps_latitude_ref != "N":
                    lat = 0 - lat
                lat = str(f"{lat:.{5}f}")
                return lat
        else:
            return None

    def get_lon(self):
        """Returns the latitude and longitude, if available, from the 
        provided exif_data (obtained through get_exif_data above)"""
        # print(exif_data)
        if 'GPSInfo' in self.exif_data:
            gps_info = self.exif_data["GPSInfo"]
            gps_longitude = self.get_if_exist(gps_info, 'GPSLongitude')
            gps_longitude_ref = self.get_if_exist(gps_info, 'GPSLongitudeRef')
            if gps_longitude and gps_longitude_ref:
                lon = self.convert_to_degress(gps_longitude)
                if gps_longitude_ref != "E":
                    lon = 0 - lon
                lon = str(f"{lon:.{5}f}")
                return lon
        else:
            return None

    def get_date_time(self):
        if 'DateTime' in self.exif_data:
            date_and_time = self.exif_data['DateTime']
            return date_and_time 

if __name__ == '__main__':
    try:
        img = PILimage.open(sys.argv[1])
        image = Worker(img)
        lat = image.lat
        lon = image.lon
        date = image.date
        print(date, lat, lon)

    except Exception as e:
        print(e)

回答 5

我发现使用._getexif在更高版本的python中不起作用,而且,它是受保护的类,应该尽可能避免使用它。在深入调试器之后,这是我发现获取图像的EXIF数据的最佳方法:

from PIL import Image

def get_exif(path):
    return Image.open(path).info['parsed_exif']

这将返回图像的所有EXIF数据的字典。

注意:对于Python3.x,请使用Pillow而不是PIL

I have found that using ._getexif doesn’t work in higher python versions, moreover, it is a protected class and one should avoid using it if possible. After digging around the debugger this is what I found to be the best way to get the EXIF data for an image:

from PIL import Image

def get_exif(path):
    return Image.open(path).info['parsed_exif']

This returns a dictionary of all the EXIF data of an image.

Note: For Python3.x use Pillow instead of PIL


回答 6

这是一个可能更容易阅读的内容。希望这会有所帮助。

from PIL import Image
from PIL import ExifTags

exifData = {}
img = Image.open(picture.jpg)
exifDataRaw = img._getexif()
for tag, value in exifDataRaw.items():
    decodedTag = ExifTags.TAGS.get(tag, tag)
    exifData[decodedTag] = value

Here’s the one that may be little easier to read. Hope this is helpful.

from PIL import Image
from PIL import ExifTags

exifData = {}
img = Image.open(picture.jpg)
exifDataRaw = img._getexif()
for tag, value in exifDataRaw.items():
    decodedTag = ExifTags.TAGS.get(tag, tag)
    exifData[decodedTag] = value

回答 7

我通常使用pyexiv2在JPG文件中设置exif信息,但是当我在脚本QGIS脚本中导入库时崩溃。

我找到了使用库exif的解决方案:

https://pypi.org/project/exif/

它是如此易于使用,而且使用Qgis我没有任何问题。

在此代码中,我将GPS坐标插入屏幕快照:

from exif import Image
with open(file_name, 'rb') as image_file:
    my_image = Image(image_file)

my_image.make = "Python"
my_image.gps_latitude_ref=exif_lat_ref
my_image.gps_latitude=exif_lat
my_image.gps_longitude_ref= exif_lon_ref
my_image.gps_longitude= exif_lon

with open(file_name, 'wb') as new_image_file:
    new_image_file.write(my_image.get_file())

I usually use pyexiv2 to set exif information in JPG files, but when I import the library in a script QGIS script crash.

I found a solution using the library exif:

https://pypi.org/project/exif/

It’s so easy to use, and with Qgis I don,’t have any problem.

In this code I insert GPS coordinates to a snapshot of screen:

from exif import Image
with open(file_name, 'rb') as image_file:
    my_image = Image(image_file)

my_image.make = "Python"
my_image.gps_latitude_ref=exif_lat_ref
my_image.gps_latitude=exif_lat
my_image.gps_longitude_ref= exif_lon_ref
my_image.gps_longitude= exif_lon

with open(file_name, 'wb') as new_image_file:
    new_image_file.write(my_image.get_file())

使用SQLAlchemy ORM批量插入

问题:使用SQLAlchemy ORM批量插入

有什么方法可以让SQLAlchemy进行批量插入,而不是插入每个对象。即

在做:

INSERT INTO `foo` (`bar`) VALUES (1), (2), (3)

而不是:

INSERT INTO `foo` (`bar`) VALUES (1)
INSERT INTO `foo` (`bar`) VALUES (2)
INSERT INTO `foo` (`bar`) VALUES (3)

我刚刚将一些代码转换为使用sqlalchemy而不是原始sql,尽管现在使用它起来要好得多,但现在似乎要慢一些(最多10倍),我想知道这是否是原因。

也许我可以更有效地使用会话来改善这种情况。目前,我已经添加了一些东西,autoCommit=False并做了一个session.commit()。尽管如果在其他地方更改了数据库,这似乎会使数据过时,例如,即使我执行新查询,我仍然可以返回旧结果?

谢谢你的帮助!

Is there any way to get SQLAlchemy to do a bulk insert rather than inserting each individual object. i.e.,

doing:

INSERT INTO `foo` (`bar`) VALUES (1), (2), (3)

rather than:

INSERT INTO `foo` (`bar`) VALUES (1)
INSERT INTO `foo` (`bar`) VALUES (2)
INSERT INTO `foo` (`bar`) VALUES (3)

I’ve just converted some code to use sqlalchemy rather than raw sql and although it is now much nicer to work with it seems to be slower now (up to a factor of 10), I’m wondering if this is the reason.

May be I could improve the situation using sessions more efficiently. At the moment I have autoCommit=False and do a session.commit() after I’ve added some stuff. Although this seems to cause the data to go stale if the DB is changed elsewhere, like even if I do a new query I still get old results back?

Thanks for your help!


回答 0

SQLAlchemy在版本中引入了该功能1.0.0

批量操作-SQLAlchemy文档

通过这些操作,您现在可以批量插入或更新!

例如,您可以执行以下操作:

s = Session()
objects = [
    User(name="u1"),
    User(name="u2"),
    User(name="u3")
]
s.bulk_save_objects(objects)
s.commit()

在这里,将制成大量插入物。

SQLAlchemy introduced that in version 1.0.0:

Bulk operations – SQLAlchemy docs

With these operations, you can now do bulk inserts or updates!

For instance, you can do:

s = Session()
objects = [
    User(name="u1"),
    User(name="u2"),
    User(name="u3")
]
s.bulk_save_objects(objects)
s.commit()

Here, a bulk insert will be made.


回答 1

sqlalchemy文档对可用于批量插入的各种技术的性能进行了总结

ORM基本上不是用于高性能批量插入的-这是SQLAlchemy除了将ORM作为一流组件之外还提供Core的全部原因。

对于快速批量插入的用例,ORM所基于的SQL生成和执行系统是Core的一部分。直接使用该系统,我们可以产生与直接使用原始数据库API相比具有竞争力的INSERT。

另外,SQLAlchemy ORM提供了Bulk Operations方法套件,该套件提供了到工作单元过程各部分的挂钩,以便发出基于ORM的自动化程度较低的Core级INSERT和UPDATE构造。

下面的示例说明了基于时间的测试,该测试针对从自动程度最高到最少的几种不同的行插入方法。使用cPython 2.7,可以观察到运行时:

classics-MacBook-Pro:sqlalchemy classic$ python test.py
SQLAlchemy ORM: Total time for 100000 records 12.0471920967 secs
SQLAlchemy ORM pk given: Total time for 100000 records 7.06283402443 secs
SQLAlchemy ORM bulk_save_objects(): Total time for 100000 records 0.856323003769 secs
SQLAlchemy Core: Total time for 100000 records 0.485800027847 secs
sqlite3: Total time for 100000 records 0.487842082977 sec

脚本:

import time
import sqlite3

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String,  create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

Base = declarative_base()
DBSession = scoped_session(sessionmaker())
engine = None


class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))


def init_sqlalchemy(dbname='sqlite:///sqlalchemy.db'):
    global engine
    engine = create_engine(dbname, echo=False)
    DBSession.remove()
    DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)


def test_sqlalchemy_orm(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in xrange(n):
        customer = Customer()
        customer.name = 'NAME ' + str(i)
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print(
        "SQLAlchemy ORM: Total time for " + str(n) +
        " records " + str(time.time() - t0) + " secs")


def test_sqlalchemy_orm_pk_given(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in xrange(n):
        customer = Customer(id=i+1, name="NAME " + str(i))
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print(
        "SQLAlchemy ORM pk given: Total time for " + str(n) +
        " records " + str(time.time() - t0) + " secs")


def test_sqlalchemy_orm_bulk_insert(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    n1 = n
    while n1 > 0:
        n1 = n1 - 10000
        DBSession.bulk_insert_mappings(
            Customer,
            [
                dict(name="NAME " + str(i))
                for i in xrange(min(10000, n1))
            ]
        )
    DBSession.commit()
    print(
        "SQLAlchemy ORM bulk_save_objects(): Total time for " + str(n) +
        " records " + str(time.time() - t0) + " secs")


def test_sqlalchemy_core(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    engine.execute(
        Customer.__table__.insert(),
        [{"name": 'NAME ' + str(i)} for i in xrange(n)]
    )
    print(
        "SQLAlchemy Core: Total time for " + str(n) +
        " records " + str(time.time() - t0) + " secs")


def init_sqlite3(dbname):
    conn = sqlite3.connect(dbname)
    c = conn.cursor()
    c.execute("DROP TABLE IF EXISTS customer")
    c.execute(
        "CREATE TABLE customer (id INTEGER NOT NULL, "
        "name VARCHAR(255), PRIMARY KEY(id))")
    conn.commit()
    return conn


def test_sqlite3(n=100000, dbname='sqlite3.db'):
    conn = init_sqlite3(dbname)
    c = conn.cursor()
    t0 = time.time()
    for i in xrange(n):
        row = ('NAME ' + str(i),)
        c.execute("INSERT INTO customer (name) VALUES (?)", row)
    conn.commit()
    print(
        "sqlite3: Total time for " + str(n) +
        " records " + str(time.time() - t0) + " sec")

if __name__ == '__main__':
    test_sqlalchemy_orm(100000)
    test_sqlalchemy_orm_pk_given(100000)
    test_sqlalchemy_orm_bulk_insert(100000)
    test_sqlalchemy_core(100000)
    test_sqlite3(100000)

The sqlalchemy docs have a writeup on the performance of various techniques that can be used for bulk inserts:

ORMs are basically not intended for high-performance bulk inserts – this is the whole reason SQLAlchemy offers the Core in addition to the ORM as a first-class component.

For the use case of fast bulk inserts, the SQL generation and execution system that the ORM builds on top of is part of the Core. Using this system directly, we can produce an INSERT that is competitive with using the raw database API directly.

Alternatively, the SQLAlchemy ORM offers the Bulk Operations suite of methods, which provide hooks into subsections of the unit of work process in order to emit Core-level INSERT and UPDATE constructs with a small degree of ORM-based automation.

The example below illustrates time-based tests for several different methods of inserting rows, going from the most automated to the least. With cPython 2.7, runtimes observed:

classics-MacBook-Pro:sqlalchemy classic$ python test.py
SQLAlchemy ORM: Total time for 100000 records 12.0471920967 secs
SQLAlchemy ORM pk given: Total time for 100000 records 7.06283402443 secs
SQLAlchemy ORM bulk_save_objects(): Total time for 100000 records 0.856323003769 secs
SQLAlchemy Core: Total time for 100000 records 0.485800027847 secs
sqlite3: Total time for 100000 records 0.487842082977 sec

Script:

import time
import sqlite3

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String,  create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

Base = declarative_base()
DBSession = scoped_session(sessionmaker())
engine = None


class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))


def init_sqlalchemy(dbname='sqlite:///sqlalchemy.db'):
    global engine
    engine = create_engine(dbname, echo=False)
    DBSession.remove()
    DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)


def test_sqlalchemy_orm(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in xrange(n):
        customer = Customer()
        customer.name = 'NAME ' + str(i)
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print(
        "SQLAlchemy ORM: Total time for " + str(n) +
        " records " + str(time.time() - t0) + " secs")


def test_sqlalchemy_orm_pk_given(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in xrange(n):
        customer = Customer(id=i+1, name="NAME " + str(i))
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print(
        "SQLAlchemy ORM pk given: Total time for " + str(n) +
        " records " + str(time.time() - t0) + " secs")


def test_sqlalchemy_orm_bulk_insert(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    n1 = n
    while n1 > 0:
        n1 = n1 - 10000
        DBSession.bulk_insert_mappings(
            Customer,
            [
                dict(name="NAME " + str(i))
                for i in xrange(min(10000, n1))
            ]
        )
    DBSession.commit()
    print(
        "SQLAlchemy ORM bulk_save_objects(): Total time for " + str(n) +
        " records " + str(time.time() - t0) + " secs")


def test_sqlalchemy_core(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    engine.execute(
        Customer.__table__.insert(),
        [{"name": 'NAME ' + str(i)} for i in xrange(n)]
    )
    print(
        "SQLAlchemy Core: Total time for " + str(n) +
        " records " + str(time.time() - t0) + " secs")


def init_sqlite3(dbname):
    conn = sqlite3.connect(dbname)
    c = conn.cursor()
    c.execute("DROP TABLE IF EXISTS customer")
    c.execute(
        "CREATE TABLE customer (id INTEGER NOT NULL, "
        "name VARCHAR(255), PRIMARY KEY(id))")
    conn.commit()
    return conn


def test_sqlite3(n=100000, dbname='sqlite3.db'):
    conn = init_sqlite3(dbname)
    c = conn.cursor()
    t0 = time.time()
    for i in xrange(n):
        row = ('NAME ' + str(i),)
        c.execute("INSERT INTO customer (name) VALUES (?)", row)
    conn.commit()
    print(
        "sqlite3: Total time for " + str(n) +
        " records " + str(time.time() - t0) + " sec")

if __name__ == '__main__':
    test_sqlalchemy_orm(100000)
    test_sqlalchemy_orm_pk_given(100000)
    test_sqlalchemy_orm_bulk_insert(100000)
    test_sqlalchemy_core(100000)
    test_sqlite3(100000)

回答 2

据我所知,没有办法让ORM发出批量插入。我认为根本原因是SQLAlchemy需要跟踪每个对象的身份(即新的主键),而大容量插入会对此产生干扰。例如,假设您的foo表包含一id列并映射到一个Foo类:

x = Foo(bar=1)
print x.id
# None
session.add(x)
session.flush()
# BEGIN
# INSERT INTO foo (bar) VALUES(1)
# COMMIT
print x.id
# 1

由于SQLAlchemy在x.id不发出另一个查询的情况下获取了该值,因此我们可以推断出它直接从该INSERT语句中获取了该值。如果不需要随后通过相同实例访问创建的对象,则可以跳过ORM层进行插入:

Foo.__table__.insert().execute([{'bar': 1}, {'bar': 2}, {'bar': 3}])
# INSERT INTO foo (bar) VALUES ((1,), (2,), (3,))

SQLAlchemy无法将这些新行与任何现有对象匹配,因此您必须重新查询它们以进行任何后续操作。

至于过时的数据,记住该会话没有内置的方式来了解何时在会话外更改数据库是很有帮助的。为了通过现有实例访问外部修改的数据,必须将这些实例标记为expired。默认情况下会发生这种情况session.commit(),但可以通过调用session.expire_all()或手动完成session.expire(instance)。一个例子(省略SQL):

x = Foo(bar=1)
session.add(x)
session.commit()
print x.bar
# 1
foo.update().execute(bar=42)
print x.bar
# 1
session.expire(x)
print x.bar
# 42

session.commit()expires x,因此第一个打印语句隐式打开一个新事务并重新查询x属性。如果注释掉第一个打印语句,您会注意到第二个打印语句现在会选择正确的值,因为直到更新后才会发出新查询。

从事务隔离的角度来看,这是有道理的-您只应在事务之间进行外部修改。如果这给您带来麻烦,建议您弄清或重新考虑应用程序的事务边界,而不要立即进行操作session.expire_all()

As far as I know, there is no way to get the ORM to issue bulk inserts. I believe the underlying reason is that SQLAlchemy needs to keep track of each object’s identity (i.e., new primary keys), and bulk inserts interfere with that. For example, assuming your foo table contains an id column and is mapped to a Foo class:

x = Foo(bar=1)
print x.id
# None
session.add(x)
session.flush()
# BEGIN
# INSERT INTO foo (bar) VALUES(1)
# COMMIT
print x.id
# 1

Since SQLAlchemy picked up the value for x.id without issuing another query, we can infer that it got the value directly from the INSERT statement. If you don’t need subsequent access to the created objects via the same instances, you can skip the ORM layer for your insert:

Foo.__table__.insert().execute([{'bar': 1}, {'bar': 2}, {'bar': 3}])
# INSERT INTO foo (bar) VALUES ((1,), (2,), (3,))

SQLAlchemy can’t match these new rows with any existing objects, so you’ll have to query them anew for any subsequent operations.

As far as stale data is concerned, it’s helpful to remember that the session has no built-in way to know when the database is changed outside of the session. In order to access externally modified data through existing instances, the instances must be marked as expired. This happens by default on session.commit(), but can be done manually by calling session.expire_all() or session.expire(instance). An example (SQL omitted):

x = Foo(bar=1)
session.add(x)
session.commit()
print x.bar
# 1
foo.update().execute(bar=42)
print x.bar
# 1
session.expire(x)
print x.bar
# 42

session.commit() expires x, so the first print statement implicitly opens a new transaction and re-queries x‘s attributes. If you comment out the first print statement, you’ll notice that the second one now picks up the correct value, because the new query isn’t emitted until after the update.

This makes sense from the point of view of transactional isolation – you should only pick up external modifications between transactions. If this is causing you trouble, I’d suggest clarifying or re-thinking your application’s transaction boundaries instead of immediately reaching for session.expire_all().


回答 3

我通常使用add_all

from app import session
from models import User

objects = [User(name="u1"), User(name="u2"), User(name="u3")]
session.add_all(objects)
session.commit()

I usually do it using add_all.

from app import session
from models import User

objects = [User(name="u1"), User(name="u2"), User(name="u3")]
session.add_all(objects)
session.commit()

回答 4

从0.8版开始,直接支持已添加到SQLAlchemy

根据docsconnection.execute(table.insert().values(data))应该可以解决问题。(请注意,这是一样的connection.execute(table.insert(), data)通过将呼叫这导致许多个别行插入executemany)。除了本地连接之外,其他任何方面的性能差异都可能很大。

Direct support was added to SQLAlchemy as of version 0.8

As per the docs, connection.execute(table.insert().values(data)) should do the trick. (Note that this is not the same as connection.execute(table.insert(), data) which results in many individual row inserts via a call to executemany). On anything but a local connection the difference in performance can be enormous.


回答 5

SQLAlchemy在版本中引入了该功能1.0.0

批量操作-SQLAlchemy文档

通过这些操作,您现在可以批量插入或更新!

例如(如果您希望简单表INSERT的开销最小),可以使用Session.bulk_insert_mappings()

loadme = [(1, 'a'),
          (2, 'b'),
          (3, 'c')]
dicts = [dict(bar=t[0], fly=t[1]) for t in loadme]

s = Session()
s.bulk_insert_mappings(Foo, dicts)
s.commit()

或者,如果需要,可以跳过loadme元组,直接将字典写进去dicts(但是我发现将所有的单词遗漏在数据之外并循环加载字典列表会更容易)。

SQLAlchemy introduced that in version 1.0.0:

Bulk operations – SQLAlchemy docs

With these operations, you can now do bulk inserts or updates!

For instance (if you want the lowest overhead for simple table INSERTs), you can use Session.bulk_insert_mappings():

loadme = [(1, 'a'),
          (2, 'b'),
          (3, 'c')]
dicts = [dict(bar=t[0], fly=t[1]) for t in loadme]

s = Session()
s.bulk_insert_mappings(Foo, dicts)
s.commit()

Or, if you want, skip the loadme tuples and write the dictionaries directly into dicts (but I find it easier to leave all the wordiness out of the data and load up a list of dictionaries in a loop).


回答 6

Piere的回答是正确的,但是一个问题是bulk_save_objects,如果您担心的话,默认情况下不会返回对象的主键。设置return_defaultsTrue可得到此行为。

文档在这里

foos = [Foo(bar='a',), Foo(bar='b'), Foo(bar='c')]
session.bulk_save_objects(foos, return_defaults=True)
for foo in foos:
    assert foo.id is not None
session.commit()

Piere’s answer is correct but one issue is that bulk_save_objects by default does not return the primary keys of the objects, if that is of concern to you. Set return_defaults to True to get this behavior.

The documentation is here.

foos = [Foo(bar='a',), Foo(bar='b'), Foo(bar='c')]
session.bulk_save_objects(foos, return_defaults=True)
for foo in foos:
    assert foo.id is not None
session.commit()

回答 7

条条大路通罗马,但其中一些横穿山脉,需要渡轮,但是如果您想快速到达那儿,只需上高速公路。


在这种情况下,高速公路将使用psycopg2execute_batch()功能。该文档说的最好:

当前的实现executemany()(使用非常慈善的轻描淡写)不是特别有效。这些功能可用于加快针对一组参数的语句的重复执行。通过减少服务器往返次数,性能可以比使用服务器好几个数量级。executemany()

在我自己的测试execute_batch()快2倍左右executemany(),并给出配置进行进一步的调整所以page_size的选项(如果你想挤进业绩的最后2-3%的驾驶者)。

如果使用SQLAlchemy,则可以通过use_batch_mode=True在实例化引擎时将其设置为参数来轻松启用相同功能。create_engine()

All Roads Lead to Rome, but some of them crosses mountains, requires ferries but if you want to get there quickly just take the motorway.


In this case the motorway is to use the execute_batch() feature of psycopg2. The documentation says it the best:

The current implementation of executemany() is (using an extremely charitable understatement) not particularly performing. These functions can be used to speed up the repeated execution of a statement against a set of parameters. By reducing the number of server roundtrips the performance can be orders of magnitude better than using executemany().

In my own test execute_batch() is approximately twice as fast as executemany(), and gives the option to configure the page_size for further tweaking (if you want to squeeze the last 2-3% of performance out of the driver).

The same feature can easily be enabled if you are using SQLAlchemy by setting use_batch_mode=True as a parameter when you instantiate the engine with create_engine()


回答 8

这是一种方法:

values = [1, 2, 3]
Foo.__table__.insert().execute([{'bar': x} for x in values])

这样插入:

INSERT INTO `foo` (`bar`) VALUES (1), (2), (3)

参考:SQLAlchemy FAQ包含各种提交方法的基准。

This is a way:

values = [1, 2, 3]
Foo.__table__.insert().execute([{'bar': x} for x in values])

This will insert like this:

INSERT INTO `foo` (`bar`) VALUES (1), (2), (3)

Reference: The SQLAlchemy FAQ includes benchmarks for various commit methods.


回答 9

到目前为止,我发现的最佳答案是在sqlalchemy文档中:

http://docs.sqlalchemy.org/en/latest/faq/performance.html#im-inserting-400-000-rows-with-the-orm-and-it-s-really-slow

有一个完整示例说明了可能的解决方案基准。

如文档所示:

bulk_save_objects不是最佳解决方案,但其性能是正确的。

就可读性而言,第二好的实现是我认为使用SQLAlchemy Core:

def test_sqlalchemy_core(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    engine.execute(
        Customer.__table__.insert(),
            [{"name": 'NAME ' + str(i)} for i in xrange(n)]
    )

文档文章中提供了此功能的上下文。

The best answer I found so far was in sqlalchemy documentation:

http://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow

There is a complete example of a benchmark of possible solutions.

As shown in the documentation:

bulk_save_objects is not the best solution but it performance are correct.

The second best implementation in terms of readability I think was with the SQLAlchemy Core:

def test_sqlalchemy_core(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    engine.execute(
        Customer.__table__.insert(),
            [{"name": 'NAME ' + str(i)} for i in xrange(n)]
    )

The context of this function is given in the documentation article.


如何检查python pandas中列的dtype

问题:如何检查python pandas中列的dtype

我需要使用不同的函数来处理数字列和字符串列。我现在正在做的事情真是愚蠢:

allc = list((agg.loc[:, (agg.dtypes==np.float64)|(agg.dtypes==np.int)]).columns)
for y in allc:
    treat_numeric(agg[y])    

allc = list((agg.loc[:, (agg.dtypes!=np.float64)&(agg.dtypes!=np.int)]).columns)
for y in allc:
    treat_str(agg[y])    

有没有更优雅的方法可以做到这一点?例如

for y in agg.columns:
    if(dtype(agg[y]) == 'string'):
          treat_str(agg[y])
    elif(dtype(agg[y]) != 'string'):
          treat_numeric(agg[y])

I need to use different functions to treat numeric columns and string columns. What I am doing now is really dumb:

allc = list((agg.loc[:, (agg.dtypes==np.float64)|(agg.dtypes==np.int)]).columns)
for y in allc:
    treat_numeric(agg[y])    

allc = list((agg.loc[:, (agg.dtypes!=np.float64)&(agg.dtypes!=np.int)]).columns)
for y in allc:
    treat_str(agg[y])    

Is there a more elegant way to do this? E.g.

for y in agg.columns:
    if(dtype(agg[y]) == 'string'):
          treat_str(agg[y])
    elif(dtype(agg[y]) != 'string'):
          treat_numeric(agg[y])

回答 0

您可以使用以下命令访问列的数据类型dtype

for y in agg.columns:
    if(agg[y].dtype == np.float64 or agg[y].dtype == np.int64):
          treat_numeric(agg[y])
    else:
          treat_str(agg[y])

You can access the data-type of a column with dtype:

for y in agg.columns:
    if(agg[y].dtype == np.float64 or agg[y].dtype == np.int64):
          treat_numeric(agg[y])
    else:
          treat_str(agg[y])

回答 1

pandas 0.20.2你可以这样做:

from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype

is_string_dtype(df['A'])
>>>> True

is_numeric_dtype(df['B'])
>>>> True

因此,您的代码变为:

for y in agg.columns:
    if (is_string_dtype(agg[y])):
        treat_str(agg[y])
    elif (is_numeric_dtype(agg[y])):
        treat_numeric(agg[y])

In pandas 0.20.2 you can do:

from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype

is_string_dtype(df['A'])
>>>> True

is_numeric_dtype(df['B'])
>>>> True

So your code becomes:

for y in agg.columns:
    if (is_string_dtype(agg[y])):
        treat_str(agg[y])
    elif (is_numeric_dtype(agg[y])):
        treat_numeric(agg[y])

回答 2

我知道这有点旧,但是使用熊猫19.02,您可以执行以下操作:

df.select_dtypes(include=['float64']).apply(your_function)
df.select_dtypes(exclude=['string','object']).apply(your_other_function)

http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.select_dtypes.html

I know this is a bit of an old thread but with pandas 19.02, you can do:

df.select_dtypes(include=['float64']).apply(your_function)
df.select_dtypes(exclude=['string','object']).apply(your_other_function)

http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.select_dtypes.html


回答 3

问题标题是一般性的,但问题正文中所述的作者用例是特定的。因此,可以使用任何其他答案。

但是,为了完全回答标题问题,应澄清所有方法似乎在某些情况下可能会失败,并且需要进行一些重新设计。我以降低可靠性的顺序(我认为)对所有这些(以及其他一些)进行了审查:

1.通过==(接受的答案)直接比较类型。

尽管这是公认的答案,并且投票最多,但我认为完全不应使用此方法。因为实际上,这种方法在python中不建议使用,如这里多次提到的。
但是,如果仍然想使用它-应该知道像一些熊猫专用dtypes的pd.CategoricalDTypepd.PeriodDtypepd.IntervalDtypetype( )为了正确识别dtype,这里必须使用extra :

s = pd.Series([pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')])
s
s.dtype == pd.PeriodDtype   # Not working
type(s.dtype) == pd.PeriodDtype # working 

>>> 0    2002-03-01
>>> 1    2012-02-01
>>> dtype: period[D]
>>> False
>>> True

这里的另一个警告是应该精确指出类型:

s = pd.Series([1,2])
s
s.dtype == np.int64 # Working
s.dtype == np.int32 # Not working

>>> 0    1
>>> 1    2
>>> dtype: int64
>>> True
>>> False

2. isinstance()方法。

到目前为止,尚未在答案中提及此方法。

因此,如果直接比较类型不是一个好主意-为此,请尝试使用内置的python函数,即- isinstance()
它会在一开始就失败,因为它假定我们有一些对象,但是pd.Series或者pd.DataFrame可能只用作带有预定义dtype但没有对象的空容器:

s = pd.Series([], dtype=bool)
s

>>> Series([], dtype: bool)

但是,如果有人以某种方式克服了这个问题,并且想要访问每个对象,例如,在第一行中,并像这样检查其dtype:

df = pd.DataFrame({'int': [12, 2], 'dt': [pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')]},
                  index = ['A', 'B'])
for col in df.columns:
    df[col].dtype, 'is_int64 = %s' % isinstance(df.loc['A', col], np.int64)

>>> (dtype('int64'), 'is_int64 = True')
>>> (dtype('<M8[ns]'), 'is_int64 = False')

在单列中混合类型的数据时,这将产生误导:

df2 = pd.DataFrame({'data': [12, pd.Timestamp('2013-01-02')]},
                  index = ['A', 'B'])
for col in df2.columns:
    df2[col].dtype, 'is_int64 = %s' % isinstance(df2.loc['A', col], np.int64)

>>> (dtype('O'), 'is_int64 = False')

最后但并非最不重要的一点-此方法无法直接识别Categorydtype。如文档所述

从分类数据返回单个项目也将返回值,而不是长度为“ 1”的分类。

df['int'] = df['int'].astype('category')
for col in df.columns:
    df[col].dtype, 'is_int64 = %s' % isinstance(df.loc['A', col], np.int64)

>>> (CategoricalDtype(categories=[2, 12], ordered=False), 'is_int64 = True')
>>> (dtype('<M8[ns]'), 'is_int64 = False')

因此,这种方法几乎也不适用。

3. df.dtype.kind方法。

此方法可能与空方法一起使用,pd.Series或者pd.DataFrames还有其他问题。

首先-无法区分某些dtype:

df = pd.DataFrame({'prd'  :[pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')],
                   'str'  :['s1', 's2'],
                   'cat'  :[1, -1]})
df['cat'] = df['cat'].astype('category')
for col in df:
    # kind will define all columns as 'Object'
    print (df[col].dtype, df[col].dtype.kind)

>>> period[D] O
>>> object O
>>> category O

第二,实际上我仍然不清楚,它甚至在某些dtypes返回None

4. df.select_dtypes方法。

这几乎是我们想要的。此方法在pandas内部设计,因此可以处理前面提到的大多数极端情况-空的DataFrame,与numpy或特定于pandas的dtypes完全不同。与dtype这样的单个dtype一起使用时效果很好.select_dtypes('bool')。它甚至可以用于基于dtype选择列组:

test = pd.DataFrame({'bool' :[False, True], 'int64':[-1,2], 'int32':[-1,2],'float': [-2.5, 3.4],
                     'compl':np.array([1-1j, 5]),
                     'dt'   :[pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')],
                     'td'   :[pd.Timestamp('2012-03-02')- pd.Timestamp('2016-10-20'),
                              pd.Timestamp('2010-07-12')- pd.Timestamp('2000-11-10')],
                     'prd'  :[pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')],
                     'intrv':pd.arrays.IntervalArray([pd.Interval(0, 0.1), pd.Interval(1, 5)]),
                     'str'  :['s1', 's2'],
                     'cat'  :[1, -1],
                     'obj'  :[[1,2,3], [5435,35,-52,14]]
                    })
test['int32'] = test['int32'].astype(np.int32)
test['cat'] = test['cat'].astype('category')

就像文档中所述:

test.select_dtypes('number')

>>>     int64   int32   float   compl   td
>>> 0      -1      -1   -2.5    (1-1j)  -1693 days
>>> 1       2       2    3.4    (5+0j)   3531 days

在可能会认为这里我们看到的第一个意外结果(过去对我来说是:问题)- TimeDelta被包含在输出中DataFrame。但是,正如相反的回答,应该是这样,但是必须意识到这一点。请注意,bool跳过了dtype,这对于某些人来说也是不希望的,但这是由于boolnumber位于numpy dtype的不同“ 子树 ”中。如果是布尔型,我们可以test.select_dtypes(['bool'])在这里使用。

此方法的下一个限制是,对于当前版本的Pandas(0.24.2),此代码:test.select_dtypes('period')将引发NotImplementedError

另一件事是它无法将字符串与其他对象区分开:

test.select_dtypes('object')

>>>     str     obj
>>> 0    s1     [1, 2, 3]
>>> 1    s2     [5435, 35, -52, 14]

但这首先是- 在文档中已经提到。其次-不是此方法的问题,而是字符串存储在中的方式DataFrame。但是无论如何,这种情况必须进行一些后期处理。

5. df.api.types.is_XXX_dtype方法。

我猜想这是实现dtype识别(函数所在的模块的路径本身说)的最健壮和本机的方式。它几乎可以完美地工作,但是仍然至少有一个警告,并且仍然必须以某种方式区分字符串列

此外,这可能是主观的,但是与以下方法相比,该方法还具有更多的“人类可理解”的numberdtypes组处理.select_dtypes('number')

for col in test.columns:
    if pd.api.types.is_numeric_dtype(test[col]):
        print (test[col].dtype)

>>> bool
>>> int64
>>> int32
>>> float64
>>> complex128

timedeltabool包括在内。完善。

我的管道此时恰好利用了此功能,以及一些后期处理。

输出。

希望我能够论点的主要观点-所有讨论的方法可以使用,但只能pd.DataFrame.select_dtypes()pd.api.types.is_XXX_dtype必须真正视为适用的。

Asked question title is general, but authors use case stated in the body of the question is specific. So any other answers may be used.

But in order to fully answer the title question it should be clarified that it seems like all of the approaches may fail in some cases and require some rework. I reviewed all of them (and some additional) in decreasing of reliability order (in my opinion):

1. Comparing types directly via == (accepted answer).

Despite the fact that this is accepted answer and has most upvotes count, I think this method should not be used at all. Because in fact this approach is discouraged in python as mentioned several times here.
But if one still want to use it – should be aware of some pandas-specific dtypes like pd.CategoricalDType, pd.PeriodDtype, or pd.IntervalDtype. Here one have to use extra type( ) in order to recognize dtype correctly:

s = pd.Series([pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')])
s
s.dtype == pd.PeriodDtype   # Not working
type(s.dtype) == pd.PeriodDtype # working 

>>> 0    2002-03-01
>>> 1    2012-02-01
>>> dtype: period[D]
>>> False
>>> True

Another caveat here is that type should be pointed out precisely:

s = pd.Series([1,2])
s
s.dtype == np.int64 # Working
s.dtype == np.int32 # Not working

>>> 0    1
>>> 1    2
>>> dtype: int64
>>> True
>>> False

2. isinstance() approach.

This method has not been mentioned in answers so far.

So if direct comparing of types is not a good idea – lets try built-in python function for this purpose, namely – isinstance().
It fails just in the beginning, because assumes that we have some objects, but pd.Series or pd.DataFrame may be used as just empty containers with predefined dtype but no objects in it:

s = pd.Series([], dtype=bool)
s

>>> Series([], dtype: bool)

But if one somehow overcome this issue, and wants to access each object, for example, in the first row and checks its dtype like something like that:

df = pd.DataFrame({'int': [12, 2], 'dt': [pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')]},
                  index = ['A', 'B'])
for col in df.columns:
    df[col].dtype, 'is_int64 = %s' % isinstance(df.loc['A', col], np.int64)

>>> (dtype('int64'), 'is_int64 = True')
>>> (dtype('<M8[ns]'), 'is_int64 = False')

It will be misleading in the case of mixed type of data in single column:

df2 = pd.DataFrame({'data': [12, pd.Timestamp('2013-01-02')]},
                  index = ['A', 'B'])
for col in df2.columns:
    df2[col].dtype, 'is_int64 = %s' % isinstance(df2.loc['A', col], np.int64)

>>> (dtype('O'), 'is_int64 = False')

And last but not least – this method cannot directly recognize Category dtype. As stated in docs:

Returning a single item from categorical data will also return the value, not a categorical of length “1”.

df['int'] = df['int'].astype('category')
for col in df.columns:
    df[col].dtype, 'is_int64 = %s' % isinstance(df.loc['A', col], np.int64)

>>> (CategoricalDtype(categories=[2, 12], ordered=False), 'is_int64 = True')
>>> (dtype('<M8[ns]'), 'is_int64 = False')

So this method is also almost inapplicable.

3. df.dtype.kind approach.

This method yet may work with empty pd.Series or pd.DataFrames but has another problems.

First – it is unable to differ some dtypes:

df = pd.DataFrame({'prd'  :[pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')],
                   'str'  :['s1', 's2'],
                   'cat'  :[1, -1]})
df['cat'] = df['cat'].astype('category')
for col in df:
    # kind will define all columns as 'Object'
    print (df[col].dtype, df[col].dtype.kind)

>>> period[D] O
>>> object O
>>> category O

Second, what is actually still unclear for me, it even returns on some dtypes None.

4. df.select_dtypes approach.

This is almost what we want. This method designed inside pandas so it handles most corner cases mentioned earlier – empty DataFrames, differs numpy or pandas-specific dtypes well. It works well with single dtype like .select_dtypes('bool'). It may be used even for selecting groups of columns based on dtype:

test = pd.DataFrame({'bool' :[False, True], 'int64':[-1,2], 'int32':[-1,2],'float': [-2.5, 3.4],
                     'compl':np.array([1-1j, 5]),
                     'dt'   :[pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')],
                     'td'   :[pd.Timestamp('2012-03-02')- pd.Timestamp('2016-10-20'),
                              pd.Timestamp('2010-07-12')- pd.Timestamp('2000-11-10')],
                     'prd'  :[pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')],
                     'intrv':pd.arrays.IntervalArray([pd.Interval(0, 0.1), pd.Interval(1, 5)]),
                     'str'  :['s1', 's2'],
                     'cat'  :[1, -1],
                     'obj'  :[[1,2,3], [5435,35,-52,14]]
                    })
test['int32'] = test['int32'].astype(np.int32)
test['cat'] = test['cat'].astype('category')

Like so, as stated in the docs:

test.select_dtypes('number')

>>>     int64   int32   float   compl   td
>>> 0      -1      -1   -2.5    (1-1j)  -1693 days
>>> 1       2       2    3.4    (5+0j)   3531 days

On may think that here we see first unexpected (at used to be for me: question) results – TimeDelta is included into output DataFrame. But as answered in contrary it should be so, but one have to be aware of it. Note that bool dtype is skipped, that may be also undesired for someone, but it’s due to bool and number are in different “subtrees” of numpy dtypes. In case with bool, we may use test.select_dtypes(['bool']) here.

Next restriction of this method is that for current version of pandas (0.24.2), this code: test.select_dtypes('period') will raise NotImplementedError.

And another thing is that it’s unable to differ strings from other objects:

test.select_dtypes('object')

>>>     str     obj
>>> 0    s1     [1, 2, 3]
>>> 1    s2     [5435, 35, -52, 14]

But this is, first – already mentioned in the docs. And second – is not the problem of this method, rather the way strings are stored in DataFrame. But anyway this case have to have some post processing.

5. df.api.types.is_XXX_dtype approach.

This one is intended to be most robust and native way to achieve dtype recognition (path of the module where functions resides says by itself) as i suppose. And it works almost perfectly, but still have at least one caveat and still have to somehow distinguish string columns.

Besides, this may be subjective, but this approach also has more ‘human-understandable’ number dtypes group processing comparing with .select_dtypes('number'):

for col in test.columns:
    if pd.api.types.is_numeric_dtype(test[col]):
        print (test[col].dtype)

>>> bool
>>> int64
>>> int32
>>> float64
>>> complex128

No timedelta and bool is included. Perfect.

My pipeline exploits exactly this functionality at this moment of time, plus a bit of post hand processing.

Output.

Hope I was able to argument the main point – that all discussed approaches may be used, but only pd.DataFrame.select_dtypes() and pd.api.types.is_XXX_dtype should be really considered as the applicable ones.


回答 4

如果要将数据框列的类型标记为字符串,则可以执行以下操作:

df['A'].dtype.kind

一个例子:

In [8]: df = pd.DataFrame([[1,'a',1.2],[2,'b',2.3]])
In [9]: df[0].dtype.kind, df[1].dtype.kind, df[2].dtype.kind
Out[9]: ('i', 'O', 'f')

您的代码的答案:

for y in agg.columns:
    if(agg[y].dtype.kind == 'f' or agg[y].dtype.kind == 'i'):
          treat_numeric(agg[y])
    else:
          treat_str(agg[y])

If you want to mark the type of a dataframe column as a string, you can do:

df['A'].dtype.kind

An example:

In [8]: df = pd.DataFrame([[1,'a',1.2],[2,'b',2.3]])
In [9]: df[0].dtype.kind, df[1].dtype.kind, df[2].dtype.kind
Out[9]: ('i', 'O', 'f')

The answer for your code:

for y in agg.columns:
    if(agg[y].dtype.kind == 'f' or agg[y].dtype.kind == 'i'):
          treat_numeric(agg[y])
    else:
          treat_str(agg[y])

Note:


回答 5

漂亮地打印列数据类型

在例如从文件导入后检查数据类型

def printColumnInfo(df):
    template="%-8s %-30s %s"
    print(template % ("Type", "Column Name", "Example Value"))
    print("-"*53)
    for c in df.columns:
        print(template % (df[c].dtype, c, df[c].iloc[1]) )

说明性输出:

Type     Column Name                    Example Value
-----------------------------------------------------
int64    Age                            49
object   Attrition                      No
object   BusinessTravel                 Travel_Frequently
float64  DailyRate                      279.0

To pretty print the column data types

To check the data types after, for example, an import from a file

def printColumnInfo(df):
    template="%-8s %-30s %s"
    print(template % ("Type", "Column Name", "Example Value"))
    print("-"*53)
    for c in df.columns:
        print(template % (df[c].dtype, c, df[c].iloc[1]) )

Illustrative output:

Type     Column Name                    Example Value
-----------------------------------------------------
int64    Age                            49
object   Attrition                      No
object   BusinessTravel                 Travel_Frequently
float64  DailyRate                      279.0

‘str’对象不支持Python中的项目分配

问题:’str’对象不支持Python中的项目分配

我想从字符串中读取一些字符,然后将其放入其他字符串中(就像我们在C语言中一样)。

所以我的代码如下

import string
import re
str = "Hello World"
j = 0
srr = ""
for i in str:
    srr[j] = i #'str' object does not support item assignment 
    j = j + 1
print (srr)

在C中,代码可能是

i = j = 0; 
while(str[i] != '\0')
{
srr[j++] = str [i++];
}

如何在Python中实现相同的功能?

I would like to read some characters from a string and put it into other string (Like we do in C).

So my code is like below

import string
import re
str = "Hello World"
j = 0
srr = ""
for i in str:
    srr[j] = i #'str' object does not support item assignment 
    j = j + 1
print (srr)

In C the code may be

i = j = 0; 
while(str[i] != '\0')
{
srr[j++] = str [i++];
}

How can I implement the same in Python?


回答 0

在Python中,字符串是不可变的,因此您不能就地更改其字符。

但是,您可以执行以下操作:

for i in str:
    srr += i

起作用的原因是它是以下操作的快捷方式:

for i in str:
    srr = srr + i

上面的代码在每次迭代时都会创建一个新字符串,并将对该新字符串的引用存储在中srr

In Python, strings are immutable, so you can’t change their characters in-place.

You can, however, do the following:

for i in str:
    srr += i

The reasons this works is that it’s a shortcut for:

for i in str:
    srr = srr + i

The above creates a new string with each iteration, and stores the reference to that new string in srr.


回答 1

其他答案是正确的,但是您当然可以执行以下操作:

>>> str1 = "mystring"
>>> list1 = list(str1)
>>> list1[5] = 'u'
>>> str1 = ''.join(list1)
>>> print(str1)
mystrung
>>> type(str1)
<type 'str'>

如果您真的想要。

The other answers are correct, but you can, of course, do something like:

>>> str1 = "mystring"
>>> list1 = list(str1)
>>> list1[5] = 'u'
>>> str1 = ''.join(list1)
>>> print(str1)
mystrung
>>> type(str1)
<type 'str'>

if you really want to.


回答 2

Python字符串是不可变的,因此您在C语言中尝试执行的操作在python中根本不可能实现。您将必须创建一个新字符串。

我想从字符串中读取一些字符,然后将其放入其他字符串中。

然后使用字符串切片:

>>> s1 = 'Hello world!!'
>>> s2 = s1[6:12]
>>> print s2
world!

Python strings are immutable so what you are trying to do in C will be simply impossible in python. You will have to create a new string.

I would like to read some characters from a string and put it into other string.

Then use a string slice:

>>> s1 = 'Hello world!!'
>>> s2 = s1[6:12]
>>> print s2
world!

回答 3

如aix所述-Python中的字符串是不可变的(您不能就地更改它们)。

您要尝试执行的操作可以通过多种方式完成:

# Copy the string

foo = 'Hello'
bar = foo

# Create a new string by joining all characters of the old string

new_string = ''.join(c for c in oldstring)

# Slice and copy
new_string = oldstring[:]

As aix mentioned – strings in Python are immutable (you cannot change them inplace).

What you are trying to do can be done in many ways:

# Copy the string

foo = 'Hello'
bar = foo

# Create a new string by joining all characters of the old string

new_string = ''.join(c for c in oldstring)

# Slice and copy
new_string = oldstring[:]

回答 4

如果您想将特定字符换成另一个字符,则可以采用另一种方法:

def swap(input_string):
   if len(input_string) == 0:
     return input_string
   if input_string[0] == "x":
     return "y" + swap(input_string[1:])
   else:
     return input_string[0] + swap(input_string[1:])

Another approach if you wanted to swap out a specific character for another character:

def swap(input_string):
   if len(input_string) == 0:
     return input_string
   if input_string[0] == "x":
     return "y" + swap(input_string[1:])
   else:
     return input_string[0] + swap(input_string[1:])

回答 5

该解决方案如何:

str =“ Hello World”(如问题所述)srr = str +“”

How about this solution:

str=”Hello World” (as stated in problem) srr = str+ “”


回答 6

嗨,您应该尝试使用字符串拆分方法:

i = "Hello world"
output = i.split()

j = 'is not enough'

print 'The', output[1], j

Hi you should try the string split method:

i = "Hello world"
output = i.split()

j = 'is not enough'

print 'The', output[1], j

有趣好用的Python教程

退出移动版
微信支付
请使用 微信 扫码支付