Python - How do I validate a URL in Python? (Is it malformed or not?)

Question: How do I validate a URL in Python? (Is it malformed or not?)

I have url from the user and I have to reply with the fetched HTML.

How can I check for the URL to be malformed or not?

For example :

url='google'  // Malformed
url='google.com'  // Malformed
url='http://google.com'  // Valid
url='http://google'   // Malformed

Answer 0

Django URL validation regex (source):

import re
regex = re.compile(
        r'^(?:http|ftp)s?://' # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' #domain...
        r'localhost|' #localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
        r'(?::\d+)?' # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)

print(re.match(regex, "http://www.example.com") is not None) # True
print(re.match(regex, "example.com") is not None)            # False

Answer 1

Actually, I think this is the best way.

from django.core.validators import URLValidator
from django.core.exceptions import ValidationError

val = URLValidator(verify_exists=False)
try:
    val('http://www.google.com')
except ValidationError as e:
    print(e)

If you set verify_exists to True, it will actually verify that the URL exists, otherwise it will just check if it’s formed correctly.

edit: ah yeah, this question is a duplicate of this: How can I check if a URL exists with Django’s validators?
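
For reference, here is a minimal sketch of the same idea on modern Django versions, where verify_exists has been removed and only the format is checked (assumes a configured Django project):

from django.core.validators import URLValidator
from django.core.exceptions import ValidationError

def is_valid_url(url):
    try:
        URLValidator()(url)  # the validator raises on a malformed URL
        return True
    except ValidationError:
        return False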


Answer 2

Use the validators package:

>>> import validators
>>> validators.url("http://google.com")
True
>>> validators.url("http://google")
ValidationFailure(func=url, args={'value': 'http://google', 'require_tld': True})
>>> if not validators.url("http://google"):
...     print("not valid")
... 
not valid
>>>

Install it from PyPI with pip (pip install validators).


Answer 3

A True or False version, based on @DMfll's answer:

try:
    # Python 2
    from urlparse import urlparse
except ImportError:
    # Python 3
    from urllib.parse import urlparse

a = 'http://www.cwi.nl:80/%7Eguido/Python.html'
b = '/data/Python.html'
c = 532
d = u'dkakasdkjdjakdjadjfalskdjfalk'

def uri_validator(x):
    try:
        result = urlparse(x)
        return all([result.scheme, result.netloc, result.path])
    except:
        return False

print(uri_validator(a))
print(uri_validator(b))
print(uri_validator(c))
print(uri_validator(d))

Gives:

True
False
False
False
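
One caveat worth noting (my observation, not part of the original answer): because the check also requires result.path, a URL with no path component is rejected:

print(uri_validator('http://google.com'))  # False – urlparse gives an empty path here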

Answer 4

Nowadays, I use the following, based on Padam's answer:

$ python --version
Python 3.6.5

And this is how it looks:

from urllib.parse import urlparse

def is_url(url):
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

Just use is_url("http://www.asdf.com").

Hope it helps!


Answer 5

Note – lepl is no longer supported, sorry (you're welcome to use it, and I think the code below works, but it's not going to get updates).

RFC 3696 (http://www.faqs.org/rfcs/rfc3696.html) defines how to do this (for HTTP URLs and email). I implemented its recommendations in Python using lepl (a parser library); see http://acooke.org/lepl/rfc3696.html

To use:

> easy_install lepl
...
> python
...
>>> from lepl.apps.rfc3696 import HttpUrl
>>> validator = HttpUrl()
>>> validator('google')
False
>>> validator('http://google')
False
>>> validator('http://google.com')
True

Answer 6

I landed on this page trying to figure out a sane way to validate strings as “valid” URLs. Here I share my solution, which uses Python 3. No extra libraries are required.

See https://docs.python.org/2/library/urlparse.html if you are using python2.

See https://docs.python.org/3.0/library/urllib.parse.html if you are using python3 as I am.

import urllib.parse
from pprint import pprint

invalid_url = 'dkakasdkjdjakdjadjfalskdjfalk'
valid_url = 'https://stackoverflow.com'
tokens = [urllib.parse.urlparse(url) for url in (invalid_url, valid_url)]

for token in tokens:
    pprint(token)

min_attributes = ('scheme', 'netloc')  # add attrs to your liking
for token in tokens:
    if not all([getattr(token, attr) for attr in min_attributes]):
        error = "'{url}' string has no scheme or netloc.".format(url=token.geturl())
        print(error)
    else:
        print("'{url}' is probably a valid url.".format(url=token.geturl()))

ParseResult(scheme='', netloc='', path='dkakasdkjdjakdjadjfalskdjfalk', params='', query='', fragment='')

ParseResult(scheme='https', netloc='stackoverflow.com', path='', params='', query='', fragment='')

'dkakasdkjdjakdjadjfalskdjfalk' string has no scheme or netloc.

'https://stackoverflow.com' is probably a valid url.

Here is a more concise function:

from urllib.parse import urlparse

min_attributes = ('scheme', 'netloc')


def is_valid(url, qualifying=min_attributes):
    tokens = urlparse(url)
    return all([getattr(tokens, qualifying_attr)
                for qualifying_attr in qualifying])
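
For example (my own illustrative calls, reusing the two strings from above):

print(is_valid('https://stackoverflow.com'))      # True
print(is_valid('dkakasdkjdjakdjadjfalskdjfalk'))  # False – no scheme or netloc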

Answer 7

EDIT

As pointed out by @Kwame, the code below validates the URL even if the .com or .co etc. is not present.

As also pointed out by @Blaise, a URL like https://www.google is valid, and you need to do a DNS check separately to find out whether it resolves or not.

This is simple and works:

min_attr contains the basic set of attributes that need to be present to define the validity of a URL, i.e. the http:// part and the google.com part.

result.scheme stores the scheme (http) and

result.netloc stores the domain name (google.com).

from urlparse import urlparse  # Python 3: from urllib.parse import urlparse

def url_check(url):
    min_attr = ('scheme', 'netloc')
    try:
        result = urlparse(url)
        # the URL counts as valid only if every minimum attribute has a value
        return all(getattr(result, attr) for attr in min_attr)
    except ValueError:
        return False

all() returns True only if every element inside it is truthy. So if result.scheme and result.netloc are both present, i.e. have some value, the URL is valid and the function returns True.
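
As a rough sketch of the separate DNS check mentioned above (my own addition, using only the standard library):

import socket
from urllib.parse import urlparse

def resolves(url):
    """Return True if the URL's hostname has a DNS entry."""
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False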


Answer 8

Validate URL with urllib and Django-like regex

The Django URL validation regex was actually pretty good but I needed to tweak it a little bit for my use case. Feel free to adapt it to yours!

Python 3.7

import re
import urllib.parse

# Check https://regex101.com/r/A326u1/5 for reference
DOMAIN_FORMAT = re.compile(
    r"(?:^(\w{1,255}):(.{1,255})@|^)" # http basic authentication [optional]
    r"(?:(?:(?=\S{0,253}(?:$|:))" # check full domain length to be less than or equal to 253 (starting after http basic auth, stopping before port)
    r"((?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+" # check for at least one subdomain (maximum length per subdomain: 63 characters), dashes in between allowed
    r"(?:[a-z0-9]{1,63})))" # check for top level domain, no dashes allowed
    r"|localhost)" # accept also "localhost" only
    r"(:\d{1,5})?", # port [optional]
    re.IGNORECASE
)
SCHEME_FORMAT = re.compile(
    r"^(http|hxxp|ftp|fxp)s?$", # scheme: http(s) or ftp(s)
    re.IGNORECASE
)

def validate_url(url: str):
    url = url.strip()

    if not url:
        raise Exception("No URL specified")

    if len(url) > 2048:
        raise Exception("URL exceeds its maximum length of 2048 characters (given length={})".format(len(url)))

    result = urllib.parse.urlparse(url)
    scheme = result.scheme
    domain = result.netloc

    if not scheme:
        raise Exception("No URL scheme specified")

    if not re.fullmatch(SCHEME_FORMAT, scheme):
        raise Exception("URL scheme must either be http(s) or ftp(s) (given scheme={})".format(scheme))

    if not domain:
        raise Exception("No URL domain specified")

    if not re.fullmatch(DOMAIN_FORMAT, domain):
        raise Exception("URL domain malformed (domain={})".format(domain))

    return url

Explanation

  • The code only validates the scheme and netloc parts of a given URL. (To do this properly, I split the URL with urllib.parse.urlparse() into the two corresponding parts, which are then matched against the corresponding regex terms.)
  • The netloc part stops before the first occurrence of a slash /, so port numbers are still part of the netloc, e.g.:

    https://www.google.com:80/search?q=python
    ^^^^^   ^^^^^^^^^^^^^^^^^
      |             |      
      |             +-- netloc (aka "domain" in my code)
      +-- scheme
    
  • IPv4 addresses are also validated

IPv6 Support

If you want the URL validator to also work with IPv6 addresses, do the following:

  • Add is_valid_ipv6(ip) from Markus Jarderot’s answer, which has a really good IPv6 validator regex
  • Add and not is_valid_ipv6(domain) to the last if

Examples

Some examples of the regex for the netloc (aka domain) part in action can be found at the regex101 link referenced in the code above.
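
A few illustrative calls (my own examples, not from the original answer):

print(validate_url("https://www.google.com:80/search?q=python"))  # prints the URL back – valid
print(validate_url("http://localhost:8080"))                      # also valid

try:
    validate_url("google.com")
except Exception as e:
    print(e)  # "No URL scheme specified"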


Answer 9

All of the above solutions recognize a string like “http://www.google.com/path,www.yahoo.com/path” as valid. This solution always works as it should.

import re

# URL-link validation
ip_middle_octet = u"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5]))"
ip_last_octet = u"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"

URL_PATTERN = re.compile(
                        u"^"
                        # protocol identifier
                        u"(?:(?:https?|ftp|rtsp|rtp|mmp)://)"
                        # user:pass authentication
                        u"(?:\S+(?::\S*)?@)?"
                        u"(?:"
                        u"(?P<private_ip>"
                        # IP address exclusion
                        # private & local networks
                        u"(?:localhost)|"
                        u"(?:(?:10|127)" + ip_middle_octet + u"{2}" + ip_last_octet + u")|"
                        u"(?:(?:169\.254|192\.168)" + ip_middle_octet + ip_last_octet + u")|"
                        u"(?:172\.(?:1[6-9]|2\d|3[0-1])" + ip_middle_octet + ip_last_octet + u"))"
                        u"|"
                        # IP address dotted notation octets
                        # excludes loopback network 0.0.0.0
                        # excludes reserved space >= 224.0.0.0
                        # excludes network & broadcast addresses
                        # (first & last IP address of each class)
                        u"(?P<public_ip>"
                        u"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
                        u"" + ip_middle_octet + u"{2}"
                        u"" + ip_last_octet + u")"
                        u"|"
                        # host name
                        u"(?:(?:[a-z\u00a1-\uffff0-9_-]-?)*[a-z\u00a1-\uffff0-9_-]+)"
                        # domain name
                        u"(?:\.(?:[a-z\u00a1-\uffff0-9_-]-?)*[a-z\u00a1-\uffff0-9_-]+)*"
                        # TLD identifier
                        u"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
                        u")"
                        # port number
                        u"(?::\d{2,5})?"
                        # resource path
                        u"(?:/\S*)?"
                        # query string
                        u"(?:\?\S*)?"
                        u"$",
                        re.UNICODE | re.IGNORECASE
                       )
def url_validate(url):
    """URL string validation."""
    # URL_PATTERN is already compiled above, so just match against it
    return URL_PATTERN.match(url)
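
Example calls (my own illustration; the function returns a match object or None):

print(bool(url_validate("http://www.google.com/path")))  # True
print(bool(url_validate("google.com")))                  # False – the scheme is mandatory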

Combine two lists and remove duplicates, without removing duplicates in the original list

Question: Combine two lists and remove duplicates, without removing duplicates in the original list

I have two lists that I need to combine, where any duplicates of the first list in the second list are ignored. It's a bit hard to explain, so let me show an example of what the code looks like and what I want as a result.

first_list = [1, 2, 2, 5]

second_list = [2, 5, 7, 9]

# The result of combining the two lists should result in this list:
resulting_list = [1, 2, 2, 5, 7, 9]

You'll notice that the result has the first list, including its two "2" values, but the fact that second_list also has an additional 2 and 5 is not added to the first list.

Normally for something like this I would use sets, but a set on first_list would purge the duplicate values it already has. So I'm simply wondering what the best/fastest way is to achieve this desired combination.

Thanks.


Answer 0

You need to append to the first list those elements of the second list that aren’t in the first – sets are the easiest way of determining which elements they are, like this:

first_list = [1, 2, 2, 5]
second_list = [2, 5, 7, 9]

in_first = set(first_list)
in_second = set(second_list)

in_second_but_not_in_first = in_second - in_first

result = first_list + list(in_second_but_not_in_first)
print(result)  # Prints [1, 2, 2, 5, 9, 7]

Or if you prefer one-liners 8-)

print(first_list + list(set(second_list) - set(first_list)))

Answer 1

resulting_list = list(first_list)
resulting_list.extend(x for x in second_list if x not in resulting_list)

Answer 2

You can use sets:

first_list = [1, 2, 2, 5]
second_list = [2, 5, 7, 9]

resultList= list(set(first_list) | set(second_list))

print(resultList)
# Results in : resultList = [1,2,5,7,9]

Answer 3

You can bring this down to a single line of code if you use numpy:

import numpy as np

a = [1,2,3,4,5,6,7]
b = [2,4,7,8,9,10,11,12]

sorted(np.unique(a+b))

>>> [1,2,3,4,5,6,7,8,9,10,11,12]

Answer 4

first_list = [1, 2, 2, 5]
second_list = [2, 5, 7, 9]

print( set( first_list + second_list ) )

Answer 5

resulting_list = first_list + [i for i in second_list if i not in first_list]

Answer 6

Simplest to me is:

first_list = [1, 2, 2, 5]
second_list = [2, 5, 7, 9]

merged_list = list(set(first_list+second_list))
print(merged_list)

#prints [1, 2, 5, 7, 9]

Answer 7

You can also combine RichieHindle’s and Ned Batchelder’s responses for an average-case O(m+n) algorithm that preserves order:

first_list = [1, 2, 2, 5]
second_list = [2, 5, 7, 9]

fs = set(first_list)
resulting_list = first_list + [x for x in second_list if x not in fs]

assert(resulting_list == [1, 2, 2, 5, 7, 9])

Note that x in fs (a set membership test) has a worst-case complexity of O(m), so the worst-case complexity of this code is still O(m*n).


Answer 8

This might help

def union(a,b):
    for e in b:
        if e not in a:
            a.append(e)

The union function merges the second list into the first, without duplicating an element of a if it's already in a, similar to the set union operator. The function does not change b. For example, if a = [1, 2, 3] and b = [2, 3, 4], then after union(a, b), a == [1, 2, 3, 4] and b == [2, 3, 4].
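
A quick demonstration of the in-place behaviour (my own usage example, using the values from the paragraph above):

a = [1, 2, 3]
b = [2, 3, 4]
union(a, b)
print(a)  # [1, 2, 3, 4]
print(b)  # [2, 3, 4]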


Answer 9

Based on the recipe:

resulting_list = list(set().union(first_list, second_list))


Answer 10

first_list = [1, 2, 2, 5]
second_list = [2, 5, 7, 9]

newList = []
for i in first_list:
    newList.append(i)
for z in second_list:
    if z not in newList:
        newList.append(z)
newList.sort()
print(newList)

[1, 2, 2, 5, 7, 9]


How to extract the folder path from a file path in Python?

Question: How to extract the folder path from a file path in Python?

I would like to get just the folder path from the full path to a file.

For example T:\Data\DBDesign\DBDesign_93_v141b.mdb and I would like to get just T:\Data\DBDesign (excluding the \DBDesign_93_v141b.mdb).

I have tried something like this:

existGDBPath = r'T:\Data\DBDesign\DBDesign_93_v141b.mdb'
wkspFldr = str(existGDBPath.split('\\')[0:-1])
print(wkspFldr)

but it gave me a result like this:

['T:', 'Data', 'DBDesign']

which is not the result that I require (being T:\Data\DBDesign).

Any ideas on how I can get the path to my file?


Answer 0

You were almost there with your use of the split function. You just needed to join the strings, as follows.

>>> import os
>>> '\\'.join(existGDBPath.split('\\')[0:-1])
'T:\\Data\\DBDesign'

However, I would recommend using the os.path.dirname function to do this; you just need to pass the string, and it'll do the work for you. Since you seem to be on Windows, consider using the abspath function too. An example:

>>> import os
>>> os.path.dirname(os.path.abspath(existGDBPath))
'T:\\Data\\DBDesign'

If you want both the file name and the directory path after being split, you can use the os.path.split function which returns a tuple, as follows.

>>> import os
>>> os.path.split(os.path.abspath(existGDBPath))
('T:\\Data\\DBDesign', 'DBDesign_93_v141b.mdb')

Answer 1

WITH PATHLIB MODULE (UPDATED ANSWER)

One should consider using pathlib for new development. It is in the stdlib as of Python 3.4, but available on PyPI for earlier versions. This library provides a more object-oriented way to manipulate paths <opinion> and is much easier to read and program with </opinion>.

>>> import pathlib
>>> existGDBPath = pathlib.Path(r'T:\Data\DBDesign\DBDesign_93_v141b.mdb')
>>> wkspFldr = existGDBPath.parent
>>> wkspFldr
WindowsPath('T:/Data/DBDesign')

WITH OS MODULE

Use the os.path module:

>>> import os
>>> existGDBPath = r'T:\Data\DBDesign\DBDesign_93_v141b.mdb'
>>> wkspFldr = os.path.dirname(existGDBPath)
>>> print(wkspFldr)
T:\Data\DBDesign

You can go ahead and assume that if you need to do some sort of filename manipulation it’s already been implemented in os.path. If not, you’ll still probably need to use this module as the building block.


Answer 2

The built-in submodule os.path has a function for that very task.

import os
os.path.dirname(r'T:\Data\DBDesign\DBDesign_93_v141b.mdb')

Answer 3

Here is the code:

import os
existGDBPath = r'T:\Data\DBDesign\DBDesign_93_v141b.mdb'
wkspFldr = os.path.dirname(existGDBPath)
print(wkspFldr)  # T:\Data\DBDesign

Answer 4

Here is my little utility helper for splitting a path into file and path tokens:

import os    
# usage: file, path = splitPath(s)
def splitPath(s):
    f = os.path.basename(s)
    p = s[:-(len(f))-1]
    return f, p
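
For example, on Windows (where os.path.basename splits on backslashes) this behaves roughly like:

>>> splitPath(r'T:\Data\DBDesign\DBDesign_93_v141b.mdb')
('DBDesign_93_v141b.mdb', 'T:\\Data\\DBDesign')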

Answer 5

Anyone trying to do this in the ESRI GIS Table field calculator interface can do this with the Python parser:

PathToContainingFolder =

"\\".join(!FullFilePathWithFileName!.split("\\")[0:-1])

so that

\Users\me\Desktop\New folder\file.txt

becomes

\Users\me\Desktop\New folder


How to show the full output in Jupyter, not only the last result?

Question: How to show the full output in Jupyter, not only the last result?

I want Jupyter to print all the interactive output, not only the last result, without resorting to print. How can I do that?

Example :

a=3
a
a+1

I would like to display

3
4


Answer 0

Thanks to Thomas, here is the solution I was looking for:

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
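
After running that cell, the example from the question displays every value, not just the last one:

a=3
a
a+1

3
4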

Answer 1

https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/

1) Place this code in a Jupyter cell:

from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

2) In Windows, the steps below make the change permanent. They should work for other operating systems too; you might have to change the path.

C:\Users\your_profile\.ipython\profile_default

Make an ipython_config.py file in profile_default with the following code:

c = get_config()

c.InteractiveShell.ast_node_interactivity = "all"

Answer 2

Per Notebook Basis

As others have answered, putting the following code in a Jupyter Lab or Jupyter Notebook cell will work:

from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

Permanent Change

However, if you would like to make this permanent and use Jupyter Lab, you will need to create an IPython notebook config file. Run the following command to do so (DO NOT run if you use Jupyter Notebook – more details below):

ipython profile create

If you are using Jupyter Notebook, this file should have already been created and there will be no need to run it again. In fact, running this command may overwrite your current preferences.

Once you have this file created, for Jupyter Lab and Notebook users alike, add the following code to the file C:\Users\USERNAME\.ipython\profile_default\ipython_config.py:

c.InteractiveShell.ast_node_interactivity = "all"

I found there is no need for c = get_config() in newer versions of Jupyter, but if this doesn't work for you, add c = get_config() at the beginning of the file.

For more flag options other than "all", visit this link: https://ipython.readthedocs.io/en/stable/config/options/terminal.html#configtrait-InteractiveShell.ast_node_interactivity


matplotlib: get ylim values

Question: matplotlib: get ylim values

matplotlib用来从Python 绘制数据(使用ploterrorbar函数)。我必须绘制一组完全独立的图,然后调整它们的ylim值,以便可以轻松地对其进行视觉比较。

如何ylim从每个图检索值,以便分别取下和上ylim值的最小值和最大值,并调整图以便可以对其进行直观比较?

当然,我可以分析数据并提出自己的自定义ylim值…但是我想用它matplotlib来为我做这些。关于如何轻松(高效)执行此操作的任何建议?

这是我的Python函数,使用绘制matplotlib

import matplotlib.pyplot as plt

def myplotfunction(title, values, errors, plot_file_name):

    # plot errorbars
    indices = range(0, len(values))
    fig = plt.figure()
    plt.errorbar(tuple(indices), tuple(values), tuple(errors), marker='.')

    # axes
    axes = plt.gca()
    axes.set_xlim([-0.5, len(values) - 0.5])
    axes.set_xlabel('My x-axis title')
    axes.set_ylabel('My y-axis title')

    # title
    plt.title(title)

    # save as file
    plt.savefig(plot_file_name)

    # close figure
    plt.close(fig)

I’m using matplotlib to plot data (using plot and errorbar functions) from Python. I have to plot a set of totally separate and independent plots, and then adjust their ylim values so they can be easily visually compared.

How can I retrieve the ylim values from each plot, so that I can take the min and max of the lower and upper ylim values, respectively, and adjust the plots so they can be visually compared?

Of course, I could just analyze the data and come up with my own custom ylim values… but I’d like to use matplotlib to do that for me. Any suggestions on how to easily (and efficiently) do this?

Here’s my Python function that plots using matplotlib:

import matplotlib.pyplot as plt

def myplotfunction(title, values, errors, plot_file_name):

    # plot errorbars
    indices = range(0, len(values))
    fig = plt.figure()
    plt.errorbar(tuple(indices), tuple(values), tuple(errors), marker='.')

    # axes
    axes = plt.gca()
    axes.set_xlim([-0.5, len(values) - 0.5])
    axes.set_xlabel('My x-axis title')
    axes.set_ylabel('My y-axis title')

    # title
    plt.title(title)

    # save as file
    plt.savefig(plot_file_name)

    # close figure
    plt.close(fig)

Answer 0

Just use axes.get_ylim(); it is very similar to set_ylim. From the docs:

get_ylim()

Get the y-axis range [bottom, top]
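
A minimal sketch of the comparison workflow from the question (my own example; the figures and data are illustrative):

import matplotlib.pyplot as plt

fig1, ax1 = plt.subplots()
ax1.plot([1, 2, 3], [4, 5, 6])
fig2, ax2 = plt.subplots()
ax2.plot([1, 2, 3], [40, 50, 60])

# collect the y-limits of every plot, then apply the global min/max to each
axes = [ax1, ax2]
ymins, ymaxs = zip(*(ax.get_ylim() for ax in axes))
for ax in axes:
    ax.set_ylim(min(ymins), max(ymaxs))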


Answer 1

 ymin, ymax = axes.get_ylim()

If you are using the plt api directly, you can avoid calls to the axes altogether:

def myplotfunction(title, values, errors, plot_file_name):

    # plot errorbars
    indices = range(0, len(values))
    fig = plt.figure()
    plt.errorbar(tuple(indices), tuple(values), tuple(errors), marker='.')

    plt.xlim([-0.5, len(values) - 0.5])
    plt.xlabel('My x-axis title')
    plt.ylabel('My y-axis title')

    # title
    plt.title(title)

    # save as file
    plt.savefig(plot_file_name)

    # close figure
    plt.close(fig)

Answer 2

Leveraging the good answers above, and assuming you are only using plt, as in

import matplotlib.pyplot as plt

then you can get all four plot limits using plt.axis() as in the following example.

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6, 7, 8]  # fake data
y = [1, 2, 3, 4, 3, 2, 5, 6]

plt.plot(x, y, 'k')

xmin, xmax, ymin, ymax = plt.axis()

s = 'xmin = ' + str(round(xmin, 2)) + ', ' + \
    'xmax = ' + str(xmax) + '\n' + \
    'ymin = ' + str(ymin) + ', ' + \
    'ymax = ' + str(ymax) + ' '

plt.annotate(s, (1, 5))

plt.show()

The above code should produce a plot annotated with the four axis-limit values.


Pandas DataFrame to list of lists

Question: Pandas DataFrame to list of lists

It’s easy to turn a list of lists into a pandas dataframe:

import pandas as pd
df = pd.DataFrame([[1,2,3],[3,4,5]])

But how do I turn df back into a list of lists?

lol = df.what_to_do_now?
print(lol)
# [[1,2,3],[3,4,5]]

Answer 0

You could access the underlying array and call its tolist method:

>>> df = pd.DataFrame([[1,2,3],[3,4,5]])
>>> lol = df.values.tolist()
>>> lol
[[1L, 2L, 3L], [3L, 4L, 5L]]

Answer 1

If the data has column and index labels that you want to preserve, there are a few options.

Example data:

>>> df = pd.DataFrame([[1,2,3],[3,4,5]], \
       columns=('first', 'second', 'third'), \
       index=('alpha', 'beta')) 
>>> df
       first  second  third
alpha      1       2      3
beta       3       4      5

The tolist() method described in other answers is useful but yields only the core data – which may not be enough, depending on your needs.

>>> df.values.tolist()
[[1, 2, 3], [3, 4, 5]]

One approach is to convert the DataFrame to json using df.to_json() and then parse it again. This is cumbersome but does have some advantages, because the to_json() method has some useful options.

>>> df.to_json()
{
  "first":{"alpha":1,"beta":3},
  "second":{"alpha":2,"beta":4},"third":{"alpha":3,"beta":5}
}

>>> df.to_json(orient='split')
{
 "columns":["first","second","third"],
 "index":["alpha","beta"],
 "data":[[1,2,3],[3,4,5]]
}

Cumbersome but may be useful.

The good news is that it’s pretty straightforward to build lists for the columns and rows:

>>> columns = [df.index.name] + [i for i in df.columns]
>>> rows = [[i for i in row] for row in df.itertuples()]

This yields:

>>> print(f"columns: {columns}\nrows: {rows}") 
columns: [None, 'first', 'second', 'third']
rows: [['alpha', 1, 2, 3], ['beta', 3, 4, 5]]

If None as the name of the index is bothersome, rename it:

df = df.rename_axis('stage')

Then:

>>> columns = [df.index.name] + [i for i in df.columns]
>>> print(f"columns: {columns}\nrows: {rows}") 

columns: ['stage', 'first', 'second', 'third']
rows: [['alpha', 1, 2, 3], ['beta', 3, 4, 5]]

Answer 2

I don’t know if it will fit your needs, but you can also do:

>>> lol = df.values
>>> lol
array([[1, 2, 3],
       [3, 4, 5]])

This is just a NumPy ndarray, which lets you do all the usual numpy array things.


Answer 3

I wanted to preserve the index, so I adapted the original answer to this solution:

list_df = df.reset_index().values.tolist()

Now you can paste it somewhere else (e.g. into a Stack Overflow question) and later recreate it:

df = pd.DataFrame(list_df, columns=['name1', ...])
df.set_index(['name1'], inplace=True)

Answer 4

Maybe something changed but this gave back a list of ndarrays which did what I needed.

list(df.values)

Answer 5

Note: I have seen many cases on Stack Overflow where converting a Pandas Series or DataFrame to a NumPy array or plain Python lists is entirely unnecessary. If you're new to the library, consider double-checking whether the functionality you need is already offered by those Pandas objects.

To quote a comment by @jpp:

In practice, there’s often no need to convert the NumPy array into a list of lists.


If a Pandas DataFrame/Series won’t work, you can use the built-in DataFrame.to_numpy and Series.to_numpy methods.
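
A short sketch of those methods (to_numpy() requires pandas 0.24+):

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [3, 4, 5]])
lol = df.to_numpy().tolist()
print(lol)  # [[1, 2, 3], [3, 4, 5]]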


Answer 6

This is very simple:

import numpy as np

list_of_lists = np.array(df)

Answer 7

We can use the DataFrame.iterrows() function to iterate over each of the rows of the given Dataframe and construct a list out of the data of each row:

# Empty list 
row_list =[] 

# Iterate over each row 
for index, rows in df.iterrows(): 
    # Create list for the current row 
    my_list =[rows.Date, rows.Event, rows.Cost] 

    # append the list to the final list 
    row_list.append(my_list) 

# Print 
print(row_list) 

We can successfully extract each row of the given data frame into a list.
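
A column-agnostic variant of the same loop (my own tweak, not from the original answer):

row_list = [list(rows) for _, rows in df.iterrows()]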


What is the difference between 'transform' and 'fit_transform' in sklearn?

Question: What is the difference between 'transform' and 'fit_transform' in sklearn?

In the sklearn Python toolbox, there are two functions, transform and fit_transform, on sklearn.decomposition.RandomizedPCA. The descriptions of the two functions are as follows.

But what is the difference between them?


Answer 0

The .transform method is meant for when you have already computed PCA, i.e. if you have already called its .fit method.

In [12]: pc2 = RandomizedPCA(n_components=3)

In [13]: pc2.transform(X) # can't transform because it does not know how to do it.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-e3b6b8ea2aff> in <module>()
----> 1 pc2.transform(X)

/usr/local/lib/python3.4/dist-packages/sklearn/decomposition/pca.py in transform(self, X, y)
    714         # XXX remove scipy.sparse support here in 0.16
    715         X = atleast2d_or_csr(X)
--> 716         if self.mean_ is not None:
    717             X = X - self.mean_
    718 

AttributeError: 'RandomizedPCA' object has no attribute 'mean_'

In [14]: pc2.ftransform(X) 
pc2.fit            pc2.fit_transform  

In [14]: pc2.fit_transform(X)
Out[14]: 
array([[-1.38340578, -0.2935787 ],
       [-2.22189802,  0.25133484],
       [-3.6053038 , -0.04224385],
       [ 1.38340578,  0.2935787 ],
       [ 2.22189802, -0.25133484],
       [ 3.6053038 ,  0.04224385]])
    
  

So you want to fit RandomizedPCA and then transform as:

In [20]: pca = RandomizedPCA(n_components=3)

In [21]: pca.fit(X)
Out[21]: 
RandomizedPCA(copy=True, iterated_power=3, n_components=3, random_state=None,
       whiten=False)

In [22]: pca.transform(z)
Out[22]: 
array([[ 2.76681156,  0.58715739],
       [ 1.92831932,  1.13207093],
       [ 0.54491354,  0.83849224],
       [ 5.53362311,  1.17431479],
       [ 6.37211535,  0.62940125],
       [ 7.75552113,  0.92297994]])

In [23]: 

In particular PCA .transform applies the change of basis obtained through the PCA decomposition of the matrix X to the matrix Z.


Answer 1

scikit-learn estimator api中

fit() :用于从训练数据生成学习模型参数

transform():从fit()方法生成的参数,应用于模型以生成转换后的数据集。

fit_transform()fit()transform()api在同一数据集上的组合

结帐章-4从这本书及答案从stackexchange为了更清楚

In the scikit-learn estimator API:

fit(): used for generating learning model parameters from training data.

transform(): applies the parameters generated by fit() to the model to produce a transformed data set.

fit_transform(): combination of fit() and transform() on the same data set.

Check out Chapter 4 of this book and the answer from Stack Exchange for more clarity.
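
As a minimal sketch of that relationship (my own illustration using StandardScaler, not from the original answer):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])

# fit() learns the parameters, transform() applies them...
scaler = StandardScaler().fit(X)
a = scaler.transform(X)

# ...and fit_transform() does both in one step
b = StandardScaler().fit_transform(X)

print(np.allclose(a, b))  # True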


Answer 2

These methods are used to center and feature-scale the given data. They basically help to normalize the data within a particular range.

For this, we use the Z-score method.

We do this on the training set of data.

1. fit(): calculates the parameters μ and σ and saves them as internal objects.

2. transform(): applies the transformation to a particular dataset using these calculated parameters.

3. fit_transform(): joins the fit() and transform() methods to transform the dataset.

Code snippet for Feature Scaling/Standardisation(after train_test_split).

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit on the training data, then scale it
X_test = sc.transform(X_test)        # scale the test data with the training parameters

We apply the same parameter transformation (the same two parameter values μ and σ learned from the training set) to our testing set.


Answer 3

Generic difference between the methods:

  • fit(raw_documents[, y]): Learn a vocabulary dictionary of all tokens in the raw documents.
  • fit_transform(raw_documents[, y]): Learn the vocabulary dictionary and return term-document matrix. This is equivalent to fit followed by the transform, but more efficiently implemented.
  • transform(raw_documents): Transform documents to document-term matrix. Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to the constructor.

Both fit_transform and transform return the same document-term matrix.

Source


Answer 4

Here is the basic difference between .fit() and .fit_transform():

.fit():

used in supervised learning, where we have two objects/parameters (X, y) to fit the model and make the model run, and we know what we are going to predict.

.fit_transform():

used in unsupervised learning, where we have one object/parameter (X) and we don't know what we are going to predict.


Answer 5

In layman's terms, fit_transform means to do some calculation and then do the transformation (say, calculating the means of columns from some data and then replacing the missing values). So for the training set, you need to both calculate and do the transformation.

But for the testing set, machine learning applies predictions based on what was learned from the training set, so it doesn't need to calculate; it just performs the transformation.


Answer 6

Why and when to use each one:

All the responses are quite good, but I would put the emphasis on WHY and WHEN to use each method.

fit(), transform(), fit_transform()

Usually we have a supervised learning problem with (X, y) as our dataset, and we split it into training data and test data:

import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

X_train_vectorized = model.fit_transform(X_train)
X_test_vectorized = model.transform(X_test)

Imagine we are fitting a tokenizer; if we fit on X, we are including the testing data in the tokenizer, but I have seen this error many times!

The correct approach is to fit ONLY on X_train, because you don't know “your future data”, so you cannot use the X_test data for fitting anything!

Then you can transform your test data, but separately; that's why there are different methods.

Final tip: X_train_transformed = model.fit_transform(X_train) is equivalent to: X_train_transformed = model.fit(X_train).transform(X_train), but the first one is faster.

Note that what I call “model” will usually be a scaler, a tf-idf transformer, some other kind of vectorizer, a tokenizer…


Pandas: convert some columns into rows

Question: Pandas: convert some columns into rows

So my dataset has some information by location for n dates. The problem is each date is actually a different column header. For example the CSV looks like

location    name    Jan-2010    Feb-2010    March-2010
A           "test"  12          20          30
B           "foo"   18          20          25

What I would like is for it to look like

location    name    Date        Value
A           "test"  Jan-2010    12       
A           "test"  Feb-2010    20
A           "test"  March-2010  30
B           "foo"   Jan-2010    18       
B           "foo"   Feb-2010    20
B           "foo"   March-2010  25

The problem is I don't know how many dates are in the columns (though I know they will always start after name).


Answer 0

UPDATE
As of v0.20, melt is a first-class function; you can now use

df.melt(id_vars=["location", "name"], 
        var_name="Date", 
        value_name="Value")

  location    name        Date  Value
0        A  "test"    Jan-2010     12
1        B   "foo"    Jan-2010     18
2        A  "test"    Feb-2010     20
3        B   "foo"    Feb-2010     20
4        A  "test"  March-2010     30
5        B   "foo"  March-2010     25

OLD(ER) VERSIONS: <0.20

You can use pd.melt to get most of the way there, and then sort:

>>> df
  location  name  Jan-2010  Feb-2010  March-2010
0        A  test        12        20          30
1        B   foo        18        20          25
>>> df2 = pd.melt(df, id_vars=["location", "name"], 
                  var_name="Date", value_name="Value")
>>> df2
  location  name        Date  Value
0        A  test    Jan-2010     12
1        B   foo    Jan-2010     18
2        A  test    Feb-2010     20
3        B   foo    Feb-2010     20
4        A  test  March-2010     30
5        B   foo  March-2010     25
>>> df2 = df2.sort(["location", "name"])
>>> df2
  location  name        Date  Value
0        A  test    Jan-2010     12
2        A  test    Feb-2010     20
4        A  test  March-2010     30
1        B   foo    Jan-2010     18
3        B   foo    Feb-2010     20
5        B   foo  March-2010     25

(Might want to throw in a .reset_index(drop=True), just to keep the output clean.)

Note: pd.DataFrame.sort has been deprecated in favour of pd.DataFrame.sort_values.
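For current pandas, the whole transformation can be written in one chain; a sketch, assuming df is the frame shown above:

df2 = (df.melt(id_vars=["location", "name"],
               var_name="Date", value_name="Value")
         .sort_values(["location", "name"])   # the non-deprecated replacement for sort
         .reset_index(drop=True))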


Answer 1


Use set_index with stack to get a MultiIndex Series, then convert back to a DataFrame with reset_index and rename:

df1 = (df.set_index(["location", "name"])
         .stack()
         .reset_index(name='Value')
         .rename(columns={'level_2':'Date'}))
print (df1)
  location  name        Date  Value
0        A  test    Jan-2010     12
1        A  test    Feb-2010     20
2        A  test  March-2010     30
3        B   foo    Jan-2010     18
4        B   foo    Feb-2010     20
5        B   foo  March-2010     25
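A small variation, assuming the same df: naming the columns axis up front with rename_axis avoids having to rename level_2 afterwards:

df1 = (df.set_index(["location", "name"])
         .rename_axis("Date", axis=1)  # name the columns axis before stacking
         .stack()
         .reset_index(name="Value"))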

Answer 2


I guess I found a simpler solution

temp1 = pd.melt(df1, id_vars=["location"], var_name='Date', value_name='Value')
temp2 = pd.melt(df1, id_vars=["name"], var_name='Date', value_name='Value')

Then concatenate temp1 with temp2’s name column:

temp1['new_column'] = temp2['name']

You now have what you asked for.


Answer 3


pd.wide_to_long

You can add a prefix to the date columns and then feed the frame directly to pd.wide_to_long. I won’t pretend this is efficient, but in certain situations it may be more convenient than pd.melt, e.g. when your columns already have an appropriate prefix.

df.columns = np.hstack((df.columns[:2], df.columns[2:].map(lambda x: f'Value{x}')))

res = pd.wide_to_long(df, stubnames=['Value'], i='name', j='Date').reset_index()\
        .sort_values(['location', 'name'])

print(res)

   name        Date location  Value
0  test    Jan-2010        A     12
2  test    Feb-2010        A     20
4  test  March-2010        A     30
1   foo    Jan-2010        B     18
3   foo    Feb-2010        B     20
5   foo  March-2010        B     25
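One caveat: in current pandas versions, wide_to_long defaults to a numeric suffix pattern (suffix='\d+'), so non-numeric suffixes such as Jan-2010 may need an explicit pattern; a sketch:

res = (pd.wide_to_long(df, stubnames=['Value'], i='name', j='Date',
                       suffix=r'.+')   # match non-numeric suffixes too
         .reset_index()
         .sort_values(['location', 'name']))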

String formatting in Python 3

Question: String formatting in Python 3


I do this in Python 2:

"(%d goals, $%d)" % (self.goals, self.penalties)

What is the Python 3 version of this?

I tried searching for examples online but I kept getting Python 2 versions.


Answer 0


Here are the docs about the “new” format syntax. An example would be:

"({:d} goals, ${:d})".format(self.goals, self.penalties)

If both goals and penalties are integers (i.e. their default format is ok), it could be shortened to:

"({} goals, ${})".format(self.goals, self.penalties)

And since the parameters are fields of self, there’s also a way of doing it using a single argument twice (as @Burhan Khalid noted in the comments):

"({0.goals} goals, ${0.penalties})".format(self)

Explaining:

  • {} means just the next positional argument, with default format;
  • {0} means the argument with index 0, with default format;
  • {:d} is the next positional argument, with decimal integer format;
  • {0:d} is the argument with index 0, with decimal integer format.

There are many others things you can do when selecting an argument (using named arguments instead of positional ones, accessing fields, etc) and many format options as well (padding the number, using thousands separators, showing sign or not, etc). Some other examples:

"({goals} goals, ${penalties})".format(goals=2, penalties=4)
"({goals} goals, ${penalties})".format(**self.__dict__)

"first goal: {0.goal_list[0]}".format(self)
"second goal: {.goal_list[1]}".format(self)

"conversion rate: {:.2f}".format(self.goals / self.shots) # '0.20'
"conversion rate: {:.2%}".format(self.goals / self.shots) # '20.45%'
"conversion rate: {:.0%}".format(self.goals / self.shots) # '20%'

"self: {!s}".format(self) # 'Player: Bob'
"self: {!r}".format(self) # '<__main__.Player instance at 0x00BF7260>'

"games: {:>3}".format(player1.games)  # 'games: 123'
"games: {:>3}".format(player2.games)  # 'games:   4'
"games: {:0>3}".format(player2.games) # 'games: 004'

Note: As others pointed out, the new format does not supersede the old one; both are available in Python 3 as well as in newer versions of Python 2. Some may say it’s a matter of preference, but IMHO the newer style is much more expressive and should be used when writing new code (unless it’s targeting older environments, of course).


Answer 1


Python 3.6 now supports shorthand literal string interpolation with PEP 498. For your use case, the new syntax is simply:

f"({self.goals} goals, ${self.penalties})"

This is similar to the previous .format standard, but lets one easily do things like:

>>> import decimal
>>> width = 10
>>> precision = 4
>>> value = decimal.Decimal('12.34567')
>>> f'result: {value:{width}.{precision}}'
'result:      12.35'
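For comparison, the same nested-field formatting is possible with .format, just more verbose (a sketch reusing the variables above):

>>> 'result: {value:{width}.{precision}}'.format(value=value, width=width, precision=precision)
'result:      12.35'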

Answer 2


That line works as-is in Python 3.

>>> import sys
>>> sys.version
'3.2 (r32:88445, Oct 20 2012, 14:09:29) \n[GCC 4.5.2]'
>>> "(%d goals, $%d)" % (self.goals, self.penalties)
'(1 goals, $2)'

Answer 3


I like this approach

my_hash = {}
my_hash["goals"] = 3 #to show number
my_hash["penalties"] = "5" #to show string
print("I scored %(goals)d goals and took %(penalties)s penalties" % my_hash)

Note the d and s conversion characters after the closing parentheses: d formats the number, s formats the string.

The output will be:

I scored 3 goals and took 5 penalties
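The same dict-driven style also works with the newer syntax via str.format_map; a sketch (plain {} fields take no %-style type characters, so each value is rendered with its default formatting):

my_hash = {"goals": 3, "penalties": "5"}
print("I scored {goals} goals and took {penalties} penalties".format_map(my_hash))
# I scored 3 goals and took 5 penalties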

What is the perfect counterpart in Python for "while not EOF"?

Question: What is the perfect counterpart in Python for "while not EOF"?


To read a text file in C or Pascal, I always use the following snippet to read the data until EOF:

while not eof do begin
  readline(a);
  do_something;
end;

Thus, I wonder how can I do this simple and fast in Python?


Answer 0


Loop over the file to read lines:

with open('somefile') as openfileobject:
    for line in openfileobject:
        do_something()

File objects are iterable and yield lines until EOF. Using the file object as an iterable uses a buffer to ensure performant reads.

You can do the same with stdin (no need to use raw_input()):

import sys

for line in sys.stdin:
    do_something()

To complete the picture, binary reads can be done with:

from functools import partial

with open('somefile', 'rb') as openfileobject:
    for chunk in iter(partial(openfileobject.read, 1024), b''):
        do_something()

where chunk will contain up to 1024 bytes at a time from the file, and iteration stops when openfileobject.read(1024) starts returning empty byte strings.


Answer 1


You can imitate the C idiom in Python.

To read a buffer up to max_size number of bytes, you can do this:

with open(filename, 'rb') as f:
    while True:
        buf = f.read(max_size)
        if not buf:
            break
        process(buf)

Or, a text file line by line:

# warning -- not idiomatic Python! See below...
with open(filename, 'rb') as f:
    while True:
        line = f.readline()
        if not line:
            break
        process(line)

You need to use the while True / break construct, since there is no EOF test in Python other than the lack of bytes returned from a read.

In C, you might have:

while ((ch != '\n') && (ch != EOF)) {
   // read the next ch and add to a buffer
   // ..
}

However, you cannot have this in Python:

 while (line = f.readline()):
     # syntax error

because assignments are not allowed in expressions in Python (although recent versions of Python can mimic this using assignment expressions, see below).

It is certainly more idiomatic in Python to do this:

# THIS IS IDIOMATIC Python. Do this:
with open('somefile') as f:
    for line in f:
        process(line)

Update: Since Python 3.8 you may also use assignment expressions:

 while line := f.readline():
     process(line)

Answer 2


The Python idiom for opening a file and reading it line-by-line is:

with open('filename') as f:
    for line in f:
        do_something(line)

The file will be automatically closed at the end of the above code (the with construct takes care of that).

Finally, it is worth noting that line will preserve the trailing newline. This can be easily removed using:

line = line.rstrip()
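Note that a bare rstrip() also removes trailing spaces and tabs; to strip only the newline, pass it explicitly:

line = line.rstrip('\n')  # removes the trailing newline but keeps other whitespace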

Answer 3


You can use the code snippet below to read line by line, until the end of the file (assuming obj is an open file object):

line = obj.readline()
while line != '':
    # Do Something
    line = obj.readline()

Answer 4


While there are suggestions above for “doing it the Python way”, if one really wants logic based on EOF, then I suppose exception handling is the way to do it:

try:
    line = raw_input()
    ... whatever needs to be done incase of no EOF ...
except EOFError:
    ... whatever needs to be done incase of EOF ...

Example:

$ echo test | python -c "while True: print raw_input()"
test
Traceback (most recent call last):
  File "<string>", line 1, in <module> 
EOFError: EOF when reading a line

Or press the EOF key at a raw_input() prompt (Ctrl-Z on Windows, Ctrl-D on Linux).
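In Python 3, raw_input() was renamed to input(), so the same EOF-driven loop would look like this (a sketch):

try:
    while True:
        line = input()
        # ... process line ...
except EOFError:
    pass  # end of input reached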


Answer 5


You can use the following snippet: readlines() reads in the whole file at once and splits it into a list of lines.

lines = obj.readlines()
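Since readlines() returns a list, you would then iterate over it; keep in mind that the whole file is held in memory, so this only suits smaller files. A minimal sketch, with do_something as a placeholder:

with open('somefile') as obj:
    for line in obj.readlines():
        do_something(line)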

Answer 6


In addition to @dawg’s great answer, here is the equivalent solution using the walrus operator (Python >= 3.8):

with open(filename, 'rb') as f:
    while buf := f.read(max_size):
        process(buf)
