Python 实用宝典

Question 1

I have url from the user and I have to reply with the fetched HTML.

How can I check for the URL to be malformed or not?

For example :

url='google'  // Malformed
url='google.com'  // Malformed
url='http://google.com'  // Valid
url='http://google'   // Malformed

Question 2

django url validation regex (source):

import re
regex = re.compile(
        r'^(?:http|ftp)s?://' # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' #domain...
        r'localhost|' #localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
        r'(?::\d+)?' # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)

print(re.match(regex, "http://www.example.com") is not None) # True
print(re.match(regex, "example.com") is not None)            # False

Question 3

Actually, I think this is the best way.

from django.core.validators import URLValidator
from django.core.exceptions import ValidationError

val = URLValidator(verify_exists=False)
try:
    val('http://www.google.com')
except ValidationError, e:
    print e

If you set verify_exists to True, it will actually verify that the URL exists, otherwise it will just check if it’s formed correctly.

edit: ah yeah, this question is a duplicate of this: How can I check if a URL exists with Django’s validators?

Question 4

Use the validators package:

>>> import validators
>>> validators.url("http://google.com")
True
>>> validators.url("http://google")
ValidationFailure(func=url, args={'value': 'http://google', 'require_tld': True})
>>> if not validators.url("http://google"):
...     print "not valid"
... 
not valid
>>>

Install it from PyPI with pip (pip install validators).

Question 5

A True or False version, based on @DMfll answer:

try:
    # python2
    from urlparse import urlparse
except:
    # python3
    from urllib.parse import urlparse

a = 'http://www.cwi.nl:80/%7Eguido/Python.html'
b = '/data/Python.html'
c = 532
d = u'dkakasdkjdjakdjadjfalskdjfalk'

def uri_validator(x):
    try:
        result = urlparse(x)
        return all([result.scheme, result.netloc, result.path])
    except:
        return False

print(uri_validator(a))
print(uri_validator(b))
print(uri_validator(c))
print(uri_validator(d))

Gives:

True
False
False
False

Question 6

Nowadays, I use the following, based on the Padam’s answer:

$ python --version
Python 3.6.5

And this is how it looks:

from urllib.parse import urlparse

def is_url(url):
  try:
    result = urlparse(url)
    return all([result.scheme, result.netloc])
  except ValueError:
    return False

Just use is_url("http://www.asdf.com").

Hope it helps!

Question 7

note – lepl is no longer supported, sorry (you’re welcome to use it, and i think the code below works, but it’s not going to get updates).

rfc 3696 http://www.faqs.org/rfcs/rfc3696.html defines how to do this (for http urls and email). i implemented its recommendations in python using lepl (a parser library). see http://acooke.org/lepl/rfc3696.html

to use:

> easy_install lepl
...
> python
...
>>> from lepl.apps.rfc3696 import HttpUrl
>>> validator = HttpUrl()
>>> validator('google')
False
>>> validator('http://google')
False
>>> validator('http://google.com')
True

Question 8

I landed on this page trying to figure out a sane way to validate strings as “valid” urls. I share here my solution using python3. No extra libraries required.

See https://docs.python.org/2/library/urlparse.html if you are using python2.

See https://docs.python.org/3.0/library/urllib.parse.html if you are using python3 as I am.

import urllib
from pprint import pprint

invalid_url = 'dkakasdkjdjakdjadjfalskdjfalk'
valid_url = 'https://stackoverflow.com'
tokens = [urllib.parse.urlparse(url) for url in (invalid_url, valid_url)]

for token in tokens:
    pprint(token)

min_attributes = ('scheme', 'netloc')  # add attrs to your liking
for token in tokens:
    if not all([getattr(token, attr) for attr in min_attributes]):
        error = "'{url}' string has no scheme or netloc.".format(url=token.geturl())
        print(error)
    else:
        print("'{url}' is probably a valid url.".format(url=token.geturl()))

ParseResult(scheme=”, netloc=”, path=’dkakasdkjdjakdjadjfalskdjfalk’, params=”, query=”, fragment=”)

ParseResult(scheme=’https’, netloc=’stackoverflow.com’, path=”, params=”, query=”, fragment=”)

‘dkakasdkjdjakdjadjfalskdjfalk’ string has no scheme or netloc.

‘https://stackoverflow.com‘ is probably a valid url.

Here is a more concise function:

from urllib.parse import urlparse

min_attributes = ('scheme', 'netloc')


def is_valid(url, qualifying=min_attributes):
    tokens = urlparse(url)
    return all([getattr(tokens, qualifying_attr)
                for qualifying_attr in qualifying])

Question 9

EDIT

As pointed out by @Kwame , the below code does validate the url even if the .com or .co etc are not present.

also pointed out by @Blaise, URLs like https://www.google is a valid URL and you need to do a DNS check for checking if it resolves or not, separately.

This is simple and works:

So min_attr contains the basic set of strings that needs to be present to define the validity of a URL, i.e http:// part and google.com part.

urlparse.scheme stores http:// and

urlparse.netloc store the domain name google.com

from urlparse import urlparse
def url_check(url):

    min_attr = ('scheme' , 'netloc')
    try:
        result = urlparse(url)
        if all([result.scheme, result.netloc]):
            return True
        else:
            return False
    except:
        return False

all() returns true if all the variables inside it return true. So if result.scheme and result.netloc is present i.e. has some value then the URL is valid and hence returns True.

Question 10

Validate URL with `urllib` and Django-like regex

The Django URL validation regex was actually pretty good but I needed to tweak it a little bit for my use case. Feel free to adapt it to yours!

Python 3.7

import re
import urllib

# Check https://regex101.com/r/A326u1/5 for reference
DOMAIN_FORMAT = re.compile(
    r"(?:^(\w{1,255}):(.{1,255})@|^)" # http basic authentication [optional]
    r"(?:(?:(?=\S{0,253}(?:$|:))" # check full domain length to be less than or equal to 253 (starting after http basic auth, stopping before port)
    r"((?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+" # check for at least one subdomain (maximum length per subdomain: 63 characters), dashes in between allowed
    r"(?:[a-z0-9]{1,63})))" # check for top level domain, no dashes allowed
    r"|localhost)" # accept also "localhost" only
    r"(:\d{1,5})?", # port [optional]
    re.IGNORECASE
)
SCHEME_FORMAT = re.compile(
    r"^(http|hxxp|ftp|fxp)s?$", # scheme: http(s) or ftp(s)
    re.IGNORECASE
)

def validate_url(url: str):
    url = url.strip()

    if not url:
        raise Exception("No URL specified")

    if len(url) > 2048:
        raise Exception("URL exceeds its maximum length of 2048 characters (given length={})".format(len(url)))

    result = urllib.parse.urlparse(url)
    scheme = result.scheme
    domain = result.netloc

    if not scheme:
        raise Exception("No URL scheme specified")

    if not re.fullmatch(SCHEME_FORMAT, scheme):
        raise Exception("URL scheme must either be http(s) or ftp(s) (given scheme={})".format(scheme))

    if not domain:
        raise Exception("No URL domain specified")

    if not re.fullmatch(DOMAIN_FORMAT, domain):
        raise Exception("URL domain malformed (domain={})".format(domain))

    return url

Explanation

The code only validates the scheme and netloc part of a given URL. (To do this properly, I split the URL with urllib.parse.urlparse() in the two according parts which are then matched with the corresponding regex terms.)

The netloc part stops before the first occurrence of a slash /, so port numbers are still part of the netloc, e.g.:

https://www.google.com:80/search?q=python
^^^^^   ^^^^^^^^^^^^^^^^^
  |             |      
  |             +-- netloc (aka "domain" in my code)
  +-- scheme

IPv4 addresses are also validated

IPv6 Support

If you want the URL validator to also work with IPv6 addresses, do the following:

Add is_valid_ipv6(ip) from Markus Jarderot’s answer, which has a really good IPv6 validator regex
Add and not is_valid_ipv6(domain) to the last if

Examples

Here are some examples of the regex for the netloc (aka domain) part in action:

IPv4 and alphanumeric: https://regex101.com/r/A326u1/5
IPv6: https://regex101.com/r/lKIIgq/1 (with the regex from Markus Jarderot’s answer)

Question 11

All of the above solutions recognize a string like “http://www.google.com/path,www.yahoo.com/path” as valid. This solution always works as it should

import re

# URL-link validation
ip_middle_octet = u"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5]))"
ip_last_octet = u"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"

URL_PATTERN = re.compile(
                        u"^"
                        # protocol identifier
                        u"(?:(?:https?|ftp|rtsp|rtp|mmp)://)"
                        # user:pass authentication
                        u"(?:\S+(?::\S*)?@)?"
                        u"(?:"
                        u"(?P<private_ip>"
                        # IP address exclusion
                        # private & local networks
                        u"(?:localhost)|"
                        u"(?:(?:10|127)" + ip_middle_octet + u"{2}" + ip_last_octet + u")|"
                        u"(?:(?:169\.254|192\.168)" + ip_middle_octet + ip_last_octet + u")|"
                        u"(?:172\.(?:1[6-9]|2\d|3[0-1])" + ip_middle_octet + ip_last_octet + u"))"
                        u"|"
                        # IP address dotted notation octets
                        # excludes loopback network 0.0.0.0
                        # excludes reserved space >= 224.0.0.0
                        # excludes network & broadcast addresses
                        # (first & last IP address of each class)
                        u"(?P<public_ip>"
                        u"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
                        u"" + ip_middle_octet + u"{2}"
                        u"" + ip_last_octet + u")"
                        u"|"
                        # host name
                        u"(?:(?:[a-z\u00a1-\uffff0-9_-]-?)*[a-z\u00a1-\uffff0-9_-]+)"
                        # domain name
                        u"(?:\.(?:[a-z\u00a1-\uffff0-9_-]-?)*[a-z\u00a1-\uffff0-9_-]+)*"
                        # TLD identifier
                        u"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
                        u")"
                        # port number
                        u"(?::\d{2,5})?"
                        # resource path
                        u"(?:/\S*)?"
                        # query string
                        u"(?:\?\S*)?"
                        u"$",
                        re.UNICODE | re.IGNORECASE
                       )
def url_validate(url):   
    """ URL string validation
    """                                                                                                                                                      
    return re.compile(URL_PATTERN).match(url)

Question 12

I have two lists that i need to combine where the second list has any duplicates of the first list ignored. .. A bit hard to explain, so let me show an example of what the code looks like, and what i want as a result.

first_list = [1, 2, 2, 5]

second_list = [2, 5, 7, 9]

# The result of combining the two lists should result in this list:
resulting_list = [1, 2, 2, 5, 7, 9]

You’ll notice that the result has the first list, including its two “2” values, but the fact that second_list also has an additional 2 and 5 value is not added to the first list.

Normally for something like this i would use sets, but a set on first_list would purge the duplicate values it already has. So i’m simply wondering what the best/fastest way to achieve this desired combination.

Thanks.

Question 13

You need to append to the first list those elements of the second list that aren’t in the first – sets are the easiest way of determining which elements they are, like this:

first_list = [1, 2, 2, 5]
second_list = [2, 5, 7, 9]

in_first = set(first_list)
in_second = set(second_list)

in_second_but_not_in_first = in_second - in_first

result = first_list + list(in_second_but_not_in_first)
print(result)  # Prints [1, 2, 2, 5, 9, 7]

Or if you prefer one-liners 8-)

print(first_list + list(set(second_list) - set(first_list)))

Question 14

resulting_list = list(first_list)
resulting_list.extend(x for x in second_list if x not in resulting_list)

Question 15

You can use sets:

first_list = [1, 2, 2, 5]
second_list = [2, 5, 7, 9]

resultList= list(set(first_list) | set(second_list))

print(resultList)
# Results in : resultList = [1,2,5,7,9]

Question 16

You can bring this down to one single line of code if you use numpy:

a = [1,2,3,4,5,6,7]
b = [2,4,7,8,9,10,11,12]

sorted(np.unique(a+b))

>>> [1,2,3,4,5,6,7,8,9,10,11,12]

Question 17

first_list = [1, 2, 2, 5]
second_list = [2, 5, 7, 9]

print( set( first_list + second_list ) )

Question 18

resulting_list = first_list + [i for i in second_list if i not in first_list]

Question 19

Simplest to me is:

first_list = [1, 2, 2, 5]
second_list = [2, 5, 7, 9]

merged_list = list(set(first_list+second_list))
print(merged_list)

#prints [1, 2, 5, 7, 9]

Question 20

You can also combine RichieHindle’s and Ned Batchelder’s responses for an average-case O(m+n) algorithm that preserves order:

first_list = [1, 2, 2, 5]
second_list = [2, 5, 7, 9]

fs = set(first_list)
resulting_list = first_list + [x for x in second_list if x not in fs]

assert(resulting_list == [1, 2, 2, 5, 7, 9])

Note that x in s has a worst-case complexity of O(m), so the worst-case complexity of this code is still O(m*n).

Question 21

This might help

def union(a,b):
    for e in b:
        if e not in a:
            a.append(e)

The union function merges the second list into first, with out duplicating an element of a, if it’s already in a. Similar to set union operator. This function does not change b. If a=[1,2,3] b=[2,3,4]. After union(a,b) makes a=[1,2,3,4] and b=[2,3,4]

Question 22

Based on the recipe :

resulting_list = list(set().union(first_list, second_list))

Question 23

    first_list = [1, 2, 2, 5]
    second_list = [2, 5, 7, 9]

    newList=[]
    for i in first_list:
        newList.append(i)
    for z in second_list:
        if z not in newList:
            newList.append(z)
    newList.sort()
    print newList

[1, 2, 2, 5, 7, 9]

Question 24

I would like to get just the folder path from the full path to a file.

For example T:\Data\DBDesign\DBDesign_93_v141b.mdb and I would like to get just T:\Data\DBDesign (excluding the \DBDesign_93_v141b.mdb).

I have tried something like this:

existGDBPath = r'T:\Data\DBDesign\DBDesign_93_v141b.mdb'
wkspFldr = str(existGDBPath.split('\\')[0:-1])
print wkspFldr

but it gave me a result like this:

['T:', 'Data', 'DBDesign']

which is not the result that I require (being T:\Data\DBDesign).

Any ideas on how I can get the path to my file?

Question 25

You were almost there with your use of the split function. You just needed to join the strings, like follows.

>>> import os
>>> '\\'.join(existGDBPath.split('\\')[0:-1])
'T:\\Data\\DBDesign'

Although, I would recommend using the os.path.dirname function to do this, you just need to pass the string, and it’ll do the work for you. Since, you seem to be on windows, consider using the abspath function too. An example:

>>> import os
>>> os.path.dirname(os.path.abspath(existGDBPath))
'T:\\Data\\DBDesign'

If you want both the file name and the directory path after being split, you can use the os.path.split function which returns a tuple, as follows.

>>> import os
>>> os.path.split(os.path.abspath(existGDBPath))
('T:\\Data\\DBDesign', 'DBDesign_93_v141b.mdb')

Question 26

WITH PATHLIB MODULE (UPDATED ANSWER)

One should consider using pathlib for new development. It is in the stdlib for Python3.4, but available on PyPI for earlier versions. This library provides a more object-orented method to manipulate paths <opinion> and is much easier read and program with </opinion>.

>>> import pathlib
>>> existGDBPath = pathlib.Path(r'T:\Data\DBDesign\DBDesign_93_v141b.mdb')
>>> wkspFldr = existGDBPath.parent
>>> print wkspFldr
Path('T:\Data\DBDesign')

WITH OS MODULE

Use the os.path module:

>>> import os
>>> existGDBPath = r'T:\Data\DBDesign\DBDesign_93_v141b.mdb'
>>> wkspFldr = os.path.dirname(existGDBPath)
>>> print wkspFldr 
'T:\Data\DBDesign'

You can go ahead and assume that if you need to do some sort of filename manipulation it’s already been implemented in os.path. If not, you’ll still probably need to use this module as the building block.

Question 27

The built-in submodule os.path has a function for that very task.

import os
os.path.dirname('T:\Data\DBDesign\DBDesign_93_v141b.mdb')

Question 28

Here is the code:

import os
existGDBPath = r'T:\Data\DBDesign\DBDesign_93_v141b.mdb'
wkspFldr = os.path.dirname(existGDBPath)
print wkspFldr # T:\Data\DBDesign

Question 29

Here is my little utility helper for splitting paths int file, path tokens:

import os    
# usage: file, path = splitPath(s)
def splitPath(s):
    f = os.path.basename(s)
    p = s[:-(len(f))-1]
    return f, p

Question 30

Anyone trying to do this in the ESRI GIS Table field calculator interface can do this with the Python parser:

PathToContainingFolder =

"\\".join(!FullFilePathWithFileName!.split("\\")[0:-1])

so that

\Users\me\Desktop\New folder\file.txt

becomes

\Users\me\Desktop\New folder

Question 31

I want Jupyter to print all the interactive output without resorting to print, not only the last result. How to do it?

Example :

a=3
a
a+1

I would like to display

3
4

Question 32

Thanks to Thomas, here is the solution I was looking for:

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Question 33

https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/

1) Place this code in a Jupyter cell:

from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

2) In Windows, the steps below makes the change permanent. Should work for other operating systems. You might have to change the path.

C:\Users\your_profile\\.ipython\profile_default

Make a ipython_config.py file in the profile_defaults with the following code:

c = get_config()

c.InteractiveShell.ast_node_interactivity = "all"

Question 34

Per Notebook Basis

As others have answered, putting the following code in a Jupyter Lab or Jupyter Notebook cell will work:

from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

Permanent Change

However, if you would like to make this permanent and use Jupyter Lab, you will need to create an IPython notebook config file. Run the following command to do so (DO NOT run if you use Jupyter Notebook – more details below):

ipython profile create

If you are using Jupyter Notebook, this file should have already been created and there will be no need to run it again. In fact, running this command may overwrite your current preferences.

Once you have this file created, for Jupyter Lab and Notebook users alike, add the following code to the file C:\Users\USERNAME\\.ipython\profile_default\ipython_config.py:

c.InteractiveShell.ast_node_interactivity = "all"

I found there is no need for c = get_config() in the newer versions of Jupyter, but if this doesn’t work for you, add the c = get_config() to the beginning of the file.

For more flag options other than "all", visit this link: https://ipython.readthedocs.io/en/stable/config/options/terminal.html#configtrait-InteractiveShell.ast_node_interactivity

Question 35

I’m using matplotlib to plot data (using plot and errorbar functions) from Python. I have to plot a set of totally separate and independent plots, and then adjust their ylim values so they can be easily visually compared.

How can I retrieve the ylim values from each plot, so that I can take the min and max of the lower and upper ylim values, respectively, and adjust the plots so they can be visually compared?

Of course, I could just analyze the data and come up with my own custom ylim values… but I’d like to use matplotlib to do that for me. Any suggestions on how to easily (and efficiently) do this?

Here’s my Python function that plots using matplotlib:

import matplotlib.pyplot as plt

def myplotfunction(title, values, errors, plot_file_name):

    # plot errorbars
    indices = range(0, len(values))
    fig = plt.figure()
    plt.errorbar(tuple(indices), tuple(values), tuple(errors), marker='.')

    # axes
    axes = plt.gca()
    axes.set_xlim([-0.5, len(values) - 0.5])
    axes.set_xlabel('My x-axis title')
    axes.set_ylabel('My y-axis title')

    # title
    plt.title(title)

    # save as file
    plt.savefig(plot_file_name)

    # close figure
    plt.close(fig)

Question 36

Just use axes.get_ylim(), it is very similar to set_ylim. From the docs:

get_ylim()

Get the y-axis range [bottom, top]

Question 37

 ymin, ymax = axes.get_ylim()

If you are using the plt api directly, you can avoid calls to the axes altogether:

def myplotfunction(title, values, errors, plot_file_name):

    # plot errorbars
    indices = range(0, len(values))
    fig = plt.figure()
    plt.errorbar(tuple(indices), tuple(values), tuple(errors), marker='.')

    plt.xlim([-0.5, len(values) - 0.5])
    plt.xlabel('My x-axis title')
    plt.ylabel('My y-axis title')

    # title
    plt.title(title)

    # save as file
    plt.savefig(plot_file_name)

   # close figure
    plt.close(fig)

Question 38

Leveraging from the good answers above and assuming you were only using plt as in

import matplotlib.pyplot as plt

then you can get all four plot limits using plt.axis() as in the following example.

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6, 7, 8]  # fake data
y = [1, 2, 3, 4, 3, 2, 5, 6]

plt.plot(x, y, 'k')

xmin, xmax, ymin, ymax = plt.axis()

s = 'xmin = ' + str(round(xmin, 2)) + ', ' + \
    'xmax = ' + str(xmax) + '\n' + \
    'ymin = ' + str(ymin) + ', ' + \
    'ymax = ' + str(ymax) + ' '

plt.annotate(s, (1, 5))

plt.show()

The above code should produce the following output plot.

Question 39

It’s easy to turn a list of lists into a pandas dataframe:

import pandas as pd
df = pd.DataFrame([[1,2,3],[3,4,5]])

But how do I turn df back into a list of lists?

lol = df.what_to_do_now?
print lol
# [[1,2,3],[3,4,5]]

Question 40

You could access the underlying array and call its tolist method:

>>> df = pd.DataFrame([[1,2,3],[3,4,5]])
>>> lol = df.values.tolist()
>>> lol
[[1L, 2L, 3L], [3L, 4L, 5L]]

Question 41

If the data has column and index labels that you want to preserve, there are a few options.

Example data:

>>> df = pd.DataFrame([[1,2,3],[3,4,5]], \
       columns=('first', 'second', 'third'), \
       index=('alpha', 'beta')) 
>>> df
       first  second  third
alpha      1       2      3
beta       3       4      5

The tolist() method described in other answers is useful but yields only the core data – which may not be enough, depending on your needs.

>>> df.values.tolist()
[[1, 2, 3], [3, 4, 5]]

One approach is to convert the DataFrame to json using df.to_json() and then parse it again. This is cumbersome but does have some advantages, because the to_json() method has some useful options.

>>> df.to_json()
{
  "first":{"alpha":1,"beta":3},
  "second":{"alpha":2,"beta":4},"third":{"alpha":3,"beta":5}
}

>>> df.to_json(orient='split')
{
 "columns":["first","second","third"],
 "index":["alpha","beta"],
 "data":[[1,2,3],[3,4,5]]
}

Cumbersome but may be useful.

The good news is that it’s pretty straightforward to build lists for the columns and rows:

>>> columns = [df.index.name] + [i for i in df.columns]
>>> rows = [[i for i in row] for row in df.itertuples()]

This yields:

>>> print(f"columns: {columns}\nrows: {rows}") 
columns: [None, 'first', 'second', 'third']
rows: [['alpha', 1, 2, 3], ['beta', 3, 4, 5]]

If the None as the name of the index is bothersome, rename it:

df = df.rename_axis('stage')

Then:

>>> columns = [df.index.name] + [i for i in df.columns]
>>> print(f"columns: {columns}\nrows: {rows}") 

columns: ['stage', 'first', 'second', 'third']
rows: [['alpha', 1, 2, 3], ['beta', 3, 4, 5]]

Question 42

I don’t know if it will fit your needs, but you can also do:

>>> lol = df.values
>>> lol
array([[1, 2, 3],
       [3, 4, 5]])

This is just a numpy array from the ndarray module, which lets you do all the usual numpy array things.

Question 43

I wanted to preserve the index, so I adapted the original answer to this solution:

list_df = df.reset_index().values.tolist()

Now you can paste it somewhere else (e.g. to paste into a Stack Overflow question) and latter recreate it:

pd.Dataframe(list_df, columns=['name1', ...])
pd.set_index(['name1'], inplace=True)

Question 44

Maybe something changed but this gave back a list of ndarrays which did what I needed.

list(df.values)

Question 45

Note: I have seen many cases on Stack Overflow where converting a Pandas Series or DataFrame to a NumPy array or plain Python lists is entirely unecessary. If you’re new to the library, consider double-checking whether the functionality you need is already offered by those Pandas objects.

To quote a comment by @jpp:

In practice, there’s often no need to convert the NumPy array into a list of lists.

If a Pandas DataFrame/Series won’t work, you can use the built-in DataFrame.to_numpy and Series.to_numpy methods.

Question 46

This is very simple:

import numpy as np

list_of_lists = np.array(df)

Question 47

We can use the DataFrame.iterrows() function to iterate over each of the rows of the given Dataframe and construct a list out of the data of each row:

# Empty list 
row_list =[] 

# Iterate over each row 
for index, rows in df.iterrows(): 
    # Create list for the current row 
    my_list =[rows.Date, rows.Event, rows.Cost] 

    # append the list to the final list 
    row_list.append(my_list) 

# Print 
print(row_list)

We can successfully extract each row of the given data frame into a list

Question 48

In the sklearn-python toolbox, there are two functions transform and fit_transform about sklearn.decomposition.RandomizedPCA. The description of two functions are as follows

But what is the difference between them ?

Question 49

The .transform method is meant for when you have already computed PCA, i.e. if you have already called its .fit method.

In [12]: pc2 = RandomizedPCA(n_components=3)

In [13]: pc2.transform(X) # can't transform because it does not know how to do it.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-e3b6b8ea2aff> in <module>()
----> 1 pc2.transform(X)

/usr/local/lib/python3.4/dist-packages/sklearn/decomposition/pca.py in transform(self, X, y)
    714         # XXX remove scipy.sparse support here in 0.16
    715         X = atleast2d_or_csr(X)
--> 716         if self.mean_ is not None:
    717             X = X - self.mean_
    718 

AttributeError: 'RandomizedPCA' object has no attribute 'mean_'

In [14]: pc2.ftransform(X) 
pc2.fit            pc2.fit_transform  

In [14]: pc2.fit_transform(X)
Out[14]: 
array([[-1.38340578, -0.2935787 ],
       [-2.22189802,  0.25133484],
       [-3.6053038 , -0.04224385],
       [ 1.38340578,  0.2935787 ],
       [ 2.22189802, -0.25133484],
       [ 3.6053038 ,  0.04224385]])

So you want to fit RandomizedPCA and then transform as:

In [20]: pca = RandomizedPCA(n_components=3)

In [21]: pca.fit(X)
Out[21]: 
RandomizedPCA(copy=True, iterated_power=3, n_components=3, random_state=None,
       whiten=False)

In [22]: pca.transform(z)
Out[22]: 
array([[ 2.76681156,  0.58715739],
       [ 1.92831932,  1.13207093],
       [ 0.54491354,  0.83849224],
       [ 5.53362311,  1.17431479],
       [ 6.37211535,  0.62940125],
       [ 7.75552113,  0.92297994]])

In [23]:

In particular PCA .transform applies the change of basis obtained through the PCA decomposition of the matrix X to the matrix Z.

Question 50

In scikit-learn estimator api,

fit() : used for generating learning model parameters from training data

transform() : parameters generated from fit() method,applied upon model to generate transformed data set.

fit_transform() : combination of fit() and transform() api on same data set

Checkout Chapter-4 from this book & answer from stackexchange for more clarity

Question 51

These methods are used to center/feature scale of a given data. It basically helps to normalize the data within a particular range

For this, we use Z-score method.

We do this on the training set of data.

1.Fit(): Method calculates the parameters μ and σ and saves them as internal objects.

2.Transform(): Method using these calculated parameters apply the transformation to a particular dataset.

3.Fit_transform(): joins the fit() and transform() method for transformation of dataset.

Code snippet for Feature Scaling/Standardisation(after train_test_split).

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit_transform(X_train)
sc.transform(X_test)

We apply the same(training set same two parameters μ and σ (values)) parameter transformation on our testing set.

Question 52

Generic difference between the methods:

fit(raw_documents[, y]): Learn a vocabulary dictionary of all tokens in the raw documents.
fit_transform(raw_documents[, y]): Learn the vocabulary dictionary and return term-document matrix. This is equivalent to fit followed by the transform, but more efficiently implemented.
transform(raw_documents): Transform documents to document-term matrix. Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to the constructor.

Both fit_transform and transform returns the same, Document-term matrix.

Source

Question 53

Here the basic difference between .fit() & .fit_transform():

.fit():

is use in the Supervised learning having two object/parameter(x,y) to fit model and make model to run, where we know that what we are going to predict

.fit_transform():

is use in Unsupervised Learning having one object/parameter(x), where we don’t know, what we are going to predict.

Question 54

In layman’s terms, fit_transform means to do some calculation and then do transformation (say calculating the means of columns from some data and then replacing the missing values). So for training set, you need to both calculate and do transformation.

But for testing set, Machine learning applies prediction based on what was learned during the training set and so it doesn’t need to calculate, it just performs the transformation.

Question 55

Why and When use each one:

All the responses are quite good, but I would make emphasis in WHY and WHEN use each method.

fit(), transform(), fit_transform()

Usually we have a supervised learning problem with (X, y) as out dataset, and we split it into training data and test data:

import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

X_train_vectorized = model.fit_transform(X_train)
X_test_vectorized = model.transform(X_test)

Imagine we are fitting a tokenizer, if we fit X we are including testing data into the tokenizer, but I have seen this error many times!

The correct is to fit ONLY with X_train, because you don’t know “your future data” so you cannot use X_test data for fitting anything!

Then you can transform your test data, but separately, that’s why there are different methods.

Final tip: X_train_transformed = model.fit_transform(X_train) is equivalent to: X_train_transformed = model.fit(X_train).transform(X_train), but the first one is faster.

Note that what I call “model” usually will be a scaler, a tfidf transformer, other kind of vectorizer, a tokenizer…

Question 56

So my dataset has some information by location for n dates. The problem is each date is actually a different column header. For example the CSV looks like

location    name    Jan-2010    Feb-2010    March-2010
A           "test"  12          20          30
B           "foo"   18          20          25

What I would like is for it to look like

location    name    Date        Value
A           "test"  Jan-2010    12       
A           "test"  Feb-2010    20
A           "test"  March-2010  30
B           "foo"   Jan-2010    18       
B           "foo"   Feb-2010    20
B           "foo"   March-2010  25

problem is I don’t know how many dates are in the column (though I know they will always start after name)

Question 57

UPDATE
From v0.20, melt is a first order function, you can now use

df.melt(id_vars=["location", "name"], 
        var_name="Date", 
        value_name="Value")

  location    name        Date  Value
0        A  "test"    Jan-2010     12
1        B   "foo"    Jan-2010     18
2        A  "test"    Feb-2010     20
3        B   "foo"    Feb-2010     20
4        A  "test"  March-2010     30
5        B   "foo"  March-2010     25

OLD(ER) VERSIONS: <0.20

You can use pd.melt to get most of the way there, and then sort:

>>> df
  location  name  Jan-2010  Feb-2010  March-2010
0        A  test        12        20          30
1        B   foo        18        20          25
>>> df2 = pd.melt(df, id_vars=["location", "name"], 
                  var_name="Date", value_name="Value")
>>> df2
  location  name        Date  Value
0        A  test    Jan-2010     12
1        B   foo    Jan-2010     18
2        A  test    Feb-2010     20
3        B   foo    Feb-2010     20
4        A  test  March-2010     30
5        B   foo  March-2010     25
>>> df2 = df2.sort(["location", "name"])
>>> df2
  location  name        Date  Value
0        A  test    Jan-2010     12
2        A  test    Feb-2010     20
4        A  test  March-2010     30
1        B   foo    Jan-2010     18
3        B   foo    Feb-2010     20
5        B   foo  March-2010     25

(Might want to throw in a .reset_index(drop=True), just to keep the output clean.)

Note: pd.DataFrame.sort has been deprecated in favour of pd.DataFrame.sort_values.

Question 58

Use set_index with stack for MultiIndex Series, then for DataFrame add reset_index with rename:

df1 = (df.set_index(["location", "name"])
         .stack()
         .reset_index(name='Value')
         .rename(columns={'level_2':'Date'}))
print (df1)
  location  name        Date  Value
0        A  test    Jan-2010     12
1        A  test    Feb-2010     20
2        A  test  March-2010     30
3        B   foo    Jan-2010     18
4        B   foo    Feb-2010     20
5        B   foo  March-2010     25

Question 59

I guess I found a simpler solution

temp1 = pd.melt(df1, id_vars=["location"], var_name='Date', value_name='Value')
temp2 = pd.melt(df1, id_vars=["name"], var_name='Date', value_name='Value')

Concat whole temp1 with temp2‘s column name

temp1['new_column'] = temp2['name']

You now have what you asked for.

Question 60

`pd.wide_to_long`

You can add a prefix to your year columns and then feed directly to pd.wide_to_long. I won’t pretend this is efficient, but it may in certain situations be more convenient than pd.melt, e.g. when your columns already have an appropriate prefix.

df.columns = np.hstack((df.columns[:2], df.columns[2:].map(lambda x: f'Value{x}')))

res = pd.wide_to_long(df, stubnames=['Value'], i='name', j='Date').reset_index()\
        .sort_values(['location', 'name'])

print(res)

   name        Date location  Value
0  test    Jan-2010        A     12
2  test    Feb-2010        A     20
4  test  March-2010        A     30
1   foo    Jan-2010        B     18
3   foo    Feb-2010        B     20
5   foo  March-2010        B     25

Question 61

I do this in Python 2:

"(%d goals, $%d)" % (self.goals, self.penalties)

What is the Python 3 version of this?

I tried searching for examples online but I kept getting Python 2 versions.

Question 62

Here are the docs about the “new” format syntax. An example would be:

"({:d} goals, ${:d})".format(self.goals, self.penalties)

If both goals and penalties are integers (i.e. their default format is ok), it could be shortened to:

"({} goals, ${})".format(self.goals, self.penalties)

And since the parameters are fields of self, there’s also a way of doing it using a single argument twice (as @Burhan Khalid noted in the comments):

"({0.goals} goals, ${0.penalties})".format(self)

Explaining:

{} means just the next positional argument, with default format;
{0} means the argument with index 0, with default format;
{:d} is the next positional argument, with decimal integer format;
{0:d} is the argument with index 0, with decimal integer format.

There are many others things you can do when selecting an argument (using named arguments instead of positional ones, accessing fields, etc) and many format options as well (padding the number, using thousands separators, showing sign or not, etc). Some other examples:

"({goals} goals, ${penalties})".format(goals=2, penalties=4)
"({goals} goals, ${penalties})".format(**self.__dict__)

"first goal: {0.goal_list[0]}".format(self)
"second goal: {.goal_list[1]}".format(self)

"conversion rate: {:.2f}".format(self.goals / self.shots) # '0.20'
"conversion rate: {:.2%}".format(self.goals / self.shots) # '20.45%'
"conversion rate: {:.0%}".format(self.goals / self.shots) # '20%'

"self: {!s}".format(self) # 'Player: Bob'
"self: {!r}".format(self) # '<__main__.Player instance at 0x00BF7260>'

"games: {:>3}".format(player1.games)  # 'games: 123'
"games: {:>3}".format(player2.games)  # 'games:   4'
"games: {:0>3}".format(player2.games) # 'games: 004'

Note: As others pointed out, the new format does not supersede the former, both are available both in Python 3 and the newer versions of Python 2 as well. Some may say it’s a matter of preference, but IMHO the newer is much more expressive than the older, and should be used whenever writing new code (unless it’s targeting older environments, of course).

Question 63

Python 3.6 now supports shorthand literal string interpolation with PEP 498. For your use case, the new syntax is simply:

f"({self.goals} goals, ${self.penalties})"

This is similar to the previous .format standard, but lets one easily do things like:

>>> width = 10
>>> precision = 4
>>> value = decimal.Decimal('12.34567')
>>> f'result: {value:{width}.{precision}}'
'result:      12.35'

Question 64

That line works as-is in Python 3.

>>> sys.version
'3.2 (r32:88445, Oct 20 2012, 14:09:29) \n[GCC 4.5.2]'
>>> "(%d goals, $%d)" % (self.goals, self.penalties)
'(1 goals, $2)'

问题：Python-如何在python中验证网址？（格式是否正确）

回答 0

django url验证正则表达式（源）：

django url validation regex (source):

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

使用urllib和类似Django的正则表达式验证URL

Python 3.7

说明

IPv6支持

例子

Validate URL with urllib and Django-like regex

Python 3.7

Explanation

IPv6 Support

Examples

回答 9

问题：合并两个列表并删除重复项，而不删除原始列表中的重复项

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

问题：如何从Python的文件路径中提取文件夹路径？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

问题：如何在Jupyter中显示完整输出，而不仅仅是最后结果？

回答 0

回答 1

回答 2

每个笔记本的基础

永久改变

Per Notebook Basis

Permanent Change

问题：matplotlib获取ylim值

回答 0

回答 1

回答 2

问题：Pandas DataFrame到列表列表

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

问题：sklearn中的“ transform”和“ fit_transform”有什么区别

回答 0

回答 1

回答 2

回答 3

回答 4

。适合（）：

.fit_transform（）：

.fit():

.fit_transform():

回答 5

回答 6

为什么和何时使用每一个：

Why and When use each one:

问题：熊猫将某些列转换为行

回答 0

回答 1

使用`urllib`和类似Django的正则表达式验证URL

Validate URL with `urllib` and Django-like regex