Python 实用宝典

Question 1

Is there a quick way to “sub-flatten” or flatten only some of the first dimensions in a numpy array?

For example, given a numpy array of dimensions (50,100,25), the resultant dimensions would be (5000,25)

Question 2

>>> arr = numpy.zeros((50,100,25))
>>> arr.shape
# (50, 100, 25)

>>> new_arr = arr.reshape(5000,25)
>>> new_arr.shape   
# (5000, 25)

# One shape dimension can be -1. 
# In this case, the value is inferred from 
# the length of the array and remaining dimensions.
>>> another_arr = arr.reshape(-1, arr.shape[-1])
>>> another_arr.shape
# (5000, 25)

Question 3

A slight generalization to Alexander’s answer – np.reshape can take -1 as an argument, meaning “total array size divided by product of all other listed dimensions”:

e.g. to flatten all but the last dimension:

>>> arr = numpy.zeros((50,100,25))
>>> new_arr = arr.reshape(-1, arr.shape[-1])
>>> new_arr.shape
# (5000, 25)

Question 4

A slight generalization to Peter’s answer — you can specify a range over the original array’s shape if you want to go beyond three dimensional arrays.

e.g. to flatten all but the last two dimensions:

arr = numpy.zeros((3, 4, 5, 6))
new_arr = arr.reshape(-1, *arr.shape[-2:])
new_arr.shape
# (12, 5, 6)

EDIT: A slight generalization to my earlier answer: you can, of course, also specify a range at the beginning of the reshape:

arr = numpy.zeros((3, 4, 5, 6, 7, 8))
new_arr = arr.reshape(*arr.shape[:2], -1, *arr.shape[-2:])
new_arr.shape
# (3, 4, 30, 7, 8)
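
The slicing pattern above can be wrapped in a small helper. This is only a sketch: merge_axes and its start/stop parameters are names made up for illustration, not part of NumPy.

```python
import numpy as np


def merge_axes(arr, start, stop):
    """Collapse axes start..stop-1 of arr into one axis (hypothetical helper)."""
    shape = arr.shape
    # -1 lets NumPy infer the merged length from the total size.
    return arr.reshape(*shape[:start], -1, *shape[stop:])


arr = np.zeros((3, 4, 5, 6, 7, 8))
print(merge_axes(arr, 2, 4).shape)   # (3, 4, 30, 7, 8)
print(merge_axes(arr, 0, 2).shape)   # (12, 5, 6, 7, 8)
```

With start=0 this reduces to the "flatten the leading dimensions" case from the question.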

Question 5

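An alternative approach is to use numpy.resize(), which returns a new array (and, unlike reshape, repeats the data instead of raising if the requested size is larger than the input):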

In [37]: shp = (50,100,25)
In [38]: arr = np.random.random_sample(shp)
In [45]: resized_arr = np.resize(arr, (np.prod(shp[:2]), shp[-1]))
In [46]: resized_arr.shape
Out[46]: (5000, 25)

# sanity check with other solutions
In [47]: resized = np.reshape(arr, (-1, shp[-1]))
In [48]: np.allclose(resized_arr, resized)
Out[48]: True
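
One difference worth keeping in mind when choosing between the two: reshape insists that the total size stays the same, while np.resize silently repeats data to fill a larger shape. A small sketch on a toy array (the array and shapes here are made up for illustration):

```python
import numpy as np

arr = np.arange(6)

# reshape refuses a target shape whose size differs from the input.
try:
    arr.reshape(2, 5)
    reshape_failed = False
except ValueError:
    reshape_failed = True

# np.resize returns a new array and repeats the data to fill the extra slots.
resized = np.resize(arr, (2, 5))
print(reshape_failed)      # True
print(resized.tolist())    # [[0, 1, 2, 3, 4], [5, 0, 1, 2, 3]]
```

So a typo in the target shape goes unnoticed with np.resize, whereas reshape would raise.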

Question 6

I need to mark routines as deprecated, but apparently there’s no standard library decorator for deprecation. I am aware of recipes for it and the warnings module, but my question is: why is there no standard library decorator for this (common) task?

Additional question: are there standard decorators in the standard library at all?

Question 7

Here’s a snippet, modified from those cited by Leandro:

import warnings
import functools

def deprecated(func):
    """This is a decorator which can be used to mark functions
    as deprecated. It will result in a warning being emitted
    when the function is used."""
    @functools.wraps(func)
    def new_func(*args, **kwargs):
        warnings.simplefilter('always', DeprecationWarning)  # turn off filter
        warnings.warn("Call to deprecated function {}.".format(func.__name__),
                      category=DeprecationWarning,
                      stacklevel=2)
        warnings.simplefilter('default', DeprecationWarning)  # reset filter
        return func(*args, **kwargs)
    return new_func

# Examples

@deprecated
def some_old_function(x, y):
    return x + y

class SomeClass:
    @deprecated
    def some_old_method(self, x, y):
        return x + y

The filter is handled explicitly because, in some interpreters, the first solution shown (without filter handling) may result in the warning being suppressed.
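
If you only need to check that the warning is emitted (for example in a test), one option is warnings.catch_warnings, which restores the previous filter state on exit instead of changing the global filters on every call. A minimal sketch, using a stripped-down variant of the decorator rather than the exact code above:

```python
import functools
import warnings


def deprecated(func):
    """Minimal variant of the decorator, without touching the global filters."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn("Call to deprecated function {}.".format(func.__name__),
                      category=DeprecationWarning, stacklevel=2)
        return func(*args, **kwargs)
    return wrapper


@deprecated
def some_old_function(x, y):
    return x + y


# catch_warnings restores the previous filter state when the block exits.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", DeprecationWarning)
    result = some_old_function(1, 2)

print(result)                       # 3
print(caught[0].category.__name__)  # DeprecationWarning
```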

Question 8

Here is another solution:

This decorator (a decorator factory, in fact) allows you to give a reason message. It also helps the developer diagnose the problem by giving the source filename and line number.

EDIT: This code uses Zero’s recommendation: it replaces the warnings.warn_explicit line with warnings.warn(msg, category=DeprecationWarning, stacklevel=2), which prints the function call site rather than the function definition site. This makes debugging easier.

EDIT2: This version allows the developer to specify an optional “reason” message.

import functools
import inspect
import warnings

string_types = (type(b''), type(u''))


def deprecated(reason):
    """
    This is a decorator which can be used to mark functions
    as deprecated. It will result in a warning being emitted
    when the function is used.
    """

    if isinstance(reason, string_types):

        # The @deprecated is used with a 'reason'.
        #
        # .. code-block:: python
        #
        #    @deprecated("please, use another function")
        #    def old_function(x, y):
        #      pass

        def decorator(func1):

            if inspect.isclass(func1):
                fmt1 = "Call to deprecated class {name} ({reason})."
            else:
                fmt1 = "Call to deprecated function {name} ({reason})."

            @functools.wraps(func1)
            def new_func1(*args, **kwargs):
                warnings.simplefilter('always', DeprecationWarning)
                warnings.warn(
                    fmt1.format(name=func1.__name__, reason=reason),
                    category=DeprecationWarning,
                    stacklevel=2
                )
                warnings.simplefilter('default', DeprecationWarning)
                return func1(*args, **kwargs)

            return new_func1

        return decorator

    elif inspect.isclass(reason) or inspect.isfunction(reason):

        # The @deprecated is used without any 'reason'.
        #
        # .. code-block:: python
        #
        #    @deprecated
        #    def old_function(x, y):
        #      pass

        func2 = reason

        if inspect.isclass(func2):
            fmt2 = "Call to deprecated class {name}."
        else:
            fmt2 = "Call to deprecated function {name}."

        @functools.wraps(func2)
        def new_func2(*args, **kwargs):
            warnings.simplefilter('always', DeprecationWarning)
            warnings.warn(
                fmt2.format(name=func2.__name__),
                category=DeprecationWarning,
                stacklevel=2
            )
            warnings.simplefilter('default', DeprecationWarning)
            return func2(*args, **kwargs)

        return new_func2

    else:
        raise TypeError(repr(type(reason)))

You can use this decorator for functions, methods and classes.

Here is a simple example:

@deprecated("use another function")
def some_old_function(x, y):
    return x + y


class SomeClass(object):
    @deprecated("use another method")
    def some_old_method(self, x, y):
        return x + y


@deprecated("use another class")
class SomeOldClass(object):
    pass


some_old_function(5, 3)
SomeClass().some_old_method(8, 9)
SomeOldClass()

You’ll get:

deprecated_example.py:59: DeprecationWarning: Call to deprecated function some_old_function (use another function).
  some_old_function(5, 3)
deprecated_example.py:60: DeprecationWarning: Call to deprecated function some_old_method (use another method).
  SomeClass().some_old_method(8, 9)
deprecated_example.py:61: DeprecationWarning: Call to deprecated class SomeOldClass (use another class).
  SomeOldClass()

EDIT3: This decorator is now part of the Deprecated library:

New stable release v1.2.10 🎉

Question 9

As muon suggested, you can install the deprecation package for this.

The deprecation library provides a deprecated decorator and a fail_if_not_removed decorator for your tests.

Installation

pip install deprecation

Example Usage

import deprecation

@deprecation.deprecated(deprecated_in="1.0", removed_in="2.0",
                        current_version=__version__,
                        details="Use the bar function instead")
def foo():
    """Do some stuff"""
    return 1

See http://deprecation.readthedocs.io/ for the full documentation.

Question 10

I guess the reason is that Python code can’t be processed statically (as is done by C++ compilers), so you can’t get a warning about using something before actually using it. I don’t think it’s a good idea to spam the user of your script with a bunch of messages like “Warning: the developer of this script is using a deprecated API”.

Update: you can, however, create a decorator that wraps the original function in another one. The new function keeps a flag recording whether it has already been called and shows the message only the first time the flag is set. It may also, or instead, print a list of all deprecated functions used in the program at exit.

Question 11

You can create a utils file

import warnings

def deprecated(message):
  def deprecated_decorator(func):
      def deprecated_func(*args, **kwargs):
          warnings.warn("{} is a deprecated function. {}".format(func.__name__, message),
                        category=DeprecationWarning,
                        stacklevel=2)
          warnings.simplefilter('default', DeprecationWarning)
          return func(*args, **kwargs)
      return deprecated_func
  return deprecated_decorator

And then import the deprecation decorator as follows:

from .utils import deprecated

@deprecated("Use method yyy instead")
def some_method():
    pass

Question 12

UPDATE: I think it is better to show the DeprecationWarning only the first time for each code line, and to be able to attach a message:

import inspect
import traceback
import warnings
import functools

import time


def deprecated(message: str = ''):
    """
    This is a decorator which can be used to mark functions
    as deprecated. It will result in a warning being emitted
    when the function is used for the first time and the filter is set to show DeprecationWarning.
    """
    def decorator_wrapper(func):
        @functools.wraps(func)
        def function_wrapper(*args, **kwargs):
            current_call_source = '|'.join(traceback.format_stack(inspect.currentframe()))
            if current_call_source not in function_wrapper.last_call_source:
                warnings.warn("Function {} is now deprecated! {}".format(func.__name__, message),
                              category=DeprecationWarning, stacklevel=2)
                function_wrapper.last_call_source.add(current_call_source)

            return func(*args, **kwargs)

        function_wrapper.last_call_source = set()

        return function_wrapper
    return decorator_wrapper


@deprecated('You must use my_func2!')
def my_func():
    time.sleep(.1)
    print('aaa')
    time.sleep(.1)


def my_func2():
    print('bbb')


warnings.simplefilter('always', DeprecationWarning)  # turn off filter
print('before cycle')
for i in range(5):
    my_func()
print('after cycle')
my_func()
my_func()
my_func()

Result:

before cycle
C:/Users/adr-0/OneDrive/Projects/Python/test/unit1.py:45: DeprecationWarning: Function my_func is now deprecated! You must use my_func2!
aaa
aaa
aaa
aaa
aaa
after cycle
C:/Users/adr-0/OneDrive/Projects/Python/test/unit1.py:47: DeprecationWarning: Function my_func is now deprecated! You must use my_func2!
aaa
C:/Users/adr-0/OneDrive/Projects/Python/test/unit1.py:48: DeprecationWarning: Function my_func is now deprecated! You must use my_func2!
aaa
C:/Users/adr-0/OneDrive/Projects/Python/test/unit1.py:49: DeprecationWarning: Function my_func is now deprecated! You must use my_func2!
aaa

Process finished with exit code 0

We can just click on the warning path and go to the line in PyCharm.

Question 13

Augmenting this answer by Steven Vascellaro:

If you use Anaconda, first install deprecation package:

conda install -c conda-forge deprecation

Then paste the following on the top of the file

import deprecation

@deprecation.deprecated(deprecated_in="1.0", removed_in="2.0",
                    current_version=__version__,
                    details="Use the bar function instead")
def foo():
    """Do some stuff"""
    return 1

See http://deprecation.readthedocs.io/ for the full documentation.

Question 14

I have a users table in my MySQL database. This table has id, name and age fields.

How can I delete some record by id?

Now I use the following code:

user = User.query.get(id)
db.session.delete(user)
db.session.commit()

But I don’t want to make any query before delete operation. Is there any way to do this? I know, I can use db.engine.execute("delete from users where id=..."), but I would like to use delete() method.

Question 15

You can do this,

User.query.filter_by(id=123).delete()

or

User.query.filter(User.id == 123).delete()

Make sure to commit for delete() to take effect.

Question 16

Just want to share another option:

# mark two objects to be deleted
session.delete(obj1)
session.delete(obj2)

# commit (or flush)
session.commit()

http://docs.sqlalchemy.org/en/latest/orm/session_basics.html#deleting

In this example, the following code should work fine:

obj = User.query.filter_by(id=123).one()
session.delete(obj)
session.commit()

Question 17

Another possible solution, especially if you want a batch delete:

deleted_objects = User.__table__.delete().where(User.id.in_([1, 2, 3]))
session.execute(deleted_objects)
session.commit()

Question 18

I am trying to write a Pandas dataframe (or can use a numpy array) to a MySQL database using MySQLdb. MySQLdb doesn’t seem to understand ‘nan’, and my database throws an error saying nan is not in the field list. I need to find a way to convert the ‘nan’ into a NoneType.

Any ideas?

Question 19

@bogatron has it right: you can use where. It’s worth noting that you can do this natively in pandas:

df1 = df.where(pd.notnull(df), None)

Note: this changes the dtype of all columns to object.

Example:

In [1]: df = pd.DataFrame([1, np.nan])

In [2]: df
Out[2]: 
    0
0   1
1 NaN

In [3]: df1 = df.where(pd.notnull(df), None)

In [4]: df1
Out[4]: 
      0
0     1
1  None

Note: what you cannot do is recast the DataFrame’s dtype to allow all data types, using astype, and then use the DataFrame fillna method with None. The closest one-liner inserts the string 'None' rather than the None object:

df1 = df.astype(object).replace(np.nan, 'None')

Unfortunately neither fillna nor, at the time of this answer, replace works with None; see this (closed) issue.

As an aside, it’s worth noting that for most use cases you don’t need to replace NaN with None, see this question about the difference between NaN and None in pandas.

However, in this specific case it seems you do (at least at the time of this answer).

Question 20

df = df.replace({np.nan: None})

Credit goes to this guy here on this Github issue.

Question 21

You can replace nan with None in your numpy array:

>>> x = np.array([1, np.nan, 3])
>>> y = np.where(np.isnan(x), None, x)
>>> print(y)
[1.0 None 3.0]
>>> print(type(y[1]))
<class 'NoneType'>

Question 22

After stumbling around, this worked for me:

df = df.astype(object).where(pd.notnull(df),None)
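
Since the end goal in the question is to hand rows to a database driver, another option (not one of the answers above, just a sketch) is to leave the DataFrame alone and swap NaN for None while building the parameter tuples. The column names here are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [1.0, np.nan], "name": ["a", "b"]})

# Build plain tuples, swapping any missing value for None on the way.
rows = [
    tuple(None if pd.isna(value) else value for value in row)
    for row in df.itertuples(index=False, name=None)
]
print(rows)
```

The resulting list could then be passed to something like cursor.executemany, and it does not depend on how a particular pandas version treats None inside where or replace.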

Question 23

Just an addition to @Andy Hayden’s answer:

Since DataFrame.mask is the opposite twin of DataFrame.where, they have exactly the same signature but the opposite meaning:

DataFrame.where is used for replacing values where the condition is False.
DataFrame.mask is used for replacing values where the condition is True.

So in this question, using df.mask(df.isna(), other=None, inplace=True) might be more intuitive.

Question 24

Another addition: be careful when replacing multiple values and converting the type of the column back from object to float. If you want to be certain that your Nones won’t flip back to np.NaNs, apply @andy-hayden’s suggestion of using DataFrame.where. Illustration of how replace can still go ‘wrong’:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.DataFrame({"a": [1, np.NAN, np.inf]})

In [4]: df
Out[4]:
     a
0  1.0
1  NaN
2  inf

In [5]: df.replace({np.NAN: None})
Out[5]:
      a
0     1
1  None
2   inf

In [6]: df.replace({np.NAN: None, np.inf: None})
Out[6]:
     a
0  1.0
1  NaN
2  NaN

In [7]: df.where((pd.notnull(df)), None).replace({np.inf: None})
Out[7]:
     a
0  1.0
1  NaN
2  NaN

Question 25

Quite old, yet I stumbled upon the very same issue. Try doing this:

df['col_replaced'] = df['col_with_npnans'].apply(lambda x: None if np.isnan(x) else x)

Question 26

I have the following pandas dataframe Top15:

I create a column that estimates the number of citable documents per person:

Top15['PopEst'] = Top15['Energy Supply'] / Top15['Energy Supply per Capita']
Top15['Citable docs per Capita'] = Top15['Citable documents'] / Top15['PopEst']

I want to know the correlation between the number of citable documents per capita and the energy supply per capita. So I use the .corr() method (Pearson’s correlation):

data = Top15[['Citable docs per Capita','Energy Supply per Capita']]
correlation = data.corr(method='pearson')

I want to return a single number, but the result is:

Question 27

Without actual data it is hard to answer the question but I guess you are looking for something like this:

Top15['Citable docs per Capita'].corr(Top15['Energy Supply per Capita'])

That calculates the correlation between your two columns 'Citable docs per Capita' and 'Energy Supply per Capita'.

To give an example:

import pandas as pd

df = pd.DataFrame({'A': range(4), 'B': [2*i for i in range(4)]})

   A  B
0  0  0
1  1  2
2  2  4
3  3  6

Then

df['A'].corr(df['B'])

gives 1 as expected.

Now, if you change a value, e.g.

df.loc[2, 'B'] = 4.5

   A    B
0  0  0.0
1  1  2.0
2  2  4.5
3  3  6.0

the command

df['A'].corr(df['B'])

returns

0.99586

which is still close to 1, as expected.

If you apply .corr directly to your dataframe, it will return all pairwise correlations between your columns; that’s why you then observe 1s at the diagonal of your matrix (each column is perfectly correlated with itself).

df.corr()

will therefore return

          A         B
A  1.000000  0.995862
B  0.995862  1.000000

In the graphic you show, only the upper left corner of the correlation matrix is represented (I assume).

There can be cases, where you get NaNs in your solution – check this post for an example.

If you want to filter entries above/below a certain threshold, you can check this question. If you want to plot a heatmap of the correlation coefficients, you can check this answer and if you then run into the issue with overlapping axis-labels check the following post.

Question 28

I ran into the same issue. It appeared that Citable Documents per Person was a float, and Python somehow skips it by default. All the other columns of my dataframe were in numpy formats, so I solved it by converting the column to np.float64:

Top15['Citable Documents per Person']=np.float64(Top15['Citable Documents per Person'])

Remember it’s exactly the column you calculated yourself

Question 29

My solution would be after converting data to numerical type:

Top15[['Citable docs per Capita','Energy Supply per Capita']].corr()

Question 30

If you want the correlations between all pairs of columns, you could do something like this:

import pandas as pd
import numpy as np

def get_corrs(df):
    col_correlations = df.corr()
    col_correlations.loc[:, :] = np.tril(col_correlations, k=-1)
    cor_pairs = col_correlations.stack()
    return cor_pairs.to_dict()

my_corrs = get_corrs(df)
# and the following line to retrieve the single correlation
print(my_corrs[('Citable docs per Capita','Energy Supply per Capita')])

Question 31

When you call this:

data = Top15[['Citable docs per Capita','Energy Supply per Capita']]
correlation = data.corr(method='pearson')

Since the DataFrame.corr() function computes pairwise correlations, you get four values from two variables. The diagonal values are the autocorrelations (each variable with itself, two values since you have two variables), and the other two values are the cross-correlations of one variable with the other and vice versa.

Either perform correlation between two series to get a single value:

from scipy.stats import pearsonr
docs_col = Top15['Citable docs per Capita'].values
energy_col = Top15['Energy Supply per Capita'].values
corr , _ = pearsonr(docs_col, energy_col)

or, if you want a single value from the same function (DataFrame’s corr):

single_value = correlation.iloc[0, 1]
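
To make the two routes concrete, here is a small sketch on toy data (columns A and B are stand-ins for the two Top15 columns). The off-diagonal entry of the matrix is picked by label and compared with the Series-to-Series result:

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2, 3], "B": [0.0, 2.0, 4.5, 6.0]})

matrix = df.corr(method="pearson")   # 2x2 DataFrame
from_matrix = matrix.loc["A", "B"]   # off-diagonal entry, selected by label
from_series = df["A"].corr(df["B"])  # scalar directly

print(round(from_matrix, 6))         # 0.995862
```

Both give the same number; selecting by label avoids relying on column order.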

Hope this helps.

Question 32

It works like this:

Top15['Citable docs per Capita']=np.float64(Top15['Citable docs per Capita'])

Top15['Energy Supply per Capita']=np.float64(Top15['Energy Supply per Capita'])

Top15['Energy Supply per Capita'].corr(Top15['Citable docs per Capita'])

Question 33

I solved this problem by changing the data type. If you see the ‘Energy Supply per Capita’ is a numerical type while the ‘Citable docs per Capita’ is an object type. I converted the column to float using astype. I had the same problem with some np functions: count_nonzero and sum worked while mean and std didn’t.

Question 34

Converting ‘Citable docs per Capita’ to numeric before computing the correlation will solve the problem.

    Top15['Citable docs per Capita'] = pd.to_numeric(Top15['Citable docs per Capita'])
    data = Top15[['Citable docs per Capita','Energy Supply per Capita']]
    correlation = data.corr(method='pearson')

Question 35

I am using sklearn and having a problem with the affinity propagation. I have built an input matrix and I keep getting the following error.

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I have run

np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True

I tried using

mat[np.isfinite(mat) == True] = 0

to remove the infinite values but this did not work either. What can I do to get rid of the infinite values in my matrix, so that I can use the affinity propagation algorithm?

I am using anaconda and python 2.7.9.

Question 36

This might happen inside scikit, and it depends on what you’re doing. I recommend reading the documentation for the functions you’re using. You might be using one that depends, for example, on your matrix being positive definite, and your matrix may not fulfil that criterion.

EDIT: How could I miss that:

np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True

is obviously wrong. Right would be:

np.any(np.isnan(mat))

and

np.all(np.isfinite(mat))

You want to check whether any of the elements is NaN, not whether the return value of the any function is a number…
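
A tiny demonstration of the difference, on a made-up 2x2 matrix that does contain a NaN:

```python
import numpy as np

mat = np.array([[1.0, np.nan], [3.0, 4.0]])

# mat.any() collapses the array to one boolean first, so isnan sees only True.
wrong = np.isnan(mat.any())
# isnan runs element-wise, then any() reduces the boolean mask.
right = np.isnan(mat).any()

print(bool(wrong), bool(right))  # False True
```

The first form reports "no NaN" even though the matrix has one, which is why the checks in the question looked clean.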

Question 37

I got the same error message when using sklearn with pandas. My solution is to reset the index of my dataframe df before running any sklearn code:

df = df.reset_index()

I encountered this issue many times when I removed some entries in my df, such as

df = df[df.label=='desired_one']

Question 38

This is my function (based on this) to clean the dataset of nan, Inf, and missing cells (for skewed datasets):

import numpy as np
import pandas as pd

def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df.dropna(inplace=True)
    # keep only rows with no NaN or infinite entries
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)

Question 39

The Dimensions of my input array were skewed, as my input csv had empty spaces.

Question 40

This is the check on which it fails:

https://github.com/scikit-learn/scikit-learn/blob/0.17.X/sklearn/utils/validation.py#L51

Which says

def _assert_all_finite(X):
    """Like assert_all_finite, but only for ndarray."""
    X = np.asanyarray(X)
    # First try an O(n) time, O(1) space solution for the common case that
    # everything is finite; fall back to O(n) space np.isfinite to prevent
    # false positives from overflow in sum method.
    if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
            and not np.isfinite(X).all()):
        raise ValueError("Input contains NaN, infinity"
                         " or a value too large for %r." % X.dtype)

So make sure that there are no NaN values in your input, that all the values are actually floats, and that none of them is Inf.

Question 41

With this version of python 3:

/opt/anaconda3/bin/python --version
Python 3.6.0 :: Anaconda 4.3.0 (64-bit)

Looking at the details of the error, I found the lines of codes causing the failure:

/opt/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
     56             and not np.isfinite(X).all()):
     57         raise ValueError("Input contains NaN, infinity"
---> 58                          " or a value too large for %r." % X.dtype)
     59 
     60 

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

From this, I was able to extract the correct way to test what was going on with my data using the same test which fails given by the error message: np.isfinite(X)

Then with a quick and dirty loop, I was able to find that my data indeed contains nans:

print(p[:,0].shape)
index = 0
for i in p[:,0]:
    if not np.isfinite(i):
        print(index, i)
    index +=1

(367340,)
4454 nan
6940 nan
10868 nan
12753 nan
14855 nan
15678 nan
24954 nan
30251 nan
31108 nan
51455 nan
59055 nan
...

Now all I have to do is remove the values at these indexes.
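
For that last step, the loop can be replaced by a boolean mask. A sketch on a small made-up array (the real p above has 367340 rows):

```python
import numpy as np

p = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [np.inf, 6.0]])

bad = ~np.isfinite(p[:, 0])      # True where the first column is NaN or +/-inf
print(np.flatnonzero(bad))       # [1 3]
cleaned = p[~bad]                # keep only rows whose first column is finite
print(cleaned.shape)             # (2, 2)
```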

Question 42

I had the error after trying to select a subset of rows:

df = df.reindex(index=my_index)

Turns out that my_index contained values that were not contained in df.index, so the reindex function inserted some new rows and filled them with nan.

Question 43

In most cases, getting rid of infinite and null values solves this problem.

Get rid of infinite values:

df.replace([np.inf, -np.inf], np.nan, inplace=True)

Get rid of null values in whatever way you like: a specific value such as 999, the mean, or your own function to impute missing values:

df.fillna(999, inplace=True)

Question 44

I had the same error, and in my case X and y were dataframes so I had to convert them to matrices first:

X = X.values.astype(np.float64)
y = y.values.astype(np.float64)

Edit: The originally suggested X.as_matrix() is Deprecated

Question 45

I got the same error. It worked with df.fillna(-99999, inplace=True) before doing any replacement, substitution, etc.

Question 46

In my case the problem was that many scikit functions return numpy arrays, which are devoid of pandas index. So there was an index mismatch when I used those numpy arrays to build new DataFrames and then I tried to mix them with the original data.

Question 47

Remove all infinite values:

(and replace with min or max for that column)

# find min and max values for each column, ignoring nan, -inf, and inf
mins = [np.nanmin(matrix[:, i][matrix[:, i] != -np.inf]) for i in range(matrix.shape[1])]
maxs = [np.nanmax(matrix[:, i][matrix[:, i] != np.inf]) for i in range(matrix.shape[1])]

# go through matrix one column at a time and replace  + and -infinity 
# with the max or min for that column
for i in range(matrix.shape[1]):
    matrix[:, i][matrix[:, i] == -np.inf] = mins[i]
    matrix[:, i][matrix[:, i] == np.inf] = maxs[i]

Question 48

try

mat.sum()

If the sum of your data is infinite (greater than the maximum float value, which is about 3.402823e+38 for float32 and about 1.797693e+308 for float64), the fast check fails and you can get that error.

see the _assert_all_finite function in validation.py from the scikit source code:

if is_float and np.isfinite(X.sum()):
    pass
elif is_float:
    msg_err = "Input contains {} or a value too large for {!r}."
    if (allow_nan and np.isinf(X).any() or
            not allow_nan and not np.isfinite(X).all()):
        type_err = 'infinity' if allow_nan else 'NaN, infinity'
        # print(X.sum())
        raise ValueError(msg_err.format(type_err, X.dtype))
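
Note that a sum can be infinite even when every element is finite; in the quoted code that case reaches the elif branch, where the element-wise test decides. A sketch with two made-up float32 values just below the float32 maximum:

```python
import numpy as np

x = np.array([3e38, 3e38], dtype=np.float32)

all_finite = bool(np.isfinite(x).all())
with np.errstate(over="ignore"):       # silence the overflow warning
    total = x.sum()                    # accumulates in float32 and overflows

print(all_finite, np.isinf(total))     # True True
```

So mat.sum() is a quick first look, but np.isfinite(mat).all() is the check that tells you whether an individual value is the culprit.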

Question 49

I want to get the content from the below website. If I use a browser like Firefox or Chrome I could get the real website page I want, but if I use the Python requests package (or wget command) to get it, it returns a totally different HTML page. I thought the developer of the website had made some blocks for this, so the question is:

How do I fake a browser visit by using python requests or command wget?

http://www.ichangtou.com/#company:data_000008.html

Question 50

Provide a User-Agent header:

import requests

url = 'http://www.ichangtou.com/#company:data_000008.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)
print(response.content)

FYI, here is a list of User-Agent strings for different browsers:

List of all Browsers

As a side note, there is a pretty useful third-party package called fake-useragent that provides a nice abstraction layer over user agents:

fake-useragent

Up to date simple useragent faker with real world database

Demo:

>>> from fake_useragent import UserAgent
>>> ua = UserAgent()
>>> ua.chrome
u'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'
>>> ua.random
u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36'

Question 51

If this question is still valid:

I used fake-useragent.

How to use:

from fake_useragent import UserAgent
import requests


ua = UserAgent()
print(ua.chrome)
header = {'User-Agent':str(ua.chrome)}
print(header)
url = "https://www.hybrid-analysis.com/recent-submissions?filter=file&sort=^timestamp"
htmlContent = requests.get(url, headers=header)
print(htmlContent)

Output:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17
{'User-Agent': 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}
<Response [200]>

Question 52

Try doing this, using firefox as fake user agent (moreover, it’s a good startup script for web scraping with the use of cookies):

#!/usr/bin/env python3
# -*- coding: utf8 -*-
# vim:ts=4:sw=4


import http.cookiejar, urllib.request, sys

def doIt(uri):
    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    # set the header on the opener before opening, otherwise it is never sent
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    page = opener.open(uri)
    print(page.read())

for i in sys.argv[1:]:
    doIt(i)

USAGE:

python script.py "http://www.ichangtou.com/#company:data_000008.html"

Question 53

The root of the answer is that the person asking the question needs a JavaScript interpreter to get what they are after. What I have found is that I can often get all of the information I want from a website as JSON, before it is interpreted by JavaScript. This has saved me a ton of time that would otherwise go into parsing HTML and hoping each page is in the same format.

So when you get a response from a website using requests, look closely at the HTML/text, because you might find the JSON used by the page’s JavaScript in the footer, ready to be parsed.
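
As an illustration only: assuming the page embeds its data in a script tag of type application/json (the HTML below is invented, and real sites vary in tag attributes and layout), the payload can be pulled out with the standard library:

```python
import json
import re

# Hypothetical page source; real sites differ in tag attributes and layout.
html = """
<html><body>
<div id="app"></div>
<script id="initial-data" type="application/json">
{"company": "000008", "prices": [1.5, 2.0]}
</script>
</body></html>
"""

match = re.search(
    r'<script[^>]*type="application/json"[^>]*>(.*?)</script>',
    html,
    flags=re.DOTALL,
)
data = json.loads(match.group(1)) if match else None
print(data["company"])  # 000008
```

With requests, html would be response.text. A regular expression is enough for a sketch like this; an HTML parser is the sturdier choice for real pages.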

Question 54

Is there a pandas built-in way to apply two different aggregating functions f1, f2 to the same column df["returns"], without having to call agg() multiple times?

Example dataframe:

import numpy as np
import pandas as pd
import datetime as dt

np.random.seed(0)
df = pd.DataFrame({
         "date"    :  [dt.date(2012, x, 1) for x in range(1, 11)], 
         "returns" :  0.05 * np.random.randn(10), 
         "dummy"   :  np.repeat(1, 10)
})

The syntactically wrong, but intuitively right, way to do it would be:

# Assume `f1` and `f2` are defined for aggregating.
df.groupby("dummy").agg({"returns": f1, "returns": f2})

Obviously, Python doesn’t allow duplicate keys. Is there any other manner for expressing the input to agg()? Perhaps a list of tuples [(column, function)] would work better, to allow multiple functions applied to the same column? But agg() seems like it only accepts a dictionary.

Is there a workaround for this besides defining an auxiliary function that just applies both of the functions inside of it? (How would this work with aggregation anyway?)

Question 55

You can simply pass the functions as a list:

In [20]: df.groupby("dummy").agg({"returns": [np.mean, np.sum]})
Out[20]:         
           mean       sum
dummy                    
1      0.036901  0.369012

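or as a dictionary (this nested-dict renaming was deprecated in pandas 0.20 and removed in pandas 1.0, where it raises SpecificationError):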

In [21]: df.groupby('dummy').agg({'returns':
                                  {'Mean': np.mean, 'Sum': np.sum}})
Out[21]: 
        returns          
           Mean       Sum
dummy                    
1      0.036901  0.369012
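
A self-contained sketch of the list form on a small made-up frame, using string aliases for the aggregations instead of the NumPy functions:

```python
import pandas as pd

df = pd.DataFrame({
    "dummy": [1, 1, 2],
    "returns": [0.5, 1.5, 4.0],
})

# A list of aggregations on one column yields one output column per entry.
result = df.groupby("dummy")["returns"].agg(["mean", "sum"])
print(result.loc[1, "mean"], result.loc[1, "sum"])  # 1.0 2.0
```

Selecting the column first keeps the result columns flat (mean, sum) rather than nested under returns.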

Question 56

TLDR; Pandas groupby.agg has a new, easier syntax for specifying (1) aggregations on multiple columns, and (2) multiple aggregations on a column. So, to do this for pandas >= 0.25, use

df.groupby('dummy').agg(Mean=('returns', 'mean'), Sum=('returns', 'sum'))

           Mean       Sum
dummy                    
1      0.036901  0.369012

OR

df.groupby('dummy')['returns'].agg(Mean='mean', Sum='sum')

           Mean       Sum
dummy                    
1      0.036901  0.369012

Pandas >= 0.25: Named Aggregation

Pandas has changed the behavior of GroupBy.agg in favour of a more intuitive syntax for specifying named aggregations. See the 0.25 docs section on Enhancements as well as relevant GitHub issues GH18366 and GH26512.

From the documentation,

To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where

The keywords are the output column names

The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields [‘column’, ‘aggfunc’] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.

You can now pass a tuple via keyword arguments. The tuples follow the format of (<colName>, <aggFunc>).

import pandas as pd

pd.__version__                                                                                                                            
# '0.25.0.dev0+840.g989f912ee'

# Setup
df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                   'height': [9.1, 6.0, 9.5, 34.0],
                   'weight': [7.9, 7.5, 9.9, 198.0]
})

df.groupby('kind').agg(
    max_height=('height', 'max'), min_weight=('weight', 'min'),)

      max_height  min_weight
kind                        
cat          9.5         7.9
dog         34.0         7.5

Alternatively, you can use pd.NamedAgg (essentially a namedtuple) which makes things more explicit.

df.groupby('kind').agg(
    max_height=pd.NamedAgg(column='height', aggfunc='max'), 
    min_weight=pd.NamedAgg(column='weight', aggfunc='min')
)

      max_height  min_weight
kind                        
cat          9.5         7.9
dog         34.0         7.5

It is even simpler for a Series: just pass the aggfunc as a keyword argument.

df.groupby('kind')['height'].agg(max_height='max', min_height='min')    

      max_height  min_height
kind                        
cat          9.5         9.1
dog         34.0         6.0

Lastly, if your column names aren’t valid Python identifiers, use a dictionary with unpacking:

df.groupby('kind')['height'].agg(**{'max height': 'max', ...})
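To tie this back to the original question, here is a minimal sketch of named aggregation with two custom functions applied to the same column. It assumes pandas >= 0.25. The functions `value_range` and `mean_abs_return`, the output names, and the sample data are hypothetical stand-ins for the question's `f1` and `f2`.

```python
import pandas as pd


def value_range(s):
    """Hypothetical stand-in for f1: spread between largest and smallest value."""
    return s.max() - s.min()


def mean_abs_return(s):
    """Hypothetical stand-in for f2: mean of the absolute values."""
    return s.abs().mean()


# Small made-up frame with a single group, mirroring the question's layout.
df = pd.DataFrame({
    "dummy": [1, 1, 1, 1],
    "returns": [0.1, -0.2, 0.3, 0.0],
})

# Each keyword is an output column; each value is a (column, aggfunc) tuple.
# The aggfunc can be a callable as well as a string alias.
result = df.groupby("dummy").agg(
    spread=("returns", value_range),
    mean_abs=("returns", mean_abs_return),
)
print(result)
```

Both output columns come from `returns`, which is exactly what the duplicate-key dictionary in the question could not express.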

Pandas < 0.25

In more recent versions of pandas leading up to 0.24, using a dictionary to specify column names for the aggregation output raises a FutureWarning:

df.groupby('dummy').agg({'returns': {'Mean': 'mean', 'Sum': 'sum'}})
# FutureWarning: using a dict with renaming is deprecated and will be removed 
# in a future version

Using a dictionary for renaming columns has been deprecated since v0.20. On more recent versions of pandas, the names can be specified more simply by passing a list of tuples. When specifying functions this way, all functions for that column must be given as (name, function) tuples.

df.groupby("dummy").agg({'returns': [('op1', 'sum'), ('op2', 'mean')]})

        returns          
            op1       op2
dummy                    
1      0.328953  0.032895

Or,

df.groupby("dummy")['returns'].agg([('op1', 'sum'), ('op2', 'mean')])

            op1       op2
dummy                    
1      0.328953  0.032895

Question 57

Would something like this work:

In [7]: df.groupby('dummy').returns.agg({'func1' : lambda x: x.sum(), 'func2' : lambda x: x.prod()})
Out[7]: 
              func2     func1
dummy                        
1     -4.263768e-16 -0.188565

Question 58

I am using the Boto 3 python library, and want to connect to AWS CloudFront. I need to specify the correct AWS Profile (AWS Credentials), but looking at the official documentation, I see no way to specify it.

I am initializing the client using the code: client = boto3.client('cloudfront')

However, this results in it using the default profile to connect. I couldn’t find a method where I can specify which profile to use.

Question 59

I think the docs aren’t wonderful at exposing how to do this. It has been a supported feature for some time, however, and there are some details in this pull request.

So there are three different ways to do this:

Option A) Create a new session with the profile

    dev = boto3.session.Session(profile_name='dev')

Option B) Change the profile of the default session in code

    boto3.setup_default_session(profile_name='dev')

Option C) Change the profile of the default session with an environment variable

    $ AWS_PROFILE=dev ipython
    >>> import boto3
    >>> s3dev = boto3.resource('s3')

Question 60

Do this to use a profile with name ‘dev’:

session = boto3.session.Session(profile_name='dev')
s3 = session.resource('s3')
for bucket in s3.buckets.all():
    print(bucket.name)

Question 61

This section of the boto3 documentation is helpful.

Here’s what worked for me:

session = boto3.Session(profile_name='dev')
client = session.client('cloudfront')

Question 62

Just set the profile on the session before creating the client:

boto3.session.Session(profile_name='YOUR_PROFILE_NAME').client('cloudwatch')

Question 63

I found out about the // operator in Python which in Python 3 does division with floor.

Is there an operator which divides with ceil instead? (I know about the / operator which in Python 3 does floating point division.)

Question 64

There is no operator which divides with ceil. You need to import math and use math.ceil.
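A short sketch of both routes follows. The helper name `ceil_div` is ours, not part of the standard library. It relies on the fact that `//` floors, so negating before and after the division rounds up instead. Unlike `math.ceil(a / b)`, it stays in integer arithmetic and so avoids float rounding on very large integers.

```python
import math


def ceil_div(a, b):
    """Ceiling division for integers, built from floor division."""
    # -a // b floors toward negative infinity; negating again rounds up.
    return -(-a // b)


# The answer's approach: true division followed by math.ceil.
via_math = math.ceil(7 / 2)

# Integer-only alternative.
via_floor = ceil_div(7, 2)

# For huge integers the float route can lose precision.
big = 10**20 + 1
float_route = math.ceil(big / 1)
int_route = ceil_div(big, 1)

print(via_math, via_floor, float_route == big, int_route == big)
```

For everyday values the two give the same result; the integer version matters mainly when operands exceed what a float can represent exactly.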
