Python 实用宝典

Question 1

I have a pandas DataFrame with one column:

import pandas as pd

df = pd.DataFrame(
    data={
        "teams": [
            ["SF", "NYG"],
            ["SF", "NYG"],
            ["SF", "NYG"],
            ["SF", "NYG"],
            ["SF", "NYG"],
            ["SF", "NYG"],
            ["SF", "NYG"],
        ]
    }
)

print(df)

Output:

       teams
0  [SF, NYG]
1  [SF, NYG]
2  [SF, NYG]
3  [SF, NYG]
4  [SF, NYG]
5  [SF, NYG]
6  [SF, NYG]

How can split this column of lists into 2 columns?

Question 2

You can use DataFrame constructor with lists created by to_list:

import pandas as pd

d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
                ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
print (df2)
       teams
0  [SF, NYG]
1  [SF, NYG]
2  [SF, NYG]
3  [SF, NYG]
4  [SF, NYG]
5  [SF, NYG]
6  [SF, NYG]

df2[['team1','team2']] = pd.DataFrame(df2.teams.tolist(), index= df2.index)
print (df2)
       teams team1 team2
0  [SF, NYG]    SF   NYG
1  [SF, NYG]    SF   NYG
2  [SF, NYG]    SF   NYG
3  [SF, NYG]    SF   NYG
4  [SF, NYG]    SF   NYG
5  [SF, NYG]    SF   NYG
6  [SF, NYG]    SF   NYG

And for new DataFrame:

df3 = pd.DataFrame(df2['teams'].to_list(), columns=['team1','team2'])
print (df3)
  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

Solution with apply(pd.Series) is very slow:

#7k rows
df2 = pd.concat([df2]*1000).reset_index(drop=True)

In [121]: %timeit df2['teams'].apply(pd.Series)
1.79 s ± 52.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [122]: %timeit pd.DataFrame(df2['teams'].to_list(), columns=['team1','team2'])
1.63 ms ± 54.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Question 3

Much simpler solution:

pd.DataFrame(df2["teams"].to_list(), columns=['team1', 'team2'])

Yields,

  team1 team2
-------------
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG
7    SF   NYG

If you wanted to split a column of delimited strings rather than lists, you could similarly do:

pd.DataFrame(df["teams"].str.split('<delim>', expand=True).values,
             columns=['team1', 'team2'])

Question 4

This solution preserves the index of the df2 DataFrame, unlike any solution that uses tolist():

df3 = df2.teams.apply(pd.Series)
df3.columns = ['team1', 'team2']

Here’s the result:

  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

Question 5

There seems to be a syntactically simpler way, and therefore easier to remember, as opposed to the proposed solutions. I’m assuming that the column is called ‘meta’ in a dataframe df:

df2 = pd.DataFrame(df['meta'].str.split().values.tolist())

Question 6

Based on the previous answers, here is another solution which returns the same result as df2.teams.apply(pd.Series) with a much faster run time:

pd.DataFrame([{x: y for x, y in enumerate(item)} for item in df2['teams'].values.tolist()], index=df2.index)

Timings:

In [1]:
import pandas as pd
d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
                ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
df2 = pd.concat([df2]*1000).reset_index(drop=True)

In [2]: %timeit df2['teams'].apply(pd.Series)

8.27 s ± 2.73 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit pd.DataFrame([{x: y for x, y in enumerate(item)} for item in df2['teams'].values.tolist()], index=df2.index)

35.4 ms ± 5.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Question 7

The above solutions didn’t work for me since I have nan observations in my dataframe. In my case df2[['team1','team2']] = pd.DataFrame(df2.teams.values.tolist(), index= df2.index) yields:

object of type 'float' has no len()

I solve this using list comprehension. Here the replicable example:

import pandas as pd
import numpy as np
d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
            ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
df2.loc[2,'teams'] = np.nan
df2.loc[4,'teams'] = np.nan
df2

output:

        teams
0   [SF, NYG]
1   [SF, NYG]
2   NaN
3   [SF, NYG]
4   NaN
5   [SF, NYG]
6   [SF, NYG]

df2['team1']=np.nan
df2['team2']=np.nan

solving with list comprehension:

for i in [0,1]:
    df2['team{}'.format(str(i+1))]=[k[i] if isinstance(k,list) else k for k in df2['teams']]

df2

yields:

    teams   team1   team2
0   [SF, NYG]   SF  NYG
1   [SF, NYG]   SF  NYG
2   NaN        NaN  NaN
3   [SF, NYG]   SF  NYG
4   NaN        NaN  NaN
5   [SF, NYG]   SF  NYG
6   [SF, NYG]   SF  NYG

Question 8

list comprehension

simple implementation with list comprehension ( my favorite)

df = pd.DataFrame([pd.Series(x) for x in df.teams])
df.columns = ['team_{}'.format(x+1) for x in df.columns]

timing on output:

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 2.71 ms

output:

team_1  team_2
0   SF  NYG
1   SF  NYG
2   SF  NYG
3   SF  NYG
4   SF  NYG
5   SF  NYG
6   SF  NYG

Question 9

Here’s another solution using df.transform and df.set_index:

>>> (df['teams']
       .transform([lambda x:x[0], lambda x:x[1]])
       .set_axis(['team1','team2'],
                  axis=1,
                  inplace=False)
    )

  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

Question 10

In PHP you can just use $_POST for POST and $_GET for GET (Query string) variables. What’s the equivalent in Python?

Question 11

suppose you’re posting a html form with this:

<input type="text" name="username">

If using raw cgi:

import cgi
form = cgi.FieldStorage()
print form["username"]

If using Django, Pylons, Flask or Pyramid:

print request.GET['username'] # for GET form method
print request.POST['username'] # for POST form method

Using Turbogears, Cherrypy:

from cherrypy import request
print request.params['username']

Web.py:

form = web.input()
print form.username

Werkzeug:

print request.form['username']

If using Cherrypy or Turbogears, you can also define your handler function taking a parameter directly:

def index(self, username):
    print username

Google App Engine:

class SomeHandler(webapp2.RequestHandler):
    def post(self):
        name = self.request.get('username') # this will get the value from the field named username
        self.response.write(name) # this will write on the document

So you really will have to choose one of those frameworks.

Question 12

I know this is an old question. Yet it’s surprising that no good answer was given.

First of all the question is completely valid without mentioning the framework. The CONTEXT is a PHP language equivalence. Although there are many ways to get the query string parameters in Python, the framework variables are just conveniently populated. In PHP, $_GET and $_POST are also convenience variables. They are parsed from QUERY_URI and php://input respectively.

In Python, these functions would be os.getenv('QUERY_STRING') and sys.stdin.read(). Remember to import os and sys modules.

We have to be careful with the word “CGI” here, especially when talking about two languages and their commonalities when interfacing with a web server. 1. CGI, as a protocol, defines the data transport mechanism in the HTTP protocol. 2. Python can be configured to run as a CGI-script in Apache. 3. The CGI module in Python offers some convenience functions.

Since the HTTP protocol is language-independent, and that Apache’s CGI extension is also language-independent, getting the GET and POST parameters should bear only syntax differences across languages.

Here’s the Python routine to populate a GET dictionary:

GET={}
args=os.getenv("QUERY_STRING").split('&')

for arg in args: 
    t=arg.split('=')
    if len(t)>1: k,v=arg.split('='); GET[k]=v

and for POST:

POST={}
args=sys.stdin.read().split('&')

for arg in args: 
    t=arg.split('=')
    if len(t)>1: k, v=arg.split('='); POST[k]=v

You can now access the fields as following:

print GET.get('user_id')
print POST.get('user_name')

I must also point out that the CGI module doesn’t work well. Consider this HTTP request:

POST / test.py?user_id=6

user_name=Bob&age=30

Using CGI.FieldStorage().getvalue('user_id') will cause a null pointer exception because the module blindly checks the POST data, ignoring the fact that a POST request can carry GET parameters too.

Question 13

I’ve found nosklo’s answer very extensive and useful! For those, like myself, who might find accessing the raw request data directly also useful, I would like to add the way to do that:

import os, sys

# the query string, which contains the raw GET data
# (For example, for http://example.com/myscript.py?a=b&c=d&e
# this is "a=b&c=d&e")
os.getenv("QUERY_STRING")

# the raw POST data
sys.stdin.read()

Question 14

They are stored in the CGI fieldstorage object.

import cgi
form = cgi.FieldStorage()

print "The user entered %s" % form.getvalue("uservalue")

Question 15

It somewhat depends on what you use as a CGI framework, but they are available in dictionaries accessible to the program. I’d point you to the docs, but I’m not getting through to python.org right now. But this note on mail.python.org will give you a first pointer. Look at the CGI and URLLIB Python libs for more.

Update

Okay, that link busted. Here’s the basic wsgi ref

Question 16

Python is only a language, to get GET and POST data, you need a web framework or toolkit written in Python. Django is one, as Charlie points out, the cgi and urllib standard modules are others. Also available are Turbogears, Pylons, CherryPy, web.py, mod_python, fastcgi, etc, etc.

In Django, your view functions receive a request argument which has request.GET and request.POST. Other frameworks will do it differently.

Question 17

Is the list of Python reserved words and builtins available in a library? I want to do something like:

 from x.y import reserved_words_and_builtins

 if x in reserved_words_and_builtins:
     x += '_'

Question 18

To verify that a string is a keyword you can use keyword.iskeyword; to get the list of reserved keywords you can use keyword.kwlist:

>>> import keyword
>>> keyword.iskeyword('break')
True
>>> keyword.kwlist
['False', 'None', 'True', 'and', 'as', 'assert', 'break', 'class', 'continue', 'def', 
 'del', 'elif', 'else', 'except', 'finally', 'for', 'from', 'global', 'if', 'import', 
 'in', 'is', 'lambda', 'nonlocal', 'not', 'or', 'pass', 'raise', 'return', 'try', 
 'while', 'with', 'yield']

If you want to include built-in names as well (Python 3), then check the builtins module:

>>> import builtins
>>> dir(builtins)
['ArithmeticError', 'AssertionError', 'AttributeError',
 'BaseException', 'BlockingIOError', 'BrokenPipeError', 'BufferError', 'BytesWarning',
 'ChildProcessError', 'ConnectionAbortedError', 'ConnectionError',
 'ConnectionRefusedError', 'ConnectionResetError', 'DeprecationWarning', 'EOFError',
 'Ellipsis', 'EnvironmentError', 'Exception', 'False', 'FileExistsError',
 'FileNotFoundError', 'FloatingPointError', 'FutureWarning', 'GeneratorExit', 'IOError',
 'ImportError', 'ImportWarning', 'IndentationError', 'IndexError',
 'InterruptedError', 'IsADirectoryError', 'KeyError', 'KeyboardInterrupt', 'LookupError',
 'MemoryError', 'NameError', 'None', 'NotADirectoryError', 'NotImplemented',
 'NotImplementedError', 'OSError', 'OverflowError', 'PendingDeprecationWarning',
 'PermissionError', 'ProcessLookupError', 'RecursionError', 'ReferenceError',
 'ResourceWarning', 'RuntimeError', 'RuntimeWarning', 'StopAsyncIteration',
 'StopIteration', 'SyntaxError', 'SyntaxWarning', 'SystemError', 'SystemExit',
 'TabError', 'TimeoutError', 'True', 'TypeError', 'UnboundLocalError',
 'UnicodeDecodeError', 'UnicodeEncodeError', 'UnicodeError', 'UnicodeTranslateError',
 'UnicodeWarning', 'UserWarning', 'ValueError', 'Warning', 'ZeroDivisionError', '_',
 '__build_class__', '__debug__', '__doc__', '__import__', '__loader__', '__name__',
 '__package__', '__spec__', 'abs', 'all', 'any', 'ascii', 'bin', 'bool',
 'bytearray', 'bytes', 'callable', 'chr', 'classmethod', 'compile', 'complex',
 'copyright', 'credits', 'delattr', 'dict', 'dir', 'divmod', 'enumerate', 'eval',
 'exec', 'exit', 'filter', 'float', 'format', 'frozenset', 'getattr',
 'globals', 'hasattr', 'hash', 'help', 'hex', 'id', 'input', 'int',
 'isinstance', 'issubclass', 'iter', 'len', 'license', 'list', 'locals', 'map',
 'max', 'memoryview', 'min', 'next', 'object', 'oct', 'open', 'ord', 'pow',
 'print', 'property', 'quit', 'range', 'repr', 'reversed', 'round', 'set',
 'setattr', 'slice', 'sorted', 'staticmethod', 'str', 'sum', 'super', 'tuple',
 'type', 'vars', 'zip']

For Python 2 you’ll need to use the __builtin__ module

>>> import __builtin__
>>> dir(__builtin__)
['ArithmeticError', 'AssertionError', 'AttributeError', 'BaseException', 'BufferError', 'BytesWarning', 'DeprecationWarning', 'EOFError', 'Ellipsis', 'EnvironmentError', 'Exception', 'False', 'FloatingPointError', 'FutureWarning', 'GeneratorExit', 'IOError', 'ImportError', 'ImportWarning', 'IndentationError', 'IndexError', 'KeyError', 'KeyboardInterrupt', 'LookupError', 'MemoryError', 'NameError', 'None', 'NotImplemented', 'NotImplementedError', 'OSError', 'OverflowError', 'PendingDeprecationWarning', 'ReferenceError', 'RuntimeError', 'RuntimeWarning', 'StandardError', 'StopIteration', 'SyntaxError', 'SyntaxWarning', 'SystemError', 'SystemExit', 'TabError', 'True', 'TypeError', 'UnboundLocalError', 'UnicodeDecodeError', 'UnicodeEncodeError', 'UnicodeError', 'UnicodeTranslateError', 'UnicodeWarning', 'UserWarning', 'ValueError', 'Warning', 'WindowsError', 'ZeroDivisionError', '_', '__debug__', '__doc__', '__import__', '__name__', '__package__', 'abs', 'all', 'any', 'apply', 'basestring', 'bin', 'bool', 'buffer', 'bytearray', 'bytes', 'callable', 'chr', 'classmethod', 'cmp', 'coerce', 'compile', 'complex', 'copyright', 'credits', 'delattr', 'dict', 'dir', 'divmod', 'enumerate', 'eval', 'execfile', 'exit', 'file', 'filter', 'float', 'format', 'frozenset', 'getattr', 'globals', 'hasattr', 'hash', 'help', 'hex', 'id', 'input', 'int', 'intern', 'isinstance', 'issubclass', 'iter', 'len', 'license', 'list', 'locals', 'long', 'map', 'max', 'memoryview', 'min', 'next', 'object', 'oct', 'open', 'ord', 'pow', 'print', 'property', 'quit', 'range', 'raw_input', 'reduce', 'reload', 'repr', 'reversed', 'round', 'set', 'setattr', 'slice', 'sorted', 'staticmethod', 'str', 'sum', 'super', 'tuple', 'type', 'unichr', 'unicode', 'vars', 'xrange', 'zip']

Question 19

I’m working with a large csv file and the next to last column has a string of text that I want to split by a specific delimiter. I was wondering if there is a simple way to do this using pandas or python?

CustNum  CustomerName     ItemQty  Item   Seatblocks                 ItemExt
32363    McCartney, Paul      3     F04    2:218:10:4,6                   60
31316    Lennon, John        25     F01    1:13:36:1,12 1:13:37:1,13     300

I want to split by the space(' ') and then the colon(':') in the Seatblocks column, but each cell would result in a different number of columns. I have a function to rearrange the columns so the Seatblocks column is at the end of the sheet, but I’m not sure what to do from there. I can do it in excel with the built in text-to-columns function and a quick macro, but my dataset has too many records for excel to handle.

Ultimately, I want to take records such John Lennon’s and create multiple lines, with the info from each set of seats on a separate line.

Question 20

This splits the Seatblocks by space and gives each its own row.

In [43]: df
Out[43]: 
   CustNum     CustomerName  ItemQty Item                 Seatblocks  ItemExt
0    32363  McCartney, Paul        3  F04               2:218:10:4,6       60
1    31316     Lennon, John       25  F01  1:13:36:1,12 1:13:37:1,13      300

In [44]: s = df['Seatblocks'].str.split(' ').apply(Series, 1).stack()

In [45]: s.index = s.index.droplevel(-1) # to line up with df's index

In [46]: s.name = 'Seatblocks' # needs a name to join

In [47]: s
Out[47]: 
0    2:218:10:4,6
1    1:13:36:1,12
1    1:13:37:1,13
Name: Seatblocks, dtype: object

In [48]: del df['Seatblocks']

In [49]: df.join(s)
Out[49]: 
   CustNum     CustomerName  ItemQty Item  ItemExt    Seatblocks
0    32363  McCartney, Paul        3  F04       60  2:218:10:4,6
1    31316     Lennon, John       25  F01      300  1:13:36:1,12
1    31316     Lennon, John       25  F01      300  1:13:37:1,13

Or, to give each colon-separated string in its own column:

In [50]: df.join(s.apply(lambda x: Series(x.split(':'))))
Out[50]: 
   CustNum     CustomerName  ItemQty Item  ItemExt  0    1   2     3
0    32363  McCartney, Paul        3  F04       60  2  218  10   4,6
1    31316     Lennon, John       25  F01      300  1   13  36  1,12
1    31316     Lennon, John       25  F01      300  1   13  37  1,13

This is a little ugly, but maybe someone will chime in with a prettier solution.

Question 21

Differently from Dan, I consider his answer quite elegant… but unfortunately it is also very very inefficient. So, since the question mentioned “a large csv file”, let me suggest to try in a shell Dan’s solution:

time python -c "import pandas as pd;
df = pd.DataFrame(['a b c']*100000, columns=['col']);
print df['col'].apply(lambda x : pd.Series(x.split(' '))).head()"

… compared to this alternative:

time python -c "import pandas as pd;
from scipy import array, concatenate;
df = pd.DataFrame(['a b c']*100000, columns=['col']);
print pd.DataFrame(concatenate(df['col'].apply( lambda x : [x.split(' ')]))).head()"

… and this:

time python -c "import pandas as pd;
df = pd.DataFrame(['a b c']*100000, columns=['col']);
print pd.DataFrame(dict(zip(range(3), [df['col'].apply(lambda x : x.split(' ')[i]) for i in range(3)]))).head()"

The second simply refrains from allocating 100 000 Series, and this is enough to make it around 10 times faster. But the third solution, which somewhat ironically wastes a lot of calls to str.split() (it is called once per column per row, so three times more than for the others two solutions), is around 40 times faster than the first, because it even avoids to instance the 100 000 lists. And yes, it is certainly a little ugly…

EDIT: this answer suggests how to use “to_list()” and to avoid the need for a lambda. The result is something like

time python -c "import pandas as pd;
df = pd.DataFrame(['a b c']*100000, columns=['col']);
print pd.DataFrame(df.col.str.split().tolist()).head()"

which is even more efficient than the third solution, and certainly much more elegant.

EDIT: the even simpler

time python -c "import pandas as pd;
df = pd.DataFrame(['a b c']*100000, columns=['col']);
print pd.DataFrame(list(df.col.str.split())).head()"

works too, and is almost as efficient.

EDIT: even simpler! And handles NaNs (but less efficient):

time python -c "import pandas as pd;
df = pd.DataFrame(['a b c']*100000, columns=['col']);
print df.col.str.split(expand=True).head()"

Question 22

import pandas as pd
import numpy as np

df = pd.DataFrame({'ItemQty': {0: 3, 1: 25}, 
                   'Seatblocks': {0: '2:218:10:4,6', 1: '1:13:36:1,12 1:13:37:1,13'}, 
                   'ItemExt': {0: 60, 1: 300}, 
                   'CustomerName': {0: 'McCartney, Paul', 1: 'Lennon, John'}, 
                   'CustNum': {0: 32363, 1: 31316}, 
                   'Item': {0: 'F04', 1: 'F01'}}, 
                    columns=['CustNum','CustomerName','ItemQty','Item','Seatblocks','ItemExt'])

print (df)
   CustNum     CustomerName  ItemQty Item                 Seatblocks  ItemExt
0    32363  McCartney, Paul        3  F04               2:218:10:4,6       60
1    31316     Lennon, John       25  F01  1:13:36:1,12 1:13:37:1,13      300

Another similar solution with chaining is use reset_index and rename:

print (df.drop('Seatblocks', axis=1)
             .join
             (
             df.Seatblocks
             .str
             .split(expand=True)
             .stack()
             .reset_index(drop=True, level=1)
             .rename('Seatblocks')           
             ))

   CustNum     CustomerName  ItemQty Item  ItemExt    Seatblocks
0    32363  McCartney, Paul        3  F04       60  2:218:10:4,6
1    31316     Lennon, John       25  F01      300  1:13:36:1,12
1    31316     Lennon, John       25  F01      300  1:13:37:1,13

If in column are NOT NaN values, the fastest solution is use list comprehension with DataFrame constructor:

df = pd.DataFrame(['a b c']*100000, columns=['col'])

In [141]: %timeit (pd.DataFrame(dict(zip(range(3), [df['col'].apply(lambda x : x.split(' ')[i]) for i in range(3)]))))
1 loop, best of 3: 211 ms per loop

In [142]: %timeit (pd.DataFrame(df.col.str.split().tolist()))
10 loops, best of 3: 87.8 ms per loop

In [143]: %timeit (pd.DataFrame(list(df.col.str.split())))
10 loops, best of 3: 86.1 ms per loop

In [144]: %timeit (df.col.str.split(expand=True))
10 loops, best of 3: 156 ms per loop

In [145]: %timeit (pd.DataFrame([ x.split() for x in df['col'].tolist()]))
10 loops, best of 3: 54.1 ms per loop

But if column contains NaN only works str.split with parameter expand=True which return DataFrame (documentation), and it explain why it is slowier:

df = pd.DataFrame(['a b c']*10, columns=['col'])
df.loc[0] = np.nan
print (df.head())
     col
0    NaN
1  a b c
2  a b c
3  a b c
4  a b c

print (df.col.str.split(expand=True))
     0     1     2
0  NaN  None  None
1    a     b     c
2    a     b     c
3    a     b     c
4    a     b     c
5    a     b     c
6    a     b     c
7    a     b     c
8    a     b     c
9    a     b     c

Question 23

Another approach would be like this:

temp = df['Seatblocks'].str.split(' ')
data = data.reindex(data.index.repeat(temp.apply(len)))
data['new_Seatblocks'] = np.hstack(temp)

Question 24

Can also use groupby() with no need to join and stack().

Use above example data:

import pandas as pd
import numpy as np


df = pd.DataFrame({'ItemQty': {0: 3, 1: 25}, 
                   'Seatblocks': {0: '2:218:10:4,6', 1: '1:13:36:1,12 1:13:37:1,13'}, 
                   'ItemExt': {0: 60, 1: 300}, 
                   'CustomerName': {0: 'McCartney, Paul', 1: 'Lennon, John'}, 
                   'CustNum': {0: 32363, 1: 31316}, 
                   'Item': {0: 'F04', 1: 'F01'}}, 
                    columns=['CustNum','CustomerName','ItemQty','Item','Seatblocks','ItemExt']) 
print(df)

   CustNum     CustomerName  ItemQty Item                 Seatblocks  ItemExt
0  32363    McCartney, Paul  3        F04  2:218:10:4,6               60     
1  31316    Lennon, John     25       F01  1:13:36:1,12 1:13:37:1,13  300  


#first define a function: given a Series of string, split each element into a new series
def split_series(ser,sep):
    return pd.Series(ser.str.cat(sep=sep).split(sep=sep)) 
#test the function, 
split_series(pd.Series(['a b','c']),sep=' ')
0    a
1    b
2    c
dtype: object

df2=(df.groupby(df.columns.drop('Seatblocks').tolist()) #group by all but one column
          ['Seatblocks'] #select the column to be split
          .apply(split_series,sep=' ') # split 'Seatblocks' in each group
         .reset_index(drop=True,level=-1).reset_index()) #remove extra index created

print(df2)
   CustNum     CustomerName  ItemQty Item  ItemExt    Seatblocks
0    31316     Lennon, John       25  F01      300  1:13:36:1,12
1    31316     Lennon, John       25  F01      300  1:13:37:1,13
2    32363  McCartney, Paul        3  F04       60  2:218:10:4,6

Question 25

This seems a far easier method than those suggested elsewhere in this thread.

split rows in pandas dataframe

Question 26

Is there a less verbose alternative to this:

for x in xrange(array.shape[0]):
    for y in xrange(array.shape[1]):
        do_stuff(x, y)

I came up with this:

for x, y in itertools.product(map(xrange, array.shape)):
    do_stuff(x, y)

Which saves one indentation, but is still pretty ugly.

I’m hoping for something that looks like this pseudocode:

for x, y in array.indices:
    do_stuff(x, y)

Does anything like that exist?

Question 27

I think you’re looking for the ndenumerate.

>>> a =numpy.array([[1,2],[3,4],[5,6]])
>>> for (x,y), value in numpy.ndenumerate(a):
...  print x,y
... 
0 0
0 1
1 0
1 1
2 0
2 1

Regarding the performance. It is a bit slower than a list comprehension.

X = np.zeros((100, 100, 100))

%timeit list([((i,j,k), X[i,j,k]) for i in range(X.shape[0]) for j in range(X.shape[1]) for k in range(X.shape[2])])
1 loop, best of 3: 376 ms per loop

%timeit list(np.ndenumerate(X))
1 loop, best of 3: 570 ms per loop

If you are worried about the performance you could optimise a bit further by looking at the implementation of ndenumerate, which does 2 things, converting to an array and looping. If you know you have an array, you can call the .coords attribute of the flat iterator.

a = X.flat
%timeit list([(a.coords, x) for x in a.flat])
1 loop, best of 3: 305 ms per loop

Question 28

If you only need the indices, you could try numpy.ndindex:

>>> a = numpy.arange(9).reshape(3, 3)
>>> [(x, y) for x, y in numpy.ndindex(a.shape)]
[(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]

Question 29

see nditer

import numpy as np
Y = np.array([3,4,5,6])
for y in np.nditer(Y, op_flags=['readwrite']):
    y += 3

Y == np.array([6, 7, 8, 9])

y = 3 would not work, use y *= 0 and y += 3 instead.

Question 30

How can I change any data type into a string in Python?

Question 31

myvariable = 4
mystring = str(myvariable)  # '4'

also, alternatively try repr:

mystring = repr(myvariable) # '4'

This is called “conversion” in python, and is quite common.

Question 32

str is meant to produce a string representation of the object’s data. If you’re writing your own class and you want str to work for you, add:

def __str__(self):
    return "Some descriptive string"

print str(myObj) will call myObj.__str__().

repr is a similar method, which generally produces information on the class info. For most core library object, repr produces the class name (and sometime some class information) between angle brackets. repr will be used, for example, by just typing your object into your interactions pane, without using print or anything else.

You can define the behavior of repr for your own objects just like you can define the behavior of str:

def __repr__(self):
    return "Some descriptive string"

>>> myObj in your interactions pane, or repr(myObj), will result in myObj.__repr__()

Question 33

I see all answers recommend using str(object). It might fail if your object have more than ascii characters and you will see error like ordinal not in range(128). This was the case for me while I was converting list of string in language other than English

I resolved it by using unicode(object)

Question 34

str(object) will do the trick.

If you want to alter the way object is stringified, define __str__(self) method for object’s class. Such method has to return str or unicode object.

Question 35

Use the str built-in:

x = str(something)

Examples:

>>> str(1)
'1'
>>> str(1.0)
'1.0'
>>> str([])
'[]'
>>> str({})
'{}'

...

From the documentation:

Return a string containing a nicely printable representation of an object. For strings, this returns the string itself. The difference with repr(object) is that str(object) does not always attempt to return a string that is acceptable to eval(); its goal is to return a printable string. If no argument is given, returns the empty string, ”.

Question 36

With str(x). However, every data type can define its own string conversion, so this might not be what you want.

Question 37

You can use %s like below

>>> "%s" %([])
'[]'

Question 38

Just use str – for example:

>>> str([])
'[]'

Question 39

Use formatting:

"%s" % (x)

Example:

x = time.ctime(); str = "%s" % (x); print str

Output: Thu Jan 11 20:40:05 2018

Question 40

Be careful if you really want to “change” the data type. Like in other cases (e.g. changing the iterator in a for loop) this might bring up unexpected behaviour:

>> dct = {1:3, 2:1}
>> len(str(dct))
12
>> print(str(dct))
{1: 31, 2: 0}
>> l = ["all","colours"]
>> len(str(l))
18

Question 41

How do you get Jenkins to execute python unittest cases? Is it possible to JUnit style XML output from the builtin unittest package?

Question 42

sample tests:

tests.py:

# tests.py

import random
try:
    import unittest2 as unittest
except ImportError:
    import unittest

class SimpleTest(unittest.TestCase):
    @unittest.skip("demonstrating skipping")
    def test_skipped(self):
        self.fail("shouldn't happen")

    def test_pass(self):
        self.assertEqual(10, 7 + 3)

    def test_fail(self):
        self.assertEqual(11, 7 + 3)

JUnit with pytest

run the tests with:

py.test --junitxml results.xml tests.py

results.xml:

<?xml version="1.0" encoding="utf-8"?>
<testsuite errors="0" failures="1" name="pytest" skips="1" tests="2" time="0.097">
    <testcase classname="tests.SimpleTest" name="test_fail" time="0.000301837921143">
        <failure message="test failure">self = &lt;tests.SimpleTest testMethod=test_fail&gt;

    def test_fail(self):
&gt;       self.assertEqual(11, 7 + 3)
E       AssertionError: 11 != 10

tests.py:16: AssertionError</failure>
    </testcase>
    <testcase classname="tests.SimpleTest" name="test_pass" time="0.000109910964966"/>
    <testcase classname="tests.SimpleTest" name="test_skipped" time="0.000164031982422">
        <skipped message="demonstrating skipping" type="pytest.skip">/home/damien/test-env/lib/python2.6/site-packages/_pytest/unittest.py:119: Skipped: demonstrating skipping</skipped>
    </testcase>
</testsuite>

JUnit with nose

run the tests with:

nosetests --with-xunit

nosetests.xml:

<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="nosetests" tests="3" errors="0" failures="1" skip="1">
    <testcase classname="tests.SimpleTest" name="test_fail" time="0.000">
        <failure type="exceptions.AssertionError" message="11 != 10">
            <![CDATA[Traceback (most recent call last):
File "/opt/python-2.6.1/lib/python2.6/site-packages/unittest2-0.5.1-py2.6.egg/unittest2/case.py", line 340, in run
testMethod()
File "/home/damien/tests.py", line 16, in test_fail
self.assertEqual(11, 7 + 3)
File "/opt/python-2.6.1/lib/python2.6/site-packages/unittest2-0.5.1-py2.6.egg/unittest2/case.py", line 521, in assertEqual
assertion_func(first, second, msg=msg)
File "/opt/python-2.6.1/lib/python2.6/site-packages/unittest2-0.5.1-py2.6.egg/unittest2/case.py", line 514, in _baseAssertEqual
raise self.failureException(msg)
AssertionError: 11 != 10
]]>
        </failure>
    </testcase>
    <testcase classname="tests.SimpleTest" name="test_pass" time="0.000"></testcase>
    <testcase classname="tests.SimpleTest" name="test_skipped" time="0.000">
        <skipped type="nose.plugins.skip.SkipTest" message="demonstrating skipping">
            <![CDATA[SkipTest: demonstrating skipping
]]>
        </skipped>
    </testcase>
</testsuite>

JUnit with nose2

You would need to use the nose2.plugins.junitxml plugin. You can configure nose2 with a config file like you would normally do, or with the --plugin command-line option.

run the tests with:

nose2 --plugin nose2.plugins.junitxml --junit-xml tests

nose2-junit.xml:

<testsuite errors="0" failures="1" name="nose2-junit" skips="1" tests="3" time="0.001">
  <testcase classname="tests.SimpleTest" name="test_fail" time="0.000126">
    <failure message="test failure">Traceback (most recent call last):
  File "/Users/damien/Work/test2/tests.py", line 18, in test_fail
    self.assertEqual(11, 7 + 3)
AssertionError: 11 != 10
</failure>
  </testcase>
  <testcase classname="tests.SimpleTest" name="test_pass" time="0.000095" />
  <testcase classname="tests.SimpleTest" name="test_skipped" time="0.000058">
    <skipped />
  </testcase>
</testsuite>

JUnit with unittest-xml-reporting

Append the following to tests.py

if __name__ == '__main__':
    import xmlrunner
    unittest.main(testRunner=xmlrunner.XMLTestRunner(output='test-reports'))

run the tests with:

python tests.py

test-reports/TEST-SimpleTest-20131001140629.xml:

<?xml version="1.0" ?>
<testsuite errors="1" failures="0" name="SimpleTest-20131001140629" tests="3" time="0.000">
    <testcase classname="SimpleTest" name="test_pass" time="0.000"/>
    <testcase classname="SimpleTest" name="test_fail" time="0.000">
        <error message="11 != 10" type="AssertionError">
<![CDATA[Traceback (most recent call last):
  File "tests.py", line 16, in test_fail
    self.assertEqual(11, 7 + 3)
AssertionError: 11 != 10
]]>     </error>
    </testcase>
    <testcase classname="SimpleTest" name="test_skipped" time="0.000">
        <skipped message="demonstrating skipping" type="skip"/>
    </testcase>
    <system-out>
<![CDATA[]]>    </system-out>
    <system-err>
<![CDATA[]]>    </system-err>
</testsuite>

Question 43

I would second using nose. Basic XML reporting is now built in. Just use the –with-xunit command line option and it will produce a nosetests.xml file. For example:

nosetests –with-xunit

Then add a “Publish JUnit test result report” post build action, and fill in the “Test report XMLs” field with nosetests.xml (assuming that you ran nosetests in $WORKSPACE).

Question 44

You can install the unittest-xml-reporting package to add a test runner that generates XML to the built-in unittest.

We use pytest, which has XML output built in (it’s a command line option).

Either way, executing the unit tests can be done by running a shell command.

Question 45

I used nosetests. There are addons to output the XML for Jenkins

Question 46

When using buildout we use collective.xmltestreport to produce JUnit-style XML output, perhaps it’s source code or the module itself could be of help.

Question 47

python -m pytest --junit-xml=pytest_unit.xml source_directory/test/unit || true # tests may fail

Run this as shell from jenkins , you can get the report in pytest_unit.xml as artifact.

Question 48

When using setuptools, I can not get the installer to pull in any package_data files. Everything I’ve read says that the following is the correct way to do it. Can someone please advise?

setup(
   name='myapp',
   packages=find_packages(),
   package_data={
      'myapp': ['data/*.txt'],
   },
   include_package_data=True,
   zip_safe=False,
   install_requires=['distribute'],
)

where myapp/data/ is the location of the data files.

Question 49

I realize that this is an old question, but for people finding their way here via Google: package_data is a low-down, dirty lie. It is only used when building binary packages (python setup.py bdist ...) but not when building source packages (python setup.py sdist ...). This is, of course, ridiculous — one would expect that building a source distribution would result in a collection of files that could be sent to someone else to built the binary distribution.

In any case, using MANIFEST.in will work both for binary and for source distributions.

Question 50

I just had this same issue. The solution, was simply to remove include_package_data=True.

After reading here, I realized that include_package_data aims to include files from version control, as opposed to merely “include package data” as the name implies. From the docs:

The data files [of include_package_data] must be under CVS or Subversion control

…

If you want finer-grained control over what files are included (for example, if you have documentation files in your package directories and want to exclude them from installation), then you can also use the package_data keyword.

Taking that argument out fixed it, which is coincidentally why it also worked when you switched to distutils, since it doesn’t take that argument.

Question 51

Following @Joe ‘s recommendation to remove the include_package_data=True line also worked for me.

To elaborate a bit more, I have no MANIFEST.in file. I use Git and not CVS.

Repository takes this kind of shape:

/myrepo
    - .git/
    - setup.py
    - myproject
        - __init__.py
        - some_mod
            - __init__.py
            - animals.py
            - rocks.py
        - config
            - __init__.py
            - settings.py
            - other_settings.special
            - cool.huh
            - other_settings.xml
        - words
            - __init__.py
            word_set.txt

setup.py:

from setuptools import setup, find_packages
import os.path

setup (
    name='myproject',
    version = "4.19",
    packages = find_packages(),  
    # package_dir={'mypkg': 'src/mypkg'},  # didnt use this.
    package_data = {
        # If any package contains *.txt or *.rst files, include them:
        '': ['*.txt', '*.xml', '*.special', '*.huh'],
    },

#
    # Oddly enough, include_package_data=True prevented package_data from working.
    # include_package_data=True, # Commented out.
    data_files=[
#               ('bitmaps', ['bm/b1.gif', 'bm/b2.gif']),
        ('/opt/local/myproject/etc', ['myproject/config/settings.py', 'myproject/config/other_settings.special']),
        ('/opt/local/myproject/etc', [os.path.join('myproject/config', 'cool.huh')]),
#
        ('/opt/local/myproject/etc', [os.path.join('myproject/config', 'other_settings.xml')]),
        ('/opt/local/myproject/data', [os.path.join('myproject/words', 'word_set.txt')]),
    ],

    install_requires=[ 'jsonschema',
        'logging', ],

     entry_points = {
        'console_scripts': [
            # Blah...
        ], },
)

I run python setup.py sdist for a source distrib (haven’t tried binary).

And when inside of a brand new virtual environment, I have a myproject-4.19.tar.gz, file, and I use

(venv) pip install ~/myproject-4.19.tar.gz
...

And other than everything getting installed to my virtual environment’s site-packages, those special data files get installed to /opt/local/myproject/data and /opt/local/myproject/etc.

Question 52

include_package_data=True worked for me.

If you use git, remember to include setuptools-git in install_requires. Far less boring than having a Manifest or including all path in package_data ( in my case it’s a django app with all kind of statics )

( pasted the comment I made, as k3-rnc mentioned it’s actually helpful as is )

Question 53

Update: This answer is old and the information is no longer valid. All setup.py configs should use import setuptools. I’ve added a more complete answer at https://stackoverflow.com/a/49501350/64313

I solved this by switching to distutils. Looks like distribute is deprecated and/or broken.

from distutils.core import setup

setup(
   name='myapp',
   packages=['myapp'],
   package_data={
      'myapp': ['data/*.txt'],
   },
)

Question 54

Ancient question and yet… package management of python really leaves a lot to be desired. So I had the use case of installing using pip locally to a specified directory and was surprised both package_data and data_files paths did not work out. I was not keen on adding yet another file to the repo so I ended up leveraging data_files and setup.py option –install-data; something like this

pip install . --install-option="--install-data=$PWD/package" -t package

Question 55

Moving the folder containing the package data into to module folder solved the problem for me.

See this question: MANIFEST.in ignored on “python setup.py install” – no data files installed?

Question 56

I had the same problem for a couple of days but even this thread wasn’t able to help me as everything was confusing. So I did my research and found the following solution:

Basically in this case, you should do:

from setuptools import setup

setup(
   name='myapp',
   packages=['myapp'],
   package_dir={'myapp':'myapp'}, # the one line where all the magic happens
   package_data={
      'myapp': ['data/*.txt'],
   },
)

The full other stackoverflow answer here

Question 57

Just remove the line:

include_package_data=True,

from your setup script, and it will work fine. (Tested just now with latest setuptools.)

Question 58

Using setup.cfg (setuptools ≥ 30.3.0)

Starting with setuptools 30.3.0 (released 2016-12-08), you can keep your setup.py very small and move the configuration to a setup.cfg file. With this approach, you could put your package data in an [options.package_data] section:

[options.package_data]
* = *.txt, *.rst
hello = *.msg

In this case, your setup.py can be as short as:

from setuptools import setup
setup()

For more information, see configuring setup using setup.cfg files.

There is some talk of deprecating setup.cfg in favour of pyproject.toml as proposed in PEP 518, but this is still provisional as of 2020-02-21.

Question 59

Python recognizes the following as instruction which defines file’s encoding:

# -*- coding: utf-8 -*-

I definitely saw this kind of instructions before (-*- var: value -*-). Where does it come from? What is the full specification, e.g. can the value include spaces, special symbols, newlines, even -*- itself?

My program will be writing plain text files and I’d like to include some metadata in them using this format.

Question 60

This way of specifying the encoding of a Python file comes from PEP 0263 – Defining Python Source Code Encodings.

It is also recognized by GNU Emacs (see Python Language Reference, 2.1.4 Encoding declarations), though I don’t know if it was the first program to use that syntax.

Question 61

# -*- coding: utf-8 -*- is a Python 2 thing. In Python 3+, the default encoding of source files is already UTF-8 and that line is useless.

See: Should I use encoding declaration in Python 3?

pyupgrade is a tool you can run on your code to remove those comments and other no-longer-useful leftovers from Python 2, like having all your classes inherit from object.

Question 62

This is so called file local variables, that are understood by Emacs and set correspondingly. See corresponding section in Emacs manual – you can define them either in header or in footer of file

Question 63

In PyCharm, I’d leave it out. It turns off the UTF-8 indicator at the bottom with a warning that the encoding is hard-coded. Don’t think you need the PyCharm comment mentioned above.

Question 64

I have version 2.7 installed from early 2012. I can’t find any consensus on whether I should completely uninstall and wipe this version before putting on the latest version.

“Soft”-removing old versions? Hard-removing/wiping old versions? Installing over top?

I’ve seen somewhere a special install/upgrade process using a “segmenting” method of Python installations, keeping different versions separate and apart, but functional. Not sure if this is the standard, de facto way.

I also wonder if Revo gets too overzealous and may cause issues with wiping out still-needed remnants, like environment/PATH variables.

(Win7 x64, 32-bit Python)

问题：熊猫将列表的一列分为多列

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

问题：如何在Python中处理POST和GET变量？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

问题：库中是否提供Python保留字和内建函数的列表？

回答 0

问题：熊猫：如何将一列中的文本分成多行？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

问题：遍历一个numpy数组

回答 0

回答 1

回答 2

问题：如何在Python中将任何数据类型更改为字符串

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

问题：Jenkins中的Python单元测试？

回答 0

样本测试：

sample tests:

回答 1

回答 2

回答 3

回答 4

回答 5

问题：如何在setuptools / distribute中包含软件包数据？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

使用setup.cfg（setuptools≥30.3.0）

Using setup.cfg (setuptools ≥ 30.3.0)

问题：这是从哪里来的：-*-编码：utf-8-*-

回答 0

回答 1

回答 2

回答 3

问题：如何更新Python？

回答 0

TL; DR

答案取决于：

TL;DR

The answer depends:

回答 1

回答 2

回答 3

有趣好用的Python教程

问题：这是从哪里来的：--编码：utf-8--