Tag Archives: Python

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)

Question: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)


I’m using NLTK to perform kmeans clustering on my text file in which each line is considered as a document. So for example, my text file is something like this:

belong finger death punch
hasty
mike hasty walls jericho
jägermeister rules
rules bands follow performing jägermeister stage
approach 

Now the demo code I’m trying to run is this:

import sys

import numpy
from nltk.cluster import KMeansClusterer, GAAClusterer, euclidean_distance
import nltk.corpus
from nltk import decorators
import nltk.stem

stemmer_func = nltk.stem.EnglishStemmer().stem
stopwords = set(nltk.corpus.stopwords.words('english'))

@decorators.memoize
def normalize_word(word):
    return stemmer_func(word.lower())

def get_words(titles):
    words = set()
    for title in job_titles:
        for word in title.split():
            words.add(normalize_word(word))
    return list(words)

@decorators.memoize
def vectorspaced(title):
    title_components = [normalize_word(word) for word in title.split()]
    return numpy.array([
        word in title_components and not word in stopwords
        for word in words], numpy.short)

if __name__ == '__main__':

    filename = 'example.txt'
    if len(sys.argv) == 2:
        filename = sys.argv[1]

    with open(filename) as title_file:

        job_titles = [line.strip() for line in title_file.readlines()]

        words = get_words(job_titles)

        # cluster = KMeansClusterer(5, euclidean_distance)
        cluster = GAAClusterer(5)
        cluster.cluster([vectorspaced(title) for title in job_titles if title])

        # NOTE: This is inefficient, cluster.classify should really just be
        # called when you are classifying previously unseen examples!
        classified_examples = [
                cluster.classify(vectorspaced(title)) for title in job_titles
            ]

        for cluster_id, title in sorted(zip(classified_examples, job_titles)):
            print cluster_id, title

(which can also be found here)

The error I receive is this:

Traceback (most recent call last):
File "cluster_example.py", line 40, in
words = get_words(job_titles)
File "cluster_example.py", line 20, in get_words
words.add(normalize_word(word))
File "", line 1, in
File "/usr/local/lib/python2.7/dist-packages/nltk/decorators.py", line 183, in memoize
result = func(*args)
File "cluster_example.py", line 14, in normalize_word
return stemmer_func(word.lower())
File "/usr/local/lib/python2.7/dist-packages/nltk/stem/snowball.py", line 694, in stem
word = (word.replace(u"\u2019", u"\x27")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)

What is happening here?


Answer 0


The file is being read as a bunch of strs, but it should be unicodes. Python tries to implicitly convert, but fails. Change:

job_titles = [line.strip() for line in title_file.readlines()]

to explicitly decode the strs to unicode (here assuming UTF-8):

job_titles = [line.decode('utf-8').strip() for line in title_file.readlines()]

It could also be solved by importing the codecs module and using codecs.open rather than the built-in open.
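
For illustration, a minimal sketch of the codecs.open alternative (assuming the same filename variable as in the question):

import codecs

# codecs.open decodes each line to unicode as it is read,
# so no per-line .decode('utf-8') is needed afterwards.
with codecs.open(filename, encoding='utf-8') as title_file:
    job_titles = [line.strip() for line in title_file.readlines()]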


Answer 1


This works fine for me.

f = open(file_path, 'r+', encoding="utf-8")

You can add a third parameter, encoding, to ensure the encoding type is 'utf-8'.

Note: this method works fine in Python 3; I did not try it in Python 2.7.


Answer 2


For me there was a problem with the terminal encoding. Adding UTF-8 to .bashrc solved the problem:

export LC_CTYPE=en_US.UTF-8

Don’t forget to reload .bashrc afterwards:

source ~/.bashrc

Answer 3


You can try this also:

import sys
reload(sys)
sys.setdefaultencoding('utf8')

Answer 4


On Ubuntu 18.04 using Python 3.6, I solved the problem by doing both:

with open(filename, encoding="utf-8") as lines:

and if you are running the tool as command line:

export LC_ALL=C.UTF-8

Note that if you are on Python 2.7, you have to handle this differently. First you have to set the default encoding:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

and then to load the file you must use io.open to set the encoding:

import io
with io.open(filename, 'r', encoding='utf-8') as lines:

You still need to export the env

export LC_ALL=C.UTF-8

Answer 5


I got this error when trying to install a python package in a Docker container. For me, the issue was that the docker image did not have a locale configured. Adding the following code to the Dockerfile solved the problem for me.

# Avoid ascii errors when reading files in Python
RUN apt-get install -y locales && locale-gen en_US.UTF-8
ENV LANG='en_US.UTF-8' LANGUAGE='en_US:en' LC_ALL='en_US.UTF-8'

Answer 6


To find ANY and ALL unicode-error-related content, use the following command:

grep -r -P '[^\x00-\x7f]' /etc/apache2 /etc/letsencrypt /etc/nginx

Found mine in

/etc/letsencrypt/options-ssl-nginx.conf:        # The following CSP directives don't use default-src as 

Using shed, I found the offending sequence. It turned out to be an editor mistake.

00008099:     C2  194 302 11000010
00008100:     A0  160 240 10100000
00008101:  d  64  100 144 01100100
00008102:  e  65  101 145 01100101
00008103:  f  66  102 146 01100110
00008104:  a  61  097 141 01100001
00008105:  u  75  117 165 01110101
00008106:  l  6C  108 154 01101100
00008107:  t  74  116 164 01110100
00008108:  -  2D  045 055 00101101
00008109:  s  73  115 163 01110011
00008110:  r  72  114 162 01110010
00008111:  c  63  099 143 01100011
00008112:     C2  194 302 11000010
00008113:     A0  160 240 10100000

Answer 7


You can try this before using the job_titles string:

source = unicode(job_titles, 'utf-8')
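
Note that job_titles in the question's code is a list, so a hypothetical per-element adaptation (Python 2) would look like:

job_titles = [unicode(title, 'utf-8') for title in job_titles]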

Answer 8


For Python 3, the default encoding is "utf-8". In case of any problem, the following steps are suggested in the base documentation: https://docs.python.org/2/library/csv.html#csv-examples

  1. Create a function

    def utf_8_encoder(unicode_csv_data):
        for line in unicode_csv_data:
            yield line.encode('utf-8')
    
  2. Then use the function inside the reader, e.g.:

    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data))
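
A hypothetical end-to-end sketch (Python 2, following the csv docs recipe linked above); utf_8_encoder is the function from step 1, and unicode_lines stands in for any iterable of unicode strings:

import csv

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

unicode_lines = [u'a,b,c', u'd,e,f']
for row in csv.reader(utf_8_encoder(unicode_lines)):
    print(row)  # ['a', 'b', 'c'], then ['d', 'e', 'f']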
    

Answer 9


Python 3.x or higher

  1. load file in byte stream:

    body = ''
    for lines in open('website/index.html', 'rb'):
        decodedLine = lines.decode('utf-8')
        body = body + decodedLine.strip()
    return body

  2. use global setting:

    import io
    import sys
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')


Answer 10


Use open(fn, 'rb').read().decode('utf-8') instead of just open(fn).read()


How can I read a function's signature including default argument values?

Question: How can I read a function's signature including default argument values?


Given a function object, how can I get its signature? For example, for:

def myMethod(first, second, third='something'):
    pass

I would like to get "myMethod(first, second, third='something')".


Answer 0


import inspect

def foo(a, b, x='blah'):
    pass

print(inspect.getargspec(foo))
# ArgSpec(args=['a', 'b', 'x'], varargs=None, keywords=None, defaults=('blah',))

However, note that inspect.getargspec() is deprecated since Python 3.0.

Python 3.0–3.4 recommends inspect.getfullargspec().

Python 3.5+ recommends inspect.signature().
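
A minimal sketch of the two recommended replacements (names and output come from the standard inspect module):

import inspect

def foo(a, b, x='blah'):
    pass

# Python 3.5+: signature() returns a Signature object whose str() is the parameter list.
print(inspect.signature(foo))
# (a, b, x='blah')

# Python 3.0-3.4: getfullargspec() is the getargspec() analogue.
print(inspect.getfullargspec(foo))
# FullArgSpec(args=['a', 'b', 'x'], varargs=None, varkw=None, defaults=('blah',), ...)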


Answer 1


Arguably the easiest way to find the signature for a function would be help(function):

>>> def function(arg1, arg2="foo", *args, **kwargs): pass
>>> help(function)
Help on function function in module __main__:

function(arg1, arg2='foo', *args, **kwargs)

Also, in Python 3 a method was added to the inspect module called signature, which is designed to represent the signature of a callable object and its return annotation:

>>> from inspect import signature
>>> def foo(a, *, b:int, **kwargs):
...     pass

>>> sig = signature(foo)

>>> str(sig)
'(a, *, b:int, **kwargs)'

>>> str(sig.parameters['b'])
'b:int'

>>> sig.parameters['b'].annotation
<class 'int'>

Answer 2

#! /usr/bin/env python

import inspect
from collections import namedtuple

DefaultArgSpec = namedtuple('DefaultArgSpec', 'has_default default_value')

def _get_default_arg(args, defaults, arg_index):
    """ Method that determines if an argument has default value or not,
    and if yes what is the default value for the argument

    :param args: array of arguments, eg: ['first_arg', 'second_arg', 'third_arg']
    :param defaults: array of default values, eg: (42, 'something')
    :param arg_index: index of the argument in the argument array for which,
    this function checks if a default value exists or not. And if default value
    exists it would return the default value. Example argument: 1
    :return: Tuple of whether there is a default or not, and if yes the default
    value, eg: for index 2 i.e. for "second_arg" this function returns (True, 42)
    """
    if not defaults:
        return DefaultArgSpec(False, None)

    args_with_no_defaults = len(args) - len(defaults)

    if arg_index < args_with_no_defaults:
        return DefaultArgSpec(False, None)
    else:
        value = defaults[arg_index - args_with_no_defaults]
        if (type(value) is str):
            value = '"%s"' % value
        return DefaultArgSpec(True, value)

def get_method_sig(method):
    """ Given a function, it returns a string that pretty much looks how the
    function signature would be written in python.

    :param method: a python method
    :return: A string describing the python method signature.
    eg: "my_method(first_arg, second_arg=42, third_arg='something')"
    """

    # The return value of ArgSpec is a bit weird, as the list of arguments and
    # list of defaults are returned in separate arrays.
    # eg: ArgSpec(args=['first_arg', 'second_arg', 'third_arg'],
    # varargs=None, keywords=None, defaults=(42, 'something'))
    argspec = inspect.getargspec(method)
    arg_index=0
    args = []

    # Use the args and defaults array returned by argspec and find out
    # which arguments has default
    for arg in argspec.args:
        default_arg = _get_default_arg(argspec.args, argspec.defaults, arg_index)
        if default_arg.has_default:
            args.append("%s=%s" % (arg, default_arg.default_value))
        else:
            args.append(arg)
        arg_index += 1
    return "%s(%s)" % (method.__name__, ", ".join(args))


if __name__ == '__main__':
    def my_method(first_arg, second_arg=42, third_arg='something'):
        pass

    print get_method_sig(my_method)
    # my_method(first_arg, second_arg=42, third_arg="something")

Answer 3


Try calling help on an object to find out about it.

>>> foo = [1, 2, 3]
>>> help(foo.append)
Help on built-in function append:

append(...)
    L.append(object) -- append object to end

Answer 4


Maybe a bit late to the party, but if you also want to keep the order of the arguments and their defaults, then you can use the Abstract Syntax Tree module (ast).

Here’s a proof of concept (beware: the code to sort the arguments and match them to their defaults can definitely be improved/made clearer):

import ast
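# Note: 'module' below is assumed to be an ast.Module obtained beforehand,
# e.g. module = ast.parse(source_text) for some source string.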

for class_ in [c for c in module.body if isinstance(c, ast.ClassDef)]:
    for method in [m for m in class_.body if isinstance(m, ast.FunctionDef)]:
        args = []
        if method.args.args:
            [args.append([a.col_offset, a.id]) for a in method.args.args]
        if method.args.defaults:
            [args.append([a.col_offset, '=' + a.id]) for a in method.args.defaults]
        sorted_args = sorted(args)
        for i, p in enumerate(sorted_args):
            if p[1].startswith('='):
                sorted_args[i-1][1] += p[1]
        sorted_args = [k[1] for k in sorted_args if not k[1].startswith('=')]

        if method.args.vararg:
            sorted_args.append('*' + method.args.vararg)
        if method.args.kwarg:
            sorted_args.append('**' + method.args.kwarg)

        signature = '(' + ', '.join(sorted_args) + ')'

        print method.name + signature

Answer 5


If all you’re trying to do is print the function then use pydoc.

import pydoc    

def foo(arg1, arg2, *args, **kwargs):                                                                    
    '''Some foo fn'''                                                                                    
    pass                                                                                                 

>>> print pydoc.render_doc(foo).splitlines()[2]
foo(arg1, arg2, *args, **kwargs)

If you’re trying to actually analyze the function signature, then use argspec from the inspect module. I had to do that when validating a user’s hook script function into a general framework.


Answer 6


Example code:

import inspect
from collections import OrderedDict


def get_signature(fn):
    params = inspect.signature(fn).parameters
    args = []
    kwargs = OrderedDict()
    for p in params.values():
        if p.default is p.empty:
            args.append(p.name)
        else:
            kwargs[p.name] = p.default
    return args, kwargs


def test_sig():
    def fn(a, b, c, d=3, e="abc"):
        pass

    assert get_signature(fn) == (
        ["a", "b", "c"], OrderedDict([("d", 3), ("e", "abc")])
    )

Answer 7


Use %pdef in the command line (IPython); it will print only the signature.

e.g. %pdef np.loadtxt

 np.loadtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0, encoding='bytes')

Should all Python classes extend object?

Question: Should all Python classes extend object?


I have found that both of the following work:

class Foo():
    def a(self):
        print "hello"

class Foo(object):
    def a(self):
        print "hello"

Should all Python classes extend object? Are there any potential problems with not extending object?


Answer 0


In Python 2, not inheriting from object will create an old-style class, which, amongst other effects, causes type to give different results:

>>> class Foo: pass
... 
>>> type(Foo())
<type 'instance'>

vs.

>>> class Bar(object): pass
... 
>>> type(Bar())
<class '__main__.Bar'>

Also the rules for multiple inheritance are different in ways that I won’t even try to summarize here. All good documentation that I’ve seen about MI describes new-style classes.

Finally, old-style classes have disappeared in Python 3, and inheritance from object has become implicit. So, always prefer new style classes unless you need backward compat with old software.


Answer 1


In Python 3, classes extend object implicitly, whether you say so yourself or not.

In Python 2, there are old-style and new-style classes. To signal that a class is new-style, you have to inherit explicitly from object. If not, the old-style implementation is used.

You generally want a new-style class. Inherit from object explicitly. Note that this also applies to Python 3 code that aims to be compatible with Python 2.
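
A minimal sketch of the distinction on Python 2 (on Python 3 both forms behave the same):

class OldStyle:           # old-style on Python 2, new-style on Python 3
    pass

class NewStyle(object):   # new-style on both
    pass

print(issubclass(NewStyle, object))  # True everywhere
# On Python 2, issubclass(OldStyle, object) is False; on Python 3 it is True.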


Answer 2


In Python 3 you can create a class in three different ways, and internally they are all equal (see the examples). It doesn’t matter how you create a class: all classes in Python 3 inherit from a special class called object. The class object is a fundamental class in Python and provides a lot of functionality, such as double-underscore methods, descriptors, the super() method, the property() method, etc.

Example 1.

class MyClass:
    pass

Example 2.

class MyClass():
    pass

Example 3.

class MyClass(object):
    pass
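
A quick check (Python 3), with the classes renamed A, B and C so they can coexist, showing that all three forms have object as their only base:

class A: pass
class B(): pass
class C(object): pass

print(A.__bases__, B.__bases__, C.__bases__)
# (<class 'object'>,) (<class 'object'>,) (<class 'object'>,)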

Answer 3


Yes, all Python classes should extend (or rather subclass, this is Python here) object. While normally no serious problems will occur, in some cases (as with multiple inheritance trees) this will be important. This also ensures better compatibility with Python 3.


Answer 4


As other answers have covered, inheritance from object is implicit in Python 3. But they do not state what you should do or what the convention is.

The Python 3 documentation examples all use the following style, which is the convention, so I suggest you follow it for any future code in Python 3.

class Foo:
    pass

Source: https://docs.python.org/3/tutorial/classes.html#class-objects

Example quote:

Class objects support two kinds of operations: attribute references and instantiation.

Attribute references use the standard syntax used for all attribute references in Python: obj.name. Valid attribute names are all the names that were in the class’s namespace when the class object was created. So, if the class definition looked like this:

class MyClass:
    """A simple example class"""
    i = 12345

    def f(self):
        return 'hello world'

Another quote:

Generally speaking, instance variables are for data unique to each instance and class variables are for attributes and methods shared by all instances of the class:

class Dog:

    kind = 'canine'         # class variable shared by all instances

    def __init__(self, name):
        self.name = name    # instance variable unique to each instance

Answer 5


In Python 3 there isn’t a difference, but in Python 2 not extending object gives you an old-style class; you’d want to use a new-style class over an old-style class.


Nested defaultdict of defaultdict

Question: Nested defaultdict of defaultdict


Is there a way to make a defaultdict also be the default for the defaultdict? (i.e. infinite-level recursive defaultdict?)

I want to be able to do:

x = defaultdict(...stuff...)
x[0][1][0]
{}

So, I can do x = defaultdict(defaultdict), but that’s only a second level:

x[0]
{}
x[0][0]
KeyError: 0

There are recipes that can do this. But can it be done simply just using the normal defaultdict arguments?

Note this is asking how to do an infinite-level recursive defaultdict, so it’s distinct from Python: defaultdict of defaultdict?, which was how to do a two-level defaultdict.

I’ll probably just end up using the bunch pattern, but when I realized I didn’t know how to do this, it got me interested.


Answer 0


For an arbitrary number of levels:

from collections import defaultdict

def rec_dd():
    return defaultdict(rec_dd)

>>> x = rec_dd()
>>> x['a']['b']['c']['d']
defaultdict(<function rec_dd at 0x7f0dcef81500>, {})
>>> print json.dumps(x)
{"a": {"b": {"c": {"d": {}}}}}

Of course you could also do this with a lambda, but I find lambdas to be less readable. In any case it would look like this:

rec_dd = lambda: defaultdict(rec_dd)

Answer 1


The other answers here tell you how to create a defaultdict which contains “infinitely many” defaultdicts, but they fail to address what I think may have been your initial need, which was to simply have a two-depth defaultdict.

You may have been looking for:

defaultdict(lambda: defaultdict(dict))

The reasons why you might prefer this construct are:

  • It is more explicit than the recursive solution, and therefore likely more understandable to the reader.
  • This enables the “leaf” of the defaultdict to be something other than a dictionary, e.g. defaultdict(lambda: defaultdict(list)) or defaultdict(lambda: defaultdict(set)) (see the sketch below).
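
For illustration, a minimal sketch of the two-depth construct and a list-leaf variant:

from collections import defaultdict

d = defaultdict(lambda: defaultdict(dict))
d['x']['y']['z'] = 1    # the two defaultdict levels auto-create; the leaf is a plain dict
print(d['x']['y'])      # {'z': 1}

d2 = defaultdict(lambda: defaultdict(list))
d2['x']['y'].append(1)  # list leaves can be appended to directly
print(d2['x']['y'])     # [1]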

Answer 2


There is a nifty trick for doing that:

tree = lambda: defaultdict(tree)

Then you can create your x with x = tree().
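
For instance, a small sketch of the trick in use:

from collections import defaultdict

tree = lambda: defaultdict(tree)
x = tree()
x['a']['b']['c'] = 42   # every missing level springs into existence on access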


Answer 3


Similar to BrenBarn’s solution, but doesn’t contain the name of the variable tree twice, so it works even after changes to the variable dictionary:

tree = (lambda f: f(f))(lambda a: (lambda: defaultdict(a(a))))

Then you can create each new x with x = tree().


For the def version, we can use function closure scope to protect the data structure from the flaw where existing instances stop working if the tree name is rebound. It looks like this:

from collections import defaultdict

def tree():
    def the_tree():
        return defaultdict(the_tree)
    return the_tree()

Answer 4


I would also propose a more OOP-styled implementation, which supports infinite nesting as well as a properly formatted repr.

from collections import defaultdict

class NestedDefaultDict(defaultdict):
    def __init__(self, *args, **kwargs):
        super(NestedDefaultDict, self).__init__(NestedDefaultDict, *args, **kwargs)

    def __repr__(self):
        return repr(dict(self))

Usage:

my_dict = NestedDefaultDict()
my_dict['a']['b'] = 1
my_dict['a']['c']['d'] = 2
my_dict['b']

print(my_dict)  # {'a': {'b': 1, 'c': {'d': 2}}, 'b': {}}

Answer 5


Here is a recursive function to convert a recursive default dict to a normal dict:

def defdict_to_dict(defdict, finaldict):
    # pass in an empty dict for finaldict
    for k, v in defdict.items():
        if isinstance(v, defaultdict):
            # new level created and that is the new value
            finaldict[k] = defdict_to_dict(v, {})
        else:
            finaldict[k] = v
    return finaldict

defdict_to_dict(my_rec_default_dict, {})

Answer 6


I based this off Andrew’s answer here. If you are looking to load data from a json or an existing dict into the nested defaultdict, see this example:

def nested_defaultdict(existing=None, **kwargs):
    if existing is None:
        existing = {}
    if not isinstance(existing, dict):
        return existing
    existing = {key: nested_defaultdict(val) for key, val in existing.items()}
    return defaultdict(nested_defaultdict, existing, **kwargs)

https://gist.github.com/nucklehead/2d29628bb49115f3c30e78c071207775
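
A hypothetical usage sketch, assuming the nested_defaultdict function above and some existing plain-dict data:

data = nested_defaultdict({'a': {'b': 1}})
data['a']['c']['d'] = 2   # missing levels are still created on the fly
print(data['a']['b'])     # 1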


Remove unwanted parts from strings in a column

Question: Remove unwanted parts from strings in a column


I am looking for an efficient way to remove unwanted parts from strings in a DataFrame column.

Data looks like:

    time    result
1    09:00   +52A
2    10:00   +62B
3    11:00   +44a
4    12:00   +30b
5    13:00   -110a

I need to trim these data to:

    time    result
1    09:00   52
2    10:00   62
3    11:00   44
4    12:00   30
5    13:00   110

I tried .str.lstrip('+-') and .str.rstrip('aAbBcC'), but got an error:

TypeError: wrapper() takes exactly 1 argument (2 given)

Any pointers would be greatly appreciated!


Answer 0

data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))

Answer 1


How do I remove unwanted parts from strings in a column?

6 years after the original question was posted, pandas now has a good number of “vectorised” string functions that can succinctly perform these string manipulation operations.

This answer will explore some of these string functions, suggest faster alternatives, and go into a timings comparison at the end.


.str.replace

Specify the substring/pattern to match, and the substring to replace it with.

pd.__version__
# '0.24.1'

df    
    time result
1  09:00   +52A
2  10:00   +62B
3  11:00   +44a
4  12:00   +30b
5  13:00  -110a

df['result'] = df['result'].str.replace(r'\D', '')
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

If you need the result converted to an integer, you can use Series.astype,

df['result'] = df['result'].str.replace(r'\D', '').astype(int)

df.dtypes
time      object
result     int64
dtype: object

If you don’t want to modify df in-place, use DataFrame.assign:

df2 = df.assign(result=df['result'].str.replace(r'\D', ''))
df
# Unchanged

.str.extract

Useful for extracting the substring(s) you want to keep.

df['result'] = df['result'].str.extract(r'(\d+)', expand=False)
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

With extract, it is necessary to specify at least one capture group. expand=False will return a Series with the captured items from the first capture group.


.str.split and .str.get

Splitting works assuming all your strings follow this consistent structure.

# df['result'] = df['result'].str.split(r'\D').str[1]
df['result'] = df['result'].str.split(r'\D').str.get(1)
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

Not recommended if you are looking for a general solution.


If you are satisfied with the succinct and readable str accessor-based solutions above, you can stop here. However, if you are interested in faster, more performant alternatives, keep reading.


Optimizing: List Comprehensions

In some circumstances, list comprehensions should be favoured over pandas string functions. The reason is that string functions are inherently hard to vectorize (in the true sense of the word), so most string and regex functions are only wrappers around loops, with more overhead.

My write-up, Are for-loops in pandas really bad? When should I care?, goes into greater detail.

The str.replace option can be re-written using re.sub

import re

# Pre-compile your regex pattern for more performance.
p = re.compile(r'\D')
df['result'] = [p.sub('', x) for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

The str.extract example can be re-written using a list comprehension with re.search,

p = re.compile(r'\d+')
df['result'] = [p.search(x)[0] for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

If NaNs or no-matches are a possibility, you will need to re-write the above to include some error checking. I do this using a function.

def try_extract(pattern, string):
    try:
        m = pattern.search(string)
        return m.group(0)
    except (TypeError, ValueError, AttributeError):
        return np.nan

p = re.compile(r'\d+')
df['result'] = [try_extract(p, x) for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

We can also re-write @eumiro’s and @MonkeyButter’s answers using list comprehensions:

df['result'] = [x.lstrip('+-').rstrip('aAbBcC') for x in df['result']]

And,

df['result'] = [x[1:-1] for x in df['result']]

Same rules for handling NaNs, etc, apply.


Performance Comparison

Graphs generated using perfplot. Full code listing, for your reference. The relevant functions are listed below.

Some of these comparisons are unfair because they take advantage of the structure of OP’s data, but take from it what you will. One thing to note is that every list comprehension function is either faster than or comparable to its equivalent pandas variant.

Functions

def eumiro(df):
    return df.assign(
        result=df['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC')))

def coder375(df):
    return df.assign(
        result=df['result'].replace(r'\D', r'', regex=True))

def monkeybutter(df):
    return df.assign(result=df['result'].map(lambda x: x[1:-1]))

def wes(df):
    return df.assign(result=df['result'].str.lstrip('+-').str.rstrip('aAbBcC'))

def cs1(df):
    return df.assign(result=df['result'].str.replace(r'\D', ''))

def cs2_ted(df):
    # `str.extract` based solution, similar to @Ted Petrou's. so timing together.
    return df.assign(result=df['result'].str.extract(r'(\d+)', expand=False))
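
# Note: p1 and p2 in the list-comprehension functions below are assumed to be
# the patterns compiled earlier: p1 = re.compile(r'\D'), p2 = re.compile(r'\d+').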

def cs1_listcomp(df):
    return df.assign(result=[p1.sub('', x) for x in df['result']])

def cs2_listcomp(df):
    return df.assign(result=[p2.search(x)[0] for x in df['result']])

def cs_eumiro_listcomp(df):
    return df.assign(
        result=[x.lstrip('+-').rstrip('aAbBcC') for x in df['result']])

def cs_mb_listcomp(df):
    return df.assign(result=[x[1:-1] for x in df['result']])

Answer 2


I’d use the pandas replace function; it’s very simple and powerful, as you can use regex. Below I’m using the regex \D to remove any non-digit characters, but obviously you could get quite creative with regex.

data['result'].replace(regex=True,inplace=True,to_replace=r'\D',value=r'')

Answer 3


In the particular case where you know the number of positions that you want to remove from the dataframe column, you can use string indexing inside a lambda function to get rid of those parts:

Last character:

data['result'] = data['result'].map(lambda x: str(x)[:-1])

First two characters:

data['result'] = data['result'].map(lambda x: str(x)[2:])

Answer 4


There’s a bug here: currently you cannot pass arguments to str.lstrip and str.rstrip:

http://github.com/pydata/pandas/issues/2411

EDIT: 2012-12-07 this works now on the dev branch:

In [8]: df['result'].str.lstrip('+-').str.rstrip('aAbBcC')
Out[8]: 
1     52
2     62
3     44
4     30
5    110
Name: result

Answer 5


A very simple method would be to use the extract method to select all the digits. Simply supply it the regular expression '\d+' which extracts any number of digits.

df['result'] = df.result.str.extract(r'(\d+)', expand=True).astype(int)
df

    time  result
1  09:00      52
2  10:00      62
3  11:00      44
4  12:00      30
5  13:00     110

Answer 6


I often use list comprehensions for these types of tasks because they’re often faster.

There can be big differences in performance between the various methods for doing things like this (i.e. modifying every element of a series within a DataFrame). Often a list comprehension can be the fastest – see the code race below for this task:

import pandas as pd
#Map
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))
10000 loops, best of 3: 187 µs per loop
#List comprehension
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = [x.lstrip('+-').rstrip('aAbBcC') for x in data['result']]
10000 loops, best of 3: 117 µs per loop
#.str
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].str.lstrip('+-').str.rstrip('aAbBcC')
1000 loops, best of 3: 336 µs per loop

Answer 7


Suppose your DF has those extra characters in between the numbers as well, as in the last entry.

  result   time
0   +52A  09:00
1   +62B  10:00
2   +44a  11:00
3   +30b  12:00
4  -110a  13:00
5   3+b0  14:00

You can try str.replace to remove characters not only from start and end but also from in between.

DF['result'] = DF['result'].str.replace('\+|a|b|\-|A|B', '')

Output:

  result   time
0     52  09:00
1     62  10:00
2     44  11:00
3     30  12:00
4    110  13:00
5     30  14:00

Answer 8


Try this using a regular expression:

import re
data['result'] = data['result'].map(lambda x: re.sub('[-+A-Za-z]', '', x))

How do you get a directory listing sorted by creation date in Python?

Question: How do you get a directory listing sorted by creation date in Python?


What is the best way to get a list of all files in a directory, sorted by date [created | modified], using python, on a windows machine?


Answer 0


Update: to sort dirpath‘s entries by modification date in Python 3:

import os
from pathlib import Path

paths = sorted(Path(dirpath).iterdir(), key=os.path.getmtime)

(put @Pygirl’s answer here for greater visibility)

If you already have a list of filenames files, then to sort it inplace by creation time on Windows:

files.sort(key=os.path.getctime)

The list of files you could get, for example, using glob as shown in @Jay’s answer.


Old answer: here’s a more verbose version of @Greg Hewgill’s answer. It is the most conforming to the question’s requirements: it makes a distinction between creation and modification dates (at least on Windows).

#!/usr/bin/env python
from stat import S_ISREG, ST_CTIME, ST_MODE
import os, sys, time

# path to the directory (relative or absolute)
dirpath = sys.argv[1] if len(sys.argv) == 2 else r'.'

# get all entries in the directory w/ stats
entries = (os.path.join(dirpath, fn) for fn in os.listdir(dirpath))
entries = ((os.stat(path), path) for path in entries)

# leave only regular files, insert creation date
entries = ((stat[ST_CTIME], path)
           for stat, path in entries if S_ISREG(stat[ST_MODE]))
#NOTE: on Windows `ST_CTIME` is a creation date 
#  but on Unix it could be something else
#NOTE: use `ST_MTIME` to sort by a modification date

for cdate, path in sorted(entries):
    print time.ctime(cdate), os.path.basename(path)

Example:

$ python stat_creation_date.py
Thu Feb 11 13:31:07 2009 stat_creation_date.py

Answer 1


I’ve done this in the past for a Python script to determine the last updated files in a directory:

import glob
import os

search_dir = "/mydir/"
# remove anything from the list that is not a file (directories, symlinks)
# thanks to J.F. Sebastion for pointing out that the requirement was a list 
# of files (presumably not including directories)  
files = list(filter(os.path.isfile, glob.glob(search_dir + "*")))
files.sort(key=lambda x: os.path.getmtime(x))

That should do what you’re looking for based on file mtime.

EDIT: Note that you can also use os.listdir() in place of glob.glob() if desired – the reason I used glob in my original code was that I wanted to use glob to only search for files with a particular set of file extensions, which glob() was better suited to. To use listdir, here’s what it would look like:

import os

search_dir = "/mydir/"
os.chdir(search_dir)
files = filter(os.path.isfile, os.listdir(search_dir))
files = [os.path.join(search_dir, f) for f in files] # add path to each file
files.sort(key=lambda x: os.path.getmtime(x))

Answer 2


There is an os.path.getmtime function that gives the number of seconds since the epoch and should be faster than os.stat.

import os 

os.chdir(directory)
sorted(filter(os.path.isfile, os.listdir('.')), key=os.path.getmtime)

Answer 3


Here’s my version:

def getfiles(dirpath):
    a = [s for s in os.listdir(dirpath)
         if os.path.isfile(os.path.join(dirpath, s))]
    a.sort(key=lambda s: os.path.getmtime(os.path.join(dirpath, s)))
    return a

First, we build a list of the file names. isfile() is used to skip directories; it can be omitted if directories should be included. Then, we sort the list in place, using the modification date as the key.


Answer 4

Here’s a one-liner:

import os
import time
from pprint import pprint

pprint([(x[0], time.ctime(x[1].st_ctime)) for x in sorted([(fn, os.stat(fn)) for fn in os.listdir(".")], key = lambda x: x[1].st_ctime)])

This calls os.listdir() to get a list of the filenames, then calls os.stat() for each one to get the creation time, then sorts against the creation time.

Note that this method only calls os.stat() once for each file, which will be more efficient than calling it for each comparison in a sort.


Answer 5

Without changing directory:

import os    

path = '/path/to/files/'
name_list = os.listdir(path)
full_list = [os.path.join(path,i) for i in name_list]
time_sorted_list = sorted(full_list, key=os.path.getmtime)

print time_sorted_list

# if you want just the filenames sorted, simply remove the dir from each
sorted_filename_list = [ os.path.basename(i) for i in time_sorted_list]
print sorted_filename_list

Answer 6

In Python 3.5+:

from pathlib import Path
sorted(Path('.').iterdir(), key=lambda f: f.stat().st_mtime)

Answer 7

Here’s my answer using glob, without a filter, if you want to read files with a certain extension in date order (Python 3).

import glob
import os

dataset_path = '/mydir/'
files = glob.glob(dataset_path + "/morepath/*.extension")
files.sort(key=os.path.getmtime)

Answer 8

# *** the shortest and best way ***
# getmtime --> sort by modified time
# getctime --> sort by created time

import glob,os

lst_files = glob.glob("*.txt")
lst_files.sort(key=os.path.getmtime)
print("\n".join(lst_files))

Answer 9

sorted(filter(os.path.isfile, os.listdir('.')), 
    key=lambda p: os.stat(p).st_mtime)

You could use os.walk('.').next()[-1] instead of filtering with os.path.isfile, but that leaves dead symlinks in the list, and os.stat will fail on them.


Answer 10

from pathlib import Path
import os

sorted(Path('./').iterdir(), key=lambda t: t.stat().st_mtime)

or

sorted(Path('./').iterdir(), key=os.path.getmtime)

or

sorted(os.scandir('./'), key=lambda t: t.stat().st_mtime)

where mtime is the modification time.


Answer 11

This is a basic example to learn from:

import os, sys
import time

dirpath = sys.argv[1] if len(sys.argv) == 2 else r'.'

for name in os.listdir(dirpath):
    # resolve each entry to an absolute path before stat-ing it
    path = os.path.realpath(os.path.join(dirpath, name))
    st = os.stat(path)
    print time.ctime(st.st_ctime), path

Answer 12

Alex Coventry’s answer will produce an exception if the file is a symlink to a nonexistent file; the following code corrects that answer:

import os
import time
from datetime import datetime

sorted(filter(os.path.isfile, os.listdir('.')),
       key=lambda p: os.path.exists(p) and os.stat(p).st_mtime
                     or time.mktime(datetime.now().timetuple()))

When the file doesn’t exist, now() is used, and the symlink will go at the very end of the list.


Answer 13

Here are a couple of simple lines that filter by extension as well as provide a sort option:

import os
import re

def get_sorted_files(src_dir, regex_ext='*', sort_reverse=False): 
    files_to_evaluate = [os.path.join(src_dir, f) for f in os.listdir(src_dir) if re.search(r'.*\.({})$'.format(regex_ext), f)]
    files_to_evaluate.sort(key=os.path.getmtime, reverse=sort_reverse)
    return files_to_evaluate

Answer 14

For completeness, with os.scandir (about 2x faster than pathlib):

import os
sorted(os.scandir('/tmp/test'), key=lambda d: d.stat().st_mtime)

Answer 15

This is my version:

import os

folder_path = r'D:\Movies\extra\new\dramas' # your path
os.chdir(folder_path) # make the path active
x = sorted(os.listdir(), key=os.path.getctime)  # sorted using creation time

for name in x:
    print(name) # print every entry name inside folder_path

Answer 16

Maybe you should use shell commands. In Unix/Linux, find piped into sort will probably be able to do what you want.
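
For example, here is a hedged sketch using GNU find (the -printf action is a GNU extension, not POSIX): it prints each regular file prefixed with its modification timestamp, sorts numerically, then strips the timestamp.

find . -maxdepth 1 -type f -printf '%T@ %p\n' | sort -n | cut -d' ' -f2-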


Randomly select 50 items from a list to write to a file

Question: Randomly select 50 items from a list to write to a file

So far I have figured out how to import the file, create new files, and randomize the list.

I’m having trouble selecting only 50 items from the list randomly to write to a file?

import os
import random

def randomizer(input,output1='random_1.txt',output2='random_2.txt',output3='random_3.txt',output4='random_total.txt'):

#Input file 
    query=open(input,'r').read().split()
    dir,file=os.path.split(input)

    temp1 = os.path.join(dir,output1)
    temp2 = os.path.join(dir,output2)
    temp3 = os.path.join(dir,output3)
    temp4 = os.path.join(dir,output4)


    out_file4=open(temp4,'w')

    random.shuffle(query)

    for item in query:
        out_file4.write(item+'\n')   

So if the total randomization file was

example:

random_total = ['9','2','3','1','5','6','8','7','0','4']

I would want 3 files (out_file1|2|3) with the first random set of 3, second random set of 3, and third random set of 3 (for this example, but the one I want to create should have 50)

random_1 = ['9','2','3']
random_2 = ['1','5','6']
random_3 = ['8','7','0']

So the last ‘4’ will not be included which is fine.

How can I select 50 from the list that I randomized ?

Even better, how could I select 50 at random from the original list ?


Answer 0

If the list is in random order, you can just take the first 50.

Otherwise, use

import random
random.sample(the_list, 50)

random.sample help text:

sample(self, population, k) method of random.Random instance
    Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    To choose a sample in a range of integers, use xrange as an argument.
    This is especially fast and space efficient for sampling from a
    large population:   sample(xrange(10000000), 60)
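
For the three-files part of the question, here is a hedged sketch (the file names and the 150-item draw are illustrative assumptions): draw one sample of 150 so the three sets are disjoint, then slice it into chunks of 50.

import random

picks = random.sample(the_list, 150)  # assumes the_list has at least 150 items
for n in range(3):
    chunk = picks[50 * n:50 * (n + 1)]
    with open('random_%d.txt' % (n + 1), 'w') as f:
        f.write('\n'.join(chunk) + '\n')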

Answer 1

One easy way to select random items is to shuffle then slice.

import random
a = [1,2,3,4,5,6,7,8,9]
random.shuffle(a)
print a[:4] # prints 4 random items

Answer 2

I think numpy.random.choice() is a better option:

import numpy as np

mylist = [13,23,14,52,6,23]

np.random.choice(mylist, 3, replace=False)

The function returns an array of 3 randomly chosen values from the list.


Answer 3

Say your list has 100 elements and you want to pick 50 of them at random. (Note that choice picks with replacement, so the same element can come up more than once; use random.sample if the 50 picks must be distinct.) Here are the steps to follow:

  1. Import the libraries
  2. Create the seed for random number generator, I have put it at 2
  3. Prepare a list of numbers from which to pick up in a random way
  4. Make the random choices from the numbers list

Code:

from random import seed
from random import choice

seed(2)
numbers = [i for i in range(100)]

print(numbers)

for _ in range(50):
    selection = choice(numbers)
    print(selection)

Initialize a numpy array

Question: Initialize a numpy array

Is there a way to initialize a numpy array of a given shape and add to it? I will explain what I need with a list example. If I want to create a list of objects generated in a loop, I can do:

a = []
for i in range(5):
    a.append(i)

I want to do something similar with a numpy array. I know about vstack, concatenate etc. However, it seems these require two numpy arrays as inputs. What I need is:

big_array # Initially empty. This is where I don't know what to specify
for i in range(5):
    array i of shape = (2,4) created.
    add to big_array

The big_array should have a shape (10,4). How to do this?


EDIT:

I want to add the following clarification. I am aware that I can define big_array = numpy.zeros((10,4)) and then fill it up. However, this requires specifying the size of big_array in advance. I know the size in this case, but what if I do not? When we use the .append function for extending the list in python, we don’t need to know its final size in advance. I am wondering if something similar exists for creating a bigger array from smaller arrays, starting with an empty array.


Answer 0

numpy.zeros

Return a new array of given shape and type, filled with zeros.

or

numpy.ones

Return a new array of given shape and type, filled with ones.

or

numpy.empty

Return a new array of given shape and type, without initializing entries.


However, the approach of constructing an array by appending elements to a list is not much used in numpy, because it’s less efficient (numpy datatypes are much closer to the underlying C arrays). Instead, you should preallocate the array to the size that you need, and then fill in the rows. You can use numpy.append if you must, though.
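
As a hedged sketch of that preallocate-then-fill approach (the shapes are illustrative, matching the question):

import numpy

big_array = numpy.zeros((10, 4))        # preallocate the full result
for i in range(5):
    block = i * numpy.ones((2, 4))      # stand-in for each generated (2, 4) array
    big_array[2 * i:2 * (i + 1), :] = block  # fill two rows at a time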


Answer 1

The way I usually do that is by creating a regular list, then append my stuff into it, and finally transform the list to a numpy array as follows :

import numpy as np
big_array = [] #  empty regular list
for i in range(5):
    arr = i*np.ones((2,4)) # for instance
    big_array.append(arr)
big_np_array = np.array(big_array)  # transformed to a numpy array

Of course, the final object takes twice the space in memory during the creation step, but appending to a Python list is very fast, and so is creating the array with np.array().


Answer 2

Introduced in numpy 1.8:

numpy.full

Return a new array of given shape and type, filled with fill_value.

Examples:

>>> import numpy as np
>>> np.full((2, 2), np.inf)
array([[ inf,  inf],
       [ inf,  inf]])
>>> np.full((2, 2), 10)
array([[10, 10],
       [10, 10]])

Answer 3

The array analogue of Python’s

a = []
for i in range(5):
    a.append(i)

is:

import numpy as np

a = np.empty((0))
for i in range(5):
    a = np.append(a, i)

Answer 4

numpy.fromiter() is what you are looking for:

big_array = numpy.fromiter(xrange(5), dtype="int")

It also works with generator expressions, e.g.:

big_array = numpy.fromiter( (i*(i+1)/2 for i in xrange(5)), dtype="int" )

If you know the length of the array in advance, you can specify it with an optional ‘count’ argument.
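
For instance, with the length known in advance (the values are illustrative):

big_array = numpy.fromiter(xrange(5), dtype="int", count=5)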


Answer 5

You do want to avoid explicit loops as much as possible when doing array computing, as that reduces the speed gain from that form of computing. There are multiple ways to initialize a numpy array. If you want it filled with zeros, do as katrielalex said:

big_array = numpy.zeros((10,4))

EDIT: What sort of sequence is it you’re making? You should check out the different numpy functions that create arrays, like numpy.linspace(start, stop, size) (equally spaced numbers), or numpy.arange(start, stop, inc). Where possible, these functions will make arrays substantially faster than doing the same work in explicit loops.
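
A quick illustration of those helpers:

import numpy

numpy.linspace(0.0, 1.0, 5)  # array([0.  , 0.25, 0.5 , 0.75, 1.  ])
numpy.arange(0, 10, 2)       # array([0, 2, 4, 6, 8])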


Answer 6

For your first array example use,

a = numpy.arange(5)

To initialize big_array, use

big_array = numpy.zeros((10,4))

This assumes you want to initialize with zeros, which is pretty typical, but there are many other ways to initialize an array in numpy.

Edit: If you don’t know the size of big_array in advance, it’s generally best to first build a Python list using append, and when you have everything collected in the list, convert this list to a numpy array using numpy.array(mylist). The reason for this is that lists are meant to grow very efficiently and quickly, whereas numpy.concatenate would be very inefficient since numpy arrays don’t change size easily. But once everything is collected in a list, and you know the final array size, a numpy array can be efficiently constructed.


Answer 7

To initialize a numpy array with a specific matrix:

import numpy as np

mat = np.array([[1, 1, 0, 0, 0],
                [0, 1, 0, 0, 1],
                [1, 0, 0, 1, 1],
                [0, 0, 0, 0, 0],
                [1, 0, 1, 0, 1]])

print mat.shape
print mat

output:

(5, 5)
[[1 1 0 0 0]
 [0 1 0 0 1]
 [1 0 0 1 1]
 [0 0 0 0 0]
 [1 0 1 0 1]]

Answer 8

Whenever you are in the following situation:

a = []
for i in range(5):
    a.append(i)

and you want something similar in numpy, several previous answers have pointed out ways to do it, but as @katrielalex pointed out these methods are not efficient. The efficient way to do this is to build a long list and then reshape it the way you want after you have a long list. For example, let’s say I am reading some lines from a file and each row has a list of numbers and I want to build a numpy array of shape (number of lines read, length of vector in each row). Here is how I would do it more efficiently:

import numpy as np

long_list = []
counter = 0
with open('filename', 'r') as f:
    for row in f:
        row_list = row.split()
        long_list.extend(row_list)
        counter += 1
#  now we have a long list and we are ready to reshape
result = np.array(long_list).reshape(counter, len(row_list)) #  desired numpy array

Answer 9

I realize that this is a bit late, but I did not notice any of the other answers mentioning indexing into the empty array:

import numpy

big_array = numpy.empty((10, 4))
for i in range(5):
    array_i = numpy.random.random((2, 4))
    big_array[2 * i:2 * (i + 1), :] = array_i

This way, you preallocate the entire result array with numpy.empty and fill in the rows as you go using indexed assignment.

It is perfectly safe to preallocate with empty instead of zeros in the example you gave since you are guaranteeing that the entire array will be filled with the chunks you generate.


Answer 10

I’d suggest defining the shape first, then iterating over it to insert values:

import numpy as np

big_array = np.zeros(shape=(6, 2))
for it in range(6):
    big_array[it] = (it, it) # For example

>>> big_array

array([[ 0.,  0.],
       [ 1.,  1.],
       [ 2.,  2.],
       [ 3.,  3.],
       [ 4.,  4.],
       [ 5.,  5.]])

Answer 11

Maybe something like this will fit your needs:

import numpy as np

N = 5
res = []

for i in range(N):
    res.append(np.cumsum(np.ones(shape=(2,4))))

res = np.array(res).reshape((10, 4))
print(res)

Which produces the following output

[[ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]]

Python: extract a pattern match

Question: Python: extract a pattern match

Python 2.7.1. I am trying to use a Python regular expression to extract words inside of a pattern.

I have some string that looks like this

someline abc
someother line
name my_user_name is valid
some more lines

I want to extract the word “my_user_name”. I do something like

import re
s = #that big string
p = re.compile("name .* is valid", re.flags)
p.match(s) #this gives me <_sre.SRE_Match object at 0x026B6838>

How do I extract my_user_name now?


Answer 0

You need to capture from the regex: search for the pattern and, if found, retrieve the string using group(index). Assuming valid checks are performed:

>>> p = re.compile("name (.*) is valid")
>>> result = p.search(s)
>>> result
<_sre.SRE_Match object at 0x10555e738>
>>> result.group(1)     # group(1) will return the 1st capture (stuff within the brackets).
                        # group(0) will return the entire matched text.
'my_user_name'

Answer 1

You can use matching groups:

p = re.compile('name (.*) is valid')

e.g.

>>> import re
>>> p = re.compile('name (.*) is valid')
>>> s = """
... someline abc
... someother line
... name my_user_name is valid
... some more lines"""
>>> p.findall(s)
['my_user_name']

Here I use re.findall rather than re.search to get all instances of my_user_name. Using re.search, you’d need to get the data from the group on the match object:

>>> p.search(s)   #gives a match object or None if no match is found
<_sre.SRE_Match object at 0xf5c60>
>>> p.search(s).group() #entire string that matched
'name my_user_name is valid'
>>> p.search(s).group(1) #first group that match in the string that matched
'my_user_name'

As mentioned in the comments, you might want to make your regex non-greedy:

p = re.compile('name (.*?) is valid')

to only pick up the stuff between 'name ' and the next ' is valid' (rather than letting your regex pick up another ' is valid' into your group).
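
To see the difference, here is a small illustration on a made-up string that contains two possible matches:

import re

s = 'name alice is valid and name bob is valid'
re.search('name (.*) is valid', s).group(1)   # greedy: 'alice is valid and name bob'
re.search('name (.*?) is valid', s).group(1)  # non-greedy: 'alice'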


Answer 2

You could use something like this:

import re
s = #that big string
# the parenthesis create a group with what was matched
# and '\w' matches only alphanumeric charactes
p = re.compile("name +(\w+) +is valid", re.flags)
# use search(), so the match doesn't have to happen 
# at the beginning of "big string"
m = p.search(s)
# search() returns a Match object with information about what was matched
if m:
    name = m.group(1)
else:
    raise Exception('name not found')

Answer 3

Maybe that’s a bit shorter and easier to understand:

>>> import re
>>> text = '... someline abc... someother line... name my_user_name is valid.. some more lines'
>>> re.search('name (.*) is valid', text).group(1)
'my_user_name'

Answer 4

You want a capture group.

p = re.compile("name (.*) is valid", re.flags) # parentheses for capture groups
print p.match(s).groups() # This gives you a tuple of your matches.

Answer 5

You can use groups (indicated with '(' and ')') to capture parts of the string. The match object’s group() method then gives you the group’s contents:

>>> import re
>>> s = 'name my_user_name is valid'
>>> match = re.search('name (.*) is valid', s)
>>> match.group(0)  # the entire match
'name my_user_name is valid'
>>> match.group(1)  # the first parenthesized subgroup
'my_user_name'

In Python 3.6+ you can also index into a match object instead of using group():

>>> match[0]  # the entire match 
'name my_user_name is valid'
>>> match[1]  # the first parenthesized subgroup
'my_user_name'

Answer 6

Here’s a way to do it without using groups (Python 3.6 or above):

>>> re.search('2\d\d\d[01]\d[0-3]\d', 'report_20191207.xml')[0]
'20191207'

Answer 7

You can also use a capture group (?P<user>pattern) and access the group like a dictionary match['user'].

import re

string = '''someline abc\n
            someother line\n
            name my_user_name is valid\n
            some more lines\n'''

pattern = r'name (?P<user>.*) is valid'
matches = re.search(pattern, str(string), re.DOTALL)
print(matches['user'])

# my_user_name

Answer 8

It seems like you’re actually trying to extract a name rather than simply find a match. If this is the case, having span indexes for your match is helpful, and I’d recommend using re.finditer. As a shortcut, you know the name part of your regex is length 5 and the is valid part is length 9, so you can slice the matching text to extract the name.

Note – In your example, it looks like s is a string with line breaks, so that’s what’s assumed below.

import re

## convert s to a list of strings separated by line:
s2 = s.splitlines()

## find matches by line: 
for i, j in enumerate(s2):
    matches = re.finditer("name (.*) is valid", j)
    ## ignore lines without a match
    if matches:
        ## loop through match group elements
        for k in matches:
            ## get text
            match_txt = k.group(0)
            ## get line span
            match_span = k.span(0)
            ## extract username
            my_user_name = match_txt[5:-9]
            ## compare with original text
            print(f'Extracted Username: {my_user_name} - found on line {i}')
            print('Match Text:', match_txt)

Anaconda: export an environment file

Question: Anaconda: export an environment file

How can I make an anaconda environment file which could be used on other computers?

I exported my anaconda python environment to YML using conda env export > environment.yml. The exported environment.yml contains the line prefix: /home/superdev/miniconda3/envs/juicyenv, which maps to my anaconda’s location and will be different on other PCs.


Answer 0

I can’t find anything in the conda specs which allows you to export an environment file without the prefix: ... line. However, as Alex pointed out in the comments, conda doesn’t seem to care about the prefix line when creating an environment from a file.

With that in mind, if you want the other user to have no knowledge of your default install path, you can remove the prefix line with grep before writing to environment.yml.

conda env export | grep -v "^prefix: " > environment.yml

Either way, the other user then runs:

conda env create -f environment.yml

and the environment will get installed in their default conda environment path.

If you want to specify a different install path than the default for your system (not related to ‘prefix’ in the environment.yml), just use the -p flag followed by the required path.

conda env create -f environment.yml -p /home/user/anaconda3/envs/env_name

Note that Conda recommends creating the environment.yml by hand, which is especially important if you are wanting to share your environment across platforms (Windows/Linux/Mac). In this case, you can just leave out the prefix line.
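
For example, a minimal hand-written environment.yml might look like this (the name and version pins are illustrative assumptions):

name: shared_env
channels:
  - defaults
dependencies:
  - python=3.8
  - numpy
  - pip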


Answer 1

The easiest way to save the packages from an environment to be installed in another computer is:

$ conda list -e > req.txt

then you can install the environment using

$ conda create -n new_environment --file req.txt

If you use pip, use the following commands (reference: https://pip.pypa.io/en/stable/reference/pip_freeze/):

$ env1/bin/pip freeze > requirements.txt
$ env2/bin/pip install -r requirements.txt

Answer 2

  • Linux

    conda env export --no-builds | grep -v "prefix" > environment.yml

  • Windows

    conda env export --no-builds | findstr -v "prefix" > environment.yml


Rationale: By default, conda env export includes the build information:

$ conda env export
...
dependencies:
  - backcall=0.1.0=py37_0
  - blas=1.0=mkl
  - boto=2.49.0=py_0
...

You can instead export your environment without build info:

$ conda env export --no-builds
...
dependencies:
  - backcall=0.1.0
  - blas=1.0
  - boto=2.49.0
...

Which unties the environment from the Python version and OS.


Answer 3

I find that exporting the packages in plain string format is more portable than exporting the whole conda environment. As the previous answer already suggested:

$ conda list -e > requirements.txt

However, this requirements.txt contains build numbers, which are not portable between operating systems, e.g. between Mac and Ubuntu. conda env export offers the --no-builds option but conda list -e does not, so we can remove the build numbers by issuing the following command:

$ sed -i -E "s/^(.*\=.*)(\=.*)/\1/" requirements.txt 

And recreate the environment on another computer:

conda create -n recreated_env --file requirements.txt 
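
For illustration, the sed command turns build-pinned entries into portable ones (these package lines are made up):

numpy=1.16.4=py37h99e49ec_0    ->    numpy=1.16.4
pandas=0.24.2=py37he6710b0_0   ->    pandas=0.24.2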

Answer 4

  1. First activate your conda environment (the one you want to export/back up):

conda activate myEnv

  2. Export all packages to a file (myEnvBkp.txt):

conda list --explicit > myEnvBkp.txt

  3. Restore/import the environment:

conda create --name myEnvRestored --file myEnvBkp.txt