Tag Archives: Python

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)

Question: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)


I’m using NLTK to perform kmeans clustering on my text file in which each line is considered as a document. So for example, my text file is something like this:

belong finger death punch
hasty
mike hasty walls jericho
jägermeister rules
rules bands follow performing jägermeister stage
approach 

Now the demo code I’m trying to run is this:

import sys

import numpy
from nltk.cluster import KMeansClusterer, GAAClusterer, euclidean_distance
import nltk.corpus
from nltk import decorators
import nltk.stem

stemmer_func = nltk.stem.EnglishStemmer().stem
stopwords = set(nltk.corpus.stopwords.words('english'))

@decorators.memoize
def normalize_word(word):
    return stemmer_func(word.lower())

def get_words(titles):
    words = set()
    for title in job_titles:
        for word in title.split():
            words.add(normalize_word(word))
    return list(words)

@decorators.memoize
def vectorspaced(title):
    title_components = [normalize_word(word) for word in title.split()]
    return numpy.array([
        word in title_components and not word in stopwords
        for word in words], numpy.short)

if __name__ == '__main__':

    filename = 'example.txt'
    if len(sys.argv) == 2:
        filename = sys.argv[1]

    with open(filename) as title_file:

        job_titles = [line.strip() for line in title_file.readlines()]

        words = get_words(job_titles)

        # cluster = KMeansClusterer(5, euclidean_distance)
        cluster = GAAClusterer(5)
        cluster.cluster([vectorspaced(title) for title in job_titles if title])

        # NOTE: This is inefficient, cluster.classify should really just be
        # called when you are classifying previously unseen examples!
        classified_examples = [
                cluster.classify(vectorspaced(title)) for title in job_titles
            ]

        for cluster_id, title in sorted(zip(classified_examples, job_titles)):
            print cluster_id, title

(which can also be found here)

The error I receive is this:

Traceback (most recent call last):
File "cluster_example.py", line 40, in
words = get_words(job_titles)
File "cluster_example.py", line 20, in get_words
words.add(normalize_word(word))
File "", line 1, in
File "/usr/local/lib/python2.7/dist-packages/nltk/decorators.py", line 183, in memoize
result = func(*args)
File "cluster_example.py", line 14, in normalize_word
return stemmer_func(word.lower())
File "/usr/local/lib/python2.7/dist-packages/nltk/stem/snowball.py", line 694, in stem
word = (word.replace(u"\u2019", u"\x27")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)

What is happening here?


Answer 0


The file is being read as a bunch of strs, but it should be unicodes. Python tries to implicitly convert, but fails. Change:

job_titles = [line.strip() for line in title_file.readlines()]

to explicitly decode the strs to unicode (here assuming UTF-8):

job_titles = [line.decode('utf-8').strip() for line in title_file.readlines()]

It could also be solved by importing the codecs module and using codecs.open rather than the built-in open.
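
For illustration, a minimal sketch of the codecs.open alternative (assuming the same filename variable as in the question):

import codecs

# codecs.open decodes each line to unicode as it is read,
# so no per-line .decode('utf-8') is needed afterwards.
with codecs.open(filename, encoding='utf-8') as title_file:
    job_titles = [line.strip() for line in title_file.readlines()]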


Answer 1


This works fine for me.

f = open(file_path, 'r+', encoding="utf-8")

You can add a third parameter, encoding, to ensure the encoding type is 'utf-8'.

Note: this method works fine in Python 3; I did not try it in Python 2.7.


Answer 2


For me there was a problem with the terminal encoding. Adding UTF-8 to .bashrc solved the problem:

export LC_CTYPE=en_US.UTF-8

Don’t forget to reload .bashrc afterwards:

source ~/.bashrc

Answer 3


You can try this also:

import sys
reload(sys)
sys.setdefaultencoding('utf8')

Answer 4


On Ubuntu 18.04 using Python 3.6, I solved the problem by doing both:

with open(filename, encoding="utf-8") as lines:

and if you are running the tool as command line:

export LC_ALL=C.UTF-8

Note that if you are on Python 2.7, you have to handle this differently. First you have to set the default encoding:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

and then to load the file you must use io.open to set the encoding:

import io
with io.open(filename, 'r', encoding='utf-8') as lines:

You still need to export the env

export LC_ALL=C.UTF-8

Answer 5


I got this error when trying to install a python package in a Docker container. For me, the issue was that the docker image did not have a locale configured. Adding the following code to the Dockerfile solved the problem for me.

# Avoid ascii errors when reading files in Python
RUN apt-get install -y locales && locale-gen en_US.UTF-8
ENV LANG='en_US.UTF-8' LANGUAGE='en_US:en' LC_ALL='en_US.UTF-8'

Answer 6


To find ANY and ALL unicode-error-related content, use the following command:

grep -r -P '[^\x00-\x7f]' /etc/apache2 /etc/letsencrypt /etc/nginx

Found mine in

/etc/letsencrypt/options-ssl-nginx.conf:        # The following CSP directives don't use default-src as 

Using shed, I found the offending sequence. It turned out to be an editor mistake.

00008099:     C2  194 302 11000010
00008100:     A0  160 240 10100000
00008101:  d  64  100 144 01100100
00008102:  e  65  101 145 01100101
00008103:  f  66  102 146 01100110
00008104:  a  61  097 141 01100001
00008105:  u  75  117 165 01110101
00008106:  l  6C  108 154 01101100
00008107:  t  74  116 164 01110100
00008108:  -  2D  045 055 00101101
00008109:  s  73  115 163 01110011
00008110:  r  72  114 162 01110010
00008111:  c  63  099 143 01100011
00008112:     C2  194 302 11000010
00008113:     A0  160 240 10100000

Answer 7


You can try this before using the job_titles string:

source = unicode(job_titles, 'utf-8')
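
Note that job_titles in the question's code is a list, so a hypothetical per-element adaptation (Python 2) would look like:

job_titles = [unicode(title, 'utf-8') for title in job_titles]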

Answer 8


For Python 3, the default encoding is "utf-8". In case of any problem, the following steps are suggested in the base documentation: https://docs.python.org/2/library/csv.html#csv-examples

  1. Create a function

    def utf_8_encoder(unicode_csv_data):
        for line in unicode_csv_data:
            yield line.encode('utf-8')
    
  2. Then use the function inside the reader, e.g.:

    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data))
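
A hypothetical end-to-end sketch (Python 2, following the csv docs recipe linked above); utf_8_encoder is the function from step 1, and unicode_lines stands in for any iterable of unicode strings:

import csv

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

unicode_lines = [u'a,b,c', u'd,e,f']
for row in csv.reader(utf_8_encoder(unicode_lines)):
    print(row)  # ['a', 'b', 'c'], then ['d', 'e', 'f']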
    

Answer 9


Python 3.x or higher

  1. load file in byte stream:

    body = ''
    for lines in open('website/index.html', 'rb'):
        decodedLine = lines.decode('utf-8')
        body = body + decodedLine.strip()
    return body

  2. use global setting:

    import io
    import sys
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')


Answer 10


Use open(fn, 'rb').read().decode('utf-8') instead of just open(fn).read()


How can I read a function's signature including default argument values?

Question: How can I read a function's signature including default argument values?


Given a function object, how can I get its signature? For example, for:

def myMethod(first, second, third='something'):
    pass

I would like to get "myMethod(first, second, third='something')".


Answer 0


import inspect

def foo(a, b, x='blah'):
    pass

print(inspect.getargspec(foo))
# ArgSpec(args=['a', 'b', 'x'], varargs=None, keywords=None, defaults=('blah',))

However, note that inspect.getargspec() is deprecated since Python 3.0.

Python 3.0–3.4 recommends inspect.getfullargspec().

Python 3.5+ recommends inspect.signature().
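
A minimal sketch of the two recommended replacements (names and output come from the standard inspect module):

import inspect

def foo(a, b, x='blah'):
    pass

# Python 3.5+: signature() returns a Signature object whose str() is the parameter list.
print(inspect.signature(foo))
# (a, b, x='blah')

# Python 3.0-3.4: getfullargspec() is the getargspec() analogue.
print(inspect.getfullargspec(foo))
# FullArgSpec(args=['a', 'b', 'x'], varargs=None, varkw=None, defaults=('blah',), ...)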


Answer 1


Arguably the easiest way to find the signature for a function would be help(function):

>>> def function(arg1, arg2="foo", *args, **kwargs): pass
>>> help(function)
Help on function function in module __main__:

function(arg1, arg2='foo', *args, **kwargs)

Also, in Python 3 a method was added to the inspect module called signature, which is designed to represent the signature of a callable object and its return annotation:

>>> from inspect import signature
>>> def foo(a, *, b:int, **kwargs):
...     pass

>>> sig = signature(foo)

>>> str(sig)
'(a, *, b:int, **kwargs)'

>>> str(sig.parameters['b'])
'b:int'

>>> sig.parameters['b'].annotation
<class 'int'>

Answer 2

#! /usr/bin/env python

import inspect
from collections import namedtuple

DefaultArgSpec = namedtuple('DefaultArgSpec', 'has_default default_value')

def _get_default_arg(args, defaults, arg_index):
    """ Method that determines if an argument has default value or not,
    and if yes what is the default value for the argument

    :param args: array of arguments, eg: ['first_arg', 'second_arg', 'third_arg']
    :param defaults: array of default values, eg: (42, 'something')
    :param arg_index: index of the argument in the argument array for which,
    this function checks if a default value exists or not. And if default value
    exists it would return the default value. Example argument: 1
    :return: Tuple of whether there is a default or not, and if yes the default
    value, eg: for index 2 i.e. for "second_arg" this function returns (True, 42)
    """
    if not defaults:
        return DefaultArgSpec(False, None)

    args_with_no_defaults = len(args) - len(defaults)

    if arg_index < args_with_no_defaults:
        return DefaultArgSpec(False, None)
    else:
        value = defaults[arg_index - args_with_no_defaults]
        if (type(value) is str):
            value = '"%s"' % value
        return DefaultArgSpec(True, value)

def get_method_sig(method):
    """ Given a function, it returns a string that pretty much looks how the
    function signature would be written in python.

    :param method: a python method
    :return: A string describing the python method signature.
    eg: "my_method(first_arg, second_arg=42, third_arg='something')"
    """

    # The return value of ArgSpec is a bit weird, as the list of arguments and
    # list of defaults are returned in separate arrays.
    # eg: ArgSpec(args=['first_arg', 'second_arg', 'third_arg'],
    # varargs=None, keywords=None, defaults=(42, 'something'))
    argspec = inspect.getargspec(method)
    arg_index=0
    args = []

    # Use the args and defaults array returned by argspec and find out
    # which arguments has default
    for arg in argspec.args:
        default_arg = _get_default_arg(argspec.args, argspec.defaults, arg_index)
        if default_arg.has_default:
            args.append("%s=%s" % (arg, default_arg.default_value))
        else:
            args.append(arg)
        arg_index += 1
    return "%s(%s)" % (method.__name__, ", ".join(args))


if __name__ == '__main__':
    def my_method(first_arg, second_arg=42, third_arg='something'):
        pass

    print get_method_sig(my_method)
    # my_method(first_arg, second_arg=42, third_arg="something")

Answer 3


Try calling help on an object to find out about it.

>>> foo = [1, 2, 3]
>>> help(foo.append)
Help on built-in function append:

append(...)
    L.append(object) -- append object to end

Answer 4


Maybe a bit late to the party, but if you also want to keep the order of the arguments and their defaults, then you can use the Abstract Syntax Tree module (ast).

Here’s a proof of concept (beware: the code to sort the arguments and match them to their defaults can definitely be improved/made clearer):

import ast
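# Note: 'module' below is assumed to be an ast.Module obtained beforehand,
# e.g. module = ast.parse(source_text) for some source string.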

for class_ in [c for c in module.body if isinstance(c, ast.ClassDef)]:
    for method in [m for m in class_.body if isinstance(m, ast.FunctionDef)]:
        args = []
        if method.args.args:
            [args.append([a.col_offset, a.id]) for a in method.args.args]
        if method.args.defaults:
            [args.append([a.col_offset, '=' + a.id]) for a in method.args.defaults]
        sorted_args = sorted(args)
        for i, p in enumerate(sorted_args):
            if p[1].startswith('='):
                sorted_args[i-1][1] += p[1]
        sorted_args = [k[1] for k in sorted_args if not k[1].startswith('=')]

        if method.args.vararg:
            sorted_args.append('*' + method.args.vararg)
        if method.args.kwarg:
            sorted_args.append('**' + method.args.kwarg)

        signature = '(' + ', '.join(sorted_args) + ')'

        print method.name + signature

Answer 5


If all you’re trying to do is print the function then use pydoc.

import pydoc    

def foo(arg1, arg2, *args, **kwargs):                                                                    
    '''Some foo fn'''                                                                                    
    pass                                                                                                 

>>> print pydoc.render_doc(foo).splitlines()[2]
foo(arg1, arg2, *args, **kwargs)

If you’re trying to actually analyze the function signature, then use argspec from the inspect module. I had to do that when validating a user’s hook script function into a general framework.


Answer 6


Example code:

import inspect
from collections import OrderedDict


def get_signature(fn):
    params = inspect.signature(fn).parameters
    args = []
    kwargs = OrderedDict()
    for p in params.values():
        if p.default is p.empty:
            args.append(p.name)
        else:
            kwargs[p.name] = p.default
    return args, kwargs


def test_sig():
    def fn(a, b, c, d=3, e="abc"):
        pass

    assert get_signature(fn) == (
        ["a", "b", "c"], OrderedDict([("d", 3), ("e", "abc")])
    )

Answer 7


Use %pdef in the command line (IPython); it will print only the signature.

e.g. %pdef np.loadtxt

 np.loadtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0, encoding='bytes')

Should all Python classes extend object?

Question: Should all Python classes extend object?


I have found that both of the following work:

class Foo():
    def a(self):
        print "hello"

class Foo(object):
    def a(self):
        print "hello"

Should all Python classes extend object? Are there any potential problems with not extending object?


Answer 0


In Python 2, not inheriting from object will create an old-style class, which, amongst other effects, causes type to give different results:

>>> class Foo: pass
... 
>>> type(Foo())
<type 'instance'>

vs.

>>> class Bar(object): pass
... 
>>> type(Bar())
<class '__main__.Bar'>

Also the rules for multiple inheritance are different in ways that I won’t even try to summarize here. All good documentation that I’ve seen about MI describes new-style classes.

Finally, old-style classes have disappeared in Python 3, and inheritance from object has become implicit. So, always prefer new style classes unless you need backward compat with old software.


Answer 1


In Python 3, classes extend object implicitly, whether you say so yourself or not.

In Python 2, there are old-style and new-style classes. To signal that a class is new-style, you have to inherit explicitly from object. If not, the old-style implementation is used.

You generally want a new-style class. Inherit from object explicitly. Note that this also applies to Python 3 code that aims to be compatible with Python 2.
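
A minimal sketch of the distinction on Python 2 (on Python 3 both forms behave the same):

class OldStyle:           # old-style on Python 2, new-style on Python 3
    pass

class NewStyle(object):   # new-style on both
    pass

print(issubclass(NewStyle, object))  # True everywhere
# On Python 2, issubclass(OldStyle, object) is False; on Python 3 it is True.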


Answer 2


In Python 3 you can create a class in three different ways, and internally they are all equal (see the examples). It doesn’t matter how you create a class: all classes in Python 3 inherit from a special class called object. The class object is a fundamental class in Python and provides a lot of functionality, such as double-underscore methods, descriptors, the super() method, the property() method, etc.

Example 1.

class MyClass:
    pass

Example 2.

class MyClass():
    pass

Example 3.

class MyClass(object):
    pass
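
A quick check (Python 3), with the classes renamed A, B and C so they can coexist, showing that all three forms have object as their only base:

class A: pass
class B(): pass
class C(object): pass

print(A.__bases__, B.__bases__, C.__bases__)
# (<class 'object'>,) (<class 'object'>,) (<class 'object'>,)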

Answer 3


Yes, all Python classes should extend (or rather subclass, this is Python here) object. While normally no serious problems will occur, in some cases (as with multiple inheritance trees) this will be important. This also ensures better compatibility with Python 3.


Answer 4


As other answers have covered, inheritance from object is implicit in Python 3. But they do not state what you should do or what the convention is.

The Python 3 documentation examples all use the following style, which is the convention, so I suggest you follow it for any future code in Python 3.

class Foo:
    pass

Source: https://docs.python.org/3/tutorial/classes.html#class-objects

Example quote:

Class objects support two kinds of operations: attribute references and instantiation.

Attribute references use the standard syntax used for all attribute references in Python: obj.name. Valid attribute names are all the names that were in the class’s namespace when the class object was created. So, if the class definition looked like this:

class MyClass:
    """A simple example class"""
    i = 12345

    def f(self):
        return 'hello world'

Another quote:

Generally speaking, instance variables are for data unique to each instance and class variables are for attributes and methods shared by all instances of the class:

class Dog:

    kind = 'canine'         # class variable shared by all instances

    def __init__(self, name):
        self.name = name    # instance variable unique to each instance

Answer 5


In Python 3 there isn’t a difference, but in Python 2 not extending object gives you an old-style class; you’d want to use a new-style class over an old-style class.


Nested defaultdict of defaultdict

Question: Nested defaultdict of defaultdict


Is there a way to make a defaultdict also be the default for the defaultdict? (i.e. infinite-level recursive defaultdict?)

I want to be able to do:

x = defaultdict(...stuff...)
x[0][1][0]
{}

So, I can do x = defaultdict(defaultdict), but that’s only a second level:

x[0]
{}
x[0][0]
KeyError: 0

There are recipes that can do this. But can it be done simply just using the normal defaultdict arguments?

Note this is asking how to do an infinite-level recursive defaultdict, so it’s distinct from Python: defaultdict of defaultdict?, which was how to do a two-level defaultdict.

I’ll probably just end up using the bunch pattern, but when I realized I didn’t know how to do this, it got me interested.


Answer 0


For an arbitrary number of levels:

from collections import defaultdict

def rec_dd():
    return defaultdict(rec_dd)

>>> x = rec_dd()
>>> x['a']['b']['c']['d']
defaultdict(<function rec_dd at 0x7f0dcef81500>, {})
>>> print json.dumps(x)
{"a": {"b": {"c": {"d": {}}}}}

Of course you could also do this with a lambda, but I find lambdas to be less readable. In any case it would look like this:

rec_dd = lambda: defaultdict(rec_dd)

Answer 1


The other answers here tell you how to create a defaultdict which contains “infinitely many” defaultdicts, but they fail to address what I think may have been your initial need, which was to simply have a two-depth defaultdict.

You may have been looking for:

defaultdict(lambda: defaultdict(dict))

The reasons why you might prefer this construct are:

  • It is more explicit than the recursive solution, and therefore likely more understandable to the reader.
  • This enables the “leaf” of the defaultdict to be something other than a dictionary, e.g. defaultdict(lambda: defaultdict(list)) or defaultdict(lambda: defaultdict(set)) (see the sketch below).
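
For illustration, a minimal sketch of the two-depth construct and a list-leaf variant:

from collections import defaultdict

d = defaultdict(lambda: defaultdict(dict))
d['x']['y']['z'] = 1    # the two defaultdict levels auto-create; the leaf is a plain dict
print(d['x']['y'])      # {'z': 1}

d2 = defaultdict(lambda: defaultdict(list))
d2['x']['y'].append(1)  # list leaves can be appended to directly
print(d2['x']['y'])     # [1]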

Answer 2


There is a nifty trick for doing that:

tree = lambda: defaultdict(tree)

Then you can create your x with x = tree().
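
For instance, a small sketch of the trick in use:

from collections import defaultdict

tree = lambda: defaultdict(tree)
x = tree()
x['a']['b']['c'] = 42   # every missing level springs into existence on access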


Answer 3


Similar to BrenBarn’s solution, but doesn’t contain the name of the variable tree twice, so it works even after changes to the variable dictionary:

tree = (lambda f: f(f))(lambda a: (lambda: defaultdict(a(a))))

Then you can create each new x with x = tree().


For the def version, we can use function closure scope to protect the data structure from the flaw where existing instances stop working if the tree name is rebound. It looks like this:

from collections import defaultdict

def tree():
    def the_tree():
        return defaultdict(the_tree)
    return the_tree()

Answer 4


I would also propose a more OOP-styled implementation, which supports infinite nesting as well as a properly formatted repr.

from collections import defaultdict

class NestedDefaultDict(defaultdict):
    def __init__(self, *args, **kwargs):
        super(NestedDefaultDict, self).__init__(NestedDefaultDict, *args, **kwargs)

    def __repr__(self):
        return repr(dict(self))

Usage:

my_dict = NestedDefaultDict()
my_dict['a']['b'] = 1
my_dict['a']['c']['d'] = 2
my_dict['b']

print(my_dict)  # {'a': {'b': 1, 'c': {'d': 2}}, 'b': {}}

Answer 5


Here is a recursive function to convert a recursive default dict to a normal dict:

def defdict_to_dict(defdict, finaldict):
    # pass in an empty dict for finaldict
    for k, v in defdict.items():
        if isinstance(v, defaultdict):
            # new level created and that is the new value
            finaldict[k] = defdict_to_dict(v, {})
        else:
            finaldict[k] = v
    return finaldict

defdict_to_dict(my_rec_default_dict, {})

Answer 6


I based this off Andrew’s answer here. If you are looking to load data from a json or an existing dict into the nested defaultdict, see this example:

def nested_defaultdict(existing=None, **kwargs):
    if existing is None:
        existing = {}
    if not isinstance(existing, dict):
        return existing
    existing = {key: nested_defaultdict(val) for key, val in existing.items()}
    return defaultdict(nested_defaultdict, existing, **kwargs)

https://gist.github.com/nucklehead/2d29628bb49115f3c30e78c071207775
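
A hypothetical usage sketch, assuming the nested_defaultdict function above and some existing plain-dict data:

data = nested_defaultdict({'a': {'b': 1}})
data['a']['c']['d'] = 2   # missing levels are still created on the fly
print(data['a']['b'])     # 1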


Remove unwanted parts from strings in a column

Question: Remove unwanted parts from strings in a column


I am looking for an efficient way to remove unwanted parts from strings in a DataFrame column.

Data looks like:

    time    result
1    09:00   +52A
2    10:00   +62B
3    11:00   +44a
4    12:00   +30b
5    13:00   -110a

I need to trim these data to:

    time    result
1    09:00   52
2    10:00   62
3    11:00   44
4    12:00   30
5    13:00   110

I tried .str.lstrip('+-') and .str.rstrip('aAbBcC'), but got an error:

TypeError: wrapper() takes exactly 1 argument (2 given)

Any pointers would be greatly appreciated!


Answer 0

data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))

Answer 1


How do I remove unwanted parts from strings in a column?

6 years after the original question was posted, pandas now has a good number of “vectorised” string functions that can succinctly perform these string manipulation operations.

This answer will explore some of these string functions, suggest faster alternatives, and go into a timings comparison at the end.


.str.replace

Specify the substring/pattern to match, and the substring to replace it with.

pd.__version__
# '0.24.1'

df    
    time result
1  09:00   +52A
2  10:00   +62B
3  11:00   +44a
4  12:00   +30b
5  13:00  -110a

df['result'] = df['result'].str.replace(r'\D', '')
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

If you need the result converted to an integer, you can use Series.astype,

df['result'] = df['result'].str.replace(r'\D', '').astype(int)

df.dtypes
time      object
result     int64
dtype: object

If you don’t want to modify df in-place, use DataFrame.assign:

df2 = df.assign(result=df['result'].str.replace(r'\D', ''))
df
# Unchanged

.str.extract

Useful for extracting the substring(s) you want to keep.

df['result'] = df['result'].str.extract(r'(\d+)', expand=False)
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

With extract, it is necessary to specify at least one capture group. expand=False will return a Series with the captured items from the first capture group.


.str.split and .str.get

Splitting works assuming all your strings follow this consistent structure.

# df['result'] = df['result'].str.split(r'\D').str[1]
df['result'] = df['result'].str.split(r'\D').str.get(1)
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

Not recommended if you are looking for a general solution.


If you are satisfied with the succinct and readable str accessor-based solutions above, you can stop here. However, if you are interested in faster, more performant alternatives, keep reading.


Optimizing: List Comprehensions

In some circumstances, list comprehensions should be favoured over pandas string functions. The reason is that string functions are inherently hard to vectorize (in the true sense of the word), so most string and regex functions are only wrappers around loops, with more overhead.

My write-up, Are for-loops in pandas really bad? When should I care?, goes into greater detail.

The str.replace option can be re-written using re.sub

import re

# Pre-compile your regex pattern for more performance.
p = re.compile(r'\D')
df['result'] = [p.sub('', x) for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

The str.extract example can be re-written using a list comprehension with re.search,

p = re.compile(r'\d+')
df['result'] = [p.search(x)[0] for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

If NaNs or no-matches are a possibility, you will need to re-write the above to include some error checking. I do this using a function.

def try_extract(pattern, string):
    try:
        m = pattern.search(string)
        return m.group(0)
    except (TypeError, ValueError, AttributeError):
        return np.nan

p = re.compile(r'\d+')
df['result'] = [try_extract(p, x) for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

We can also re-write @eumiro’s and @MonkeyButter’s answers using list comprehensions:

df['result'] = [x.lstrip('+-').rstrip('aAbBcC') for x in df['result']]

And,

df['result'] = [x[1:-1] for x in df['result']]

Same rules for handling NaNs, etc, apply.


Performance Comparison

Graphs generated using perfplot. Full code listing, for your reference. The relevant functions are listed below.

Some of these comparisons are unfair because they take advantage of the structure of OP’s data, but take from it what you will. One thing to note is that every list comprehension function is either faster than or comparable to its equivalent pandas variant.

Functions

def eumiro(df):
    return df.assign(
        result=df['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC')))

def coder375(df):
    return df.assign(
        result=df['result'].replace(r'\D', r'', regex=True))

def monkeybutter(df):
    return df.assign(result=df['result'].map(lambda x: x[1:-1]))

def wes(df):
    return df.assign(result=df['result'].str.lstrip('+-').str.rstrip('aAbBcC'))

def cs1(df):
    return df.assign(result=df['result'].str.replace(r'\D', ''))

def cs2_ted(df):
    # `str.extract` based solution, similar to @Ted Petrou's. so timing together.
    return df.assign(result=df['result'].str.extract(r'(\d+)', expand=False))
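
# Note: p1 and p2 in the list-comprehension functions below are assumed to be
# the patterns compiled earlier: p1 = re.compile(r'\D'), p2 = re.compile(r'\d+').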

def cs1_listcomp(df):
    return df.assign(result=[p1.sub('', x) for x in df['result']])

def cs2_listcomp(df):
    return df.assign(result=[p2.search(x)[0] for x in df['result']])

def cs_eumiro_listcomp(df):
    return df.assign(
        result=[x.lstrip('+-').rstrip('aAbBcC') for x in df['result']])

def cs_mb_listcomp(df):
    return df.assign(result=[x[1:-1] for x in df['result']])

Answer 2


I’d use the pandas replace function; it’s very simple and powerful, as you can use regex. Below I’m using the regex \D to remove any non-digit characters, but obviously you could get quite creative with regex.

data['result'].replace(regex=True,inplace=True,to_replace=r'\D',value=r'')

Answer 3


In the particular case where you know the number of positions that you want to remove from the dataframe column, you can use string indexing inside a lambda function to get rid of those parts:

Last character:

data['result'] = data['result'].map(lambda x: str(x)[:-1])

First two characters:

data['result'] = data['result'].map(lambda x: str(x)[2:])

Answer 4


There’s a bug here: currently you cannot pass arguments to str.lstrip and str.rstrip:

http://github.com/pydata/pandas/issues/2411

EDIT: 2012-12-07 this works now on the dev branch:

In [8]: df['result'].str.lstrip('+-').str.rstrip('aAbBcC')
Out[8]: 
1     52
2     62
3     44
4     30
5    110
Name: result

Answer 5


A very simple method would be to use the extract method to select all the digits. Simply supply it the regular expression '\d+' which extracts any number of digits.

df['result'] = df.result.str.extract(r'(\d+)', expand=True).astype(int)
df

    time  result
1  09:00      52
2  10:00      62
3  11:00      44
4  12:00      30
5  13:00     110

Answer 6


I often use list comprehensions for these types of tasks because they’re often faster.

There can be big differences in performance between the various methods for doing things like this (i.e. modifying every element of a series within a DataFrame). Often a list comprehension can be the fastest – see the code race below for this task:

import pandas as pd
#Map
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))
10000 loops, best of 3: 187 µs per loop
#List comprehension
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = [x.lstrip('+-').rstrip('aAbBcC') for x in data['result']]
10000 loops, best of 3: 117 µs per loop
#.str
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].str.lstrip('+-').str.rstrip('aAbBcC')
1000 loops, best of 3: 336 µs per loop

Answer 7


Suppose your DF has those extra characters in between the numbers as well, as in the last entry.

  result   time
0   +52A  09:00
1   +62B  10:00
2   +44a  11:00
3   +30b  12:00
4  -110a  13:00
5   3+b0  14:00

You can try str.replace to remove characters not only from start and end but also from in between.

DF['result'] = DF['result'].str.replace('\+|a|b|\-|A|B', '')

Output:

  result   time
0     52  09:00
1     62  10:00
2     44  11:00
3     30  12:00
4    110  13:00
5     30  14:00

Answer 8


Try this using a regular expression:

import re
data['result'] = data['result'].map(lambda x: re.sub('[-+A-Za-z]', '', x))

How do you get a directory listing sorted by creation date in Python?

Question: How do you get a directory listing sorted by creation date in Python?


What is the best way to get a list of all files in a directory, sorted by date [created | modified], using python, on a windows machine?


Answer 0


Update: to sort dirpath‘s entries by modification date in Python 3:

import os
from pathlib import Path

paths = sorted(Path(dirpath).iterdir(), key=os.path.getmtime)

(put @Pygirl’s answer here for greater visibility)

If you already have a list of filenames files, then to sort it inplace by creation time on Windows:

files.sort(key=os.path.getctime)

The list of files you could get, for example, using glob as shown in @Jay’s answer.


Old answer: here’s a more verbose version of @Greg Hewgill’s answer. It is the most conforming to the question’s requirements: it makes a distinction between creation and modification dates (at least on Windows).

#!/usr/bin/env python
from stat import S_ISREG, ST_CTIME, ST_MODE
import os, sys, time

# path to the directory (relative or absolute)
dirpath = sys.argv[1] if len(sys.argv) == 2 else r'.'

# get all entries in the directory w/ stats
entries = (os.path.join(dirpath, fn) for fn in os.listdir(dirpath))
entries = ((os.stat(path), path) for path in entries)

# leave only regular files, insert creation date
entries = ((stat[ST_CTIME], path)
           for stat, path in entries if S_ISREG(stat[ST_MODE]))
#NOTE: on Windows `ST_CTIME` is a creation date 
#  but on Unix it could be something else
#NOTE: use `ST_MTIME` to sort by a modification date

for cdate, path in sorted(entries):
    print time.ctime(cdate), os.path.basename(path)

Example:

$ python stat_creation_date.py
Thu Feb 11 13:31:07 2009 stat_creation_date.py

Answer 1


I’ve done this in the past for a Python script to determine the last updated files in a directory:

import glob
import os

search_dir = "/mydir/"
# remove anything from the list that is not a file (directories, symlinks)
# thanks to J.F. Sebastion for pointing out that the requirement was a list 
# of files (presumably not including directories)  
files = list(filter(os.path.isfile, glob.glob(search_dir + "*")))
files.sort(key=lambda x: os.path.getmtime(x))

That should do what you’re looking for based on file mtime.

EDIT: Note that you can also use os.listdir() in place of glob.glob() if desired – the reason I used glob in my original code was that I wanted to use glob to only search for files with a particular set of file extensions, which glob() was better suited to. To use listdir, here’s what it would look like:

import os

search_dir = "/mydir/"
os.chdir(search_dir)
files = filter(os.path.isfile, os.listdir(search_dir))
files = [os.path.join(search_dir, f) for f in files] # add path to each file
files.sort(key=lambda x: os.path.getmtime(x))

Answer 2


There is an os.path.getmtime function that gives the number of seconds since the epoch and should be faster than os.stat.

import os 

os.chdir(directory)
sorted(filter(os.path.isfile, os.listdir('.')), key=os.path.getmtime)

Answer 3


Here’s my version:

def getfiles(dirpath):
    a = [s for s in os.listdir(dirpath)
         if os.path.isfile(os.path.join(dirpath, s))]
    a.sort(key=lambda s: os.path.getmtime(os.path.join(dirpath, s)))
    return a

First, we build a list of the file names. isfile() is used to skip directories; it can be omitted if directories should be included. Then, we sort the list in place, using the modification date as the key.


Answer 4

Here’s a one-liner:

import os
import time
from pprint import pprint

pprint([(x[0], time.ctime(x[1].st_ctime)) for x in sorted([(fn, os.stat(fn)) for fn in os.listdir(".")], key = lambda x: x[1].st_ctime)])

This calls os.listdir() to get a list of the filenames, then calls os.stat() for each one to get the creation time, then sorts against the creation time.

Note that this method only calls os.stat() once for each file, which will be more efficient than calling it for each comparison in a sort.


Answer 5

Without changing directory:

import os    

path = '/path/to/files/'
name_list = os.listdir(path)
full_list = [os.path.join(path,i) for i in name_list]
time_sorted_list = sorted(full_list, key=os.path.getmtime)

print time_sorted_list

# if you want just the filenames sorted, simply remove the dir from each
sorted_filename_list = [ os.path.basename(i) for i in time_sorted_list]
print sorted_filename_list

Answer 6

In Python 3.5+:

from pathlib import Path
sorted(Path('.').iterdir(), key=lambda f: f.stat().st_mtime)

Answer 7

Here’s my answer using glob, without a filter, if you want to read files with a certain extension in date order (Python 3).

import glob
import os

dataset_path = '/mydir/'
files = glob.glob(dataset_path + "/morepath/*.extension")
files.sort(key=os.path.getmtime)

Answer 8

# *** the shortest and best way ***
# getmtime --> sort by modified time
# getctime --> sort by created time

import glob,os

lst_files = glob.glob("*.txt")
lst_files.sort(key=os.path.getmtime)
print("\n".join(lst_files))

Answer 9

sorted(filter(os.path.isfile, os.listdir('.')), 
    key=lambda p: os.stat(p).st_mtime)

You could use os.walk('.').next()[-1] instead of filtering with os.path.isfile, but that leaves dead symlinks in the list, and os.stat will fail on them.


Answer 10

from pathlib import Path
import os

sorted(Path('./').iterdir(), key=lambda t: t.stat().st_mtime)

or

sorted(Path('./').iterdir(), key=os.path.getmtime)

or

sorted(os.scandir('./'), key=lambda t: t.stat().st_mtime)

where mtime is the modification time.


Answer 11

This is a basic example to learn from:

import os, sys
import time

dirpath = sys.argv[1] if len(sys.argv) == 2 else r'.'

for name in os.listdir(dirpath):
    # resolve each entry to an absolute path before stat-ing it
    path = os.path.realpath(os.path.join(dirpath, name))
    st = os.stat(path)
    print time.ctime(st.st_ctime), path

Answer 12

Alex Coventry’s answer will produce an exception if the file is a symlink to a nonexistent file; the following code corrects that answer:

import os
import time
from datetime import datetime

sorted(filter(os.path.isfile, os.listdir('.')),
       key=lambda p: os.path.exists(p) and os.stat(p).st_mtime
                     or time.mktime(datetime.now().timetuple()))

When the file doesn’t exist, now() is used, and the symlink will go at the very end of the list.


Answer 13

Here are a couple of simple lines that filter by extension as well as provide a sort option:

import os
import re

def get_sorted_files(src_dir, regex_ext='*', sort_reverse=False): 
    files_to_evaluate = [os.path.join(src_dir, f) for f in os.listdir(src_dir) if re.search(r'.*\.({})$'.format(regex_ext), f)]
    files_to_evaluate.sort(key=os.path.getmtime, reverse=sort_reverse)
    return files_to_evaluate

Answer 14

For completeness, with os.scandir (about 2x faster than pathlib):

import os
sorted(os.scandir('/tmp/test'), key=lambda d: d.stat().st_mtime)

Answer 15

This is my version:

import os

folder_path = r'D:\Movies\extra\new\dramas' # your path
os.chdir(folder_path) # make the path active
x = sorted(os.listdir(), key=os.path.getctime)  # sorted using creation time

for name in x:
    print(name) # print every entry name inside folder_path

Answer 16

Maybe you should use shell commands. In Unix/Linux, find piped into sort will probably be able to do what you want.
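
For example, here is a hedged sketch using GNU find (the -printf action is a GNU extension, not POSIX): it prints each regular file prefixed with its modification timestamp, sorts numerically, then strips the timestamp.

find . -maxdepth 1 -type f -printf '%T@ %p\n' | sort -n | cut -d' ' -f2-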


Randomly select 50 items from a list to write to a file

Question: Randomly select 50 items from a list to write to a file

So far I have figured out how to import the file, create new files, and randomize the list.

I’m having trouble selecting only 50 items from the list randomly to write to a file?

import os
import random

def randomizer(input,output1='random_1.txt',output2='random_2.txt',output3='random_3.txt',output4='random_total.txt'):

#Input file 
    query=open(input,'r').read().split()
    dir,file=os.path.split(input)

    temp1 = os.path.join(dir,output1)
    temp2 = os.path.join(dir,output2)
    temp3 = os.path.join(dir,output3)
    temp4 = os.path.join(dir,output4)


    out_file4=open(temp4,'w')

    random.shuffle(query)

    for item in query:
        out_file4.write(item+'\n')   

So if the total randomization file was

example:

random_total = ['9','2','3','1','5','6','8','7','0','4']

I would want 3 files (out_file1|2|3) with the first random set of 3, second random set of 3, and third random set of 3 (for this example, but the one I want to create should have 50)

random_1 = ['9','2','3']
random_2 = ['1','5','6']
random_3 = ['8','7','0']

So the last ‘4’ will not be included which is fine.

How can I select 50 from the list that I randomized ?

Even better, how could I select 50 at random from the original list ?


Answer 0

If the list is in random order, you can just take the first 50.

Otherwise, use

import random
random.sample(the_list, 50)

random.sample help text:

sample(self, population, k) method of random.Random instance
    Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    To choose a sample in a range of integers, use xrange as an argument.
    This is especially fast and space efficient for sampling from a
    large population:   sample(xrange(10000000), 60)
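
For the three-files part of the question, here is a hedged sketch (the file names and the 150-item draw are illustrative assumptions): draw one sample of 150 so the three sets are disjoint, then slice it into chunks of 50.

import random

picks = random.sample(the_list, 150)  # assumes the_list has at least 150 items
for n in range(3):
    chunk = picks[50 * n:50 * (n + 1)]
    with open('random_%d.txt' % (n + 1), 'w') as f:
        f.write('\n'.join(chunk) + '\n')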

Answer 1

One easy way to select random items is to shuffle then slice.

import random
a = [1,2,3,4,5,6,7,8,9]
random.shuffle(a)
print a[:4] # prints 4 random items

Answer 2

I think numpy.random.choice() is a better option:

import numpy as np

mylist = [13,23,14,52,6,23]

np.random.choice(mylist, 3, replace=False)

The function returns an array of 3 randomly chosen values from the list.


Answer 3

Say your list has 100 elements and you want to pick 50 of them at random. (Note that choice picks with replacement, so the same element can come up more than once; use random.sample if the 50 picks must be distinct.) Here are the steps to follow:

  1. Import the libraries
  2. Create the seed for random number generator, I have put it at 2
  3. Prepare a list of numbers from which to pick up in a random way
  4. Make the random choices from the numbers list

Code:

from random import seed
from random import choice

seed(2)
numbers = [i for i in range(100)]

print(numbers)

for _ in range(50):
    selection = choice(numbers)
    print(selection)

Initialize a numpy array

Question: Initialize a numpy array

Is there a way to initialize a numpy array of a given shape and add to it? I will explain what I need with a list example. If I want to create a list of objects generated in a loop, I can do:

a = []
for i in range(5):
    a.append(i)

I want to do something similar with a numpy array. I know about vstack, concatenate etc. However, it seems these require two numpy arrays as inputs. What I need is:

big_array # Initially empty. This is where I don't know what to specify
for i in range(5):
    array i of shape = (2,4) created.
    add to big_array

The big_array should have a shape (10,4). How to do this?


EDIT:

I want to add the following clarification. I am aware that I can define big_array = numpy.zeros((10,4)) and then fill it up. However, this requires specifying the size of big_array in advance. I know the size in this case, but what if I do not? When we use the .append function for extending the list in python, we don’t need to know its final size in advance. I am wondering if something similar exists for creating a bigger array from smaller arrays, starting with an empty array.


Answer 0

numpy.zeros

Return a new array of given shape and type, filled with zeros.

or

numpy.ones

Return a new array of given shape and type, filled with ones.

or

numpy.empty

Return a new array of given shape and type, without initializing entries.


However, the approach of constructing an array by appending elements to a list is not much used in numpy, because it’s less efficient (numpy datatypes are much closer to the underlying C arrays). Instead, you should preallocate the array to the size that you need, and then fill in the rows. You can use numpy.append if you must, though.
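
As a hedged sketch of that preallocate-then-fill approach (the shapes are illustrative, matching the question):

import numpy

big_array = numpy.zeros((10, 4))        # preallocate the full result
for i in range(5):
    block = i * numpy.ones((2, 4))      # stand-in for each generated (2, 4) array
    big_array[2 * i:2 * (i + 1), :] = block  # fill two rows at a time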


Answer 1

The way I usually do that is by creating a regular list, then append my stuff into it, and finally transform the list to a numpy array as follows :

import numpy as np
big_array = [] #  empty regular list
for i in range(5):
    arr = i*np.ones((2,4)) # for instance
    big_array.append(arr)
big_np_array = np.array(big_array)  # transformed to a numpy array

Of course, the final object takes twice the space in memory during the creation step, but appending to a Python list is very fast, and so is creating the array with np.array().


Answer 2

Introduced in numpy 1.8:

numpy.full

Return a new array of given shape and type, filled with fill_value.

Examples:

>>> import numpy as np
>>> np.full((2, 2), np.inf)
array([[ inf,  inf],
       [ inf,  inf]])
>>> np.full((2, 2), 10)
array([[10, 10],
       [10, 10]])

Answer 3

The array analogue of Python’s

a = []
for i in range(5):
    a.append(i)

is:

import numpy as np

a = np.empty((0))
for i in range(5):
    a = np.append(a, i)

Answer 4

numpy.fromiter() is what you are looking for:

big_array = numpy.fromiter(xrange(5), dtype="int")

It also works with generator expressions, e.g.:

big_array = numpy.fromiter( (i*(i+1)/2 for i in xrange(5)), dtype="int" )

If you know the length of the array in advance, you can specify it with an optional ‘count’ argument.
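
For instance, with the length known in advance (the values are illustrative):

big_array = numpy.fromiter(xrange(5), dtype="int", count=5)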


Answer 5

You do want to avoid explicit loops as much as possible when doing array computing, as that reduces the speed gain from that form of computing. There are multiple ways to initialize a numpy array. If you want it filled with zeros, do as katrielalex said:

big_array = numpy.zeros((10,4))

EDIT: What sort of sequence is it you’re making? You should check out the different numpy functions that create arrays, like numpy.linspace(start, stop, size) (equally spaced numbers), or numpy.arange(start, stop, inc). Where possible, these functions will make arrays substantially faster than doing the same work in explicit loops.
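
A quick illustration of those helpers:

import numpy

numpy.linspace(0.0, 1.0, 5)  # array([0.  , 0.25, 0.5 , 0.75, 1.  ])
numpy.arange(0, 10, 2)       # array([0, 2, 4, 6, 8])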


Answer 6

For your first array example use,

a = numpy.arange(5)

To initialize big_array, use

big_array = numpy.zeros((10,4))

This assumes you want to initialize with zeros, which is pretty typical, but there are many other ways to initialize an array in numpy.

Edit: If you don’t know the size of big_array in advance, it’s generally best to first build a Python list using append, and when you have everything collected in the list, convert this list to a numpy array using numpy.array(mylist). The reason for this is that lists are meant to grow very efficiently and quickly, whereas numpy.concatenate would be very inefficient since numpy arrays don’t change size easily. But once everything is collected in a list, and you know the final array size, a numpy array can be efficiently constructed.


Answer 7

To initialize a numpy array with a specific matrix:

import numpy as np

mat = np.array([[1, 1, 0, 0, 0],
                [0, 1, 0, 0, 1],
                [1, 0, 0, 1, 1],
                [0, 0, 0, 0, 0],
                [1, 0, 1, 0, 1]])

print mat.shape
print mat

output:

(5, 5)
[[1 1 0 0 0]
 [0 1 0 0 1]
 [1 0 0 1 1]
 [0 0 0 0 0]
 [1 0 1 0 1]]

Answer 8

Whenever you are in the following situation:

a = []
for i in range(5):
    a.append(i)

and you want something similar in numpy, several previous answers have pointed out ways to do it, but as @katrielalex pointed out these methods are not efficient. The efficient way to do this is to build a long list and then reshape it the way you want after you have a long list. For example, let’s say I am reading some lines from a file and each row has a list of numbers and I want to build a numpy array of shape (number of lines read, length of vector in each row). Here is how I would do it more efficiently:

import numpy as np

long_list = []
counter = 0
with open('filename', 'r') as f:
    for row in f:
        row_list = row.split()
        long_list.extend(row_list)
        counter += 1
#  now we have a long list and we are ready to reshape
result = np.array(long_list).reshape(counter, len(row_list)) #  desired numpy array

Answer 9

I realize that this is a bit late, but I did not notice any of the other answers mentioning indexing into the empty array:

import numpy

big_array = numpy.empty((10, 4))
for i in range(5):
    array_i = numpy.random.random((2, 4))
    big_array[2 * i:2 * (i + 1), :] = array_i

This way, you preallocate the entire result array with numpy.empty and fill in the rows as you go using indexed assignment.

It is perfectly safe to preallocate with empty instead of zeros in the example you gave since you are guaranteeing that the entire array will be filled with the chunks you generate.


Answer 10

I’d suggest defining the shape first, then iterating over it to insert values:

import numpy as np

big_array = np.zeros(shape=(6, 2))
for it in range(6):
    big_array[it] = (it, it) # For example

>>> big_array

array([[ 0.,  0.],
       [ 1.,  1.],
       [ 2.,  2.],
       [ 3.,  3.],
       [ 4.,  4.],
       [ 5.,  5.]])

Answer 11

Maybe something like this will fit your needs:

import numpy as np

N = 5
res = []

for i in range(N):
    res.append(np.cumsum(np.ones(shape=(2,4))))

res = np.array(res).reshape((10, 4))
print(res)

Which produces the following output

[[ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]]

Python: extract a pattern match

Question: Python: extract a pattern match

Python 2.7.1. I am trying to use a Python regular expression to extract words inside of a pattern.

I have some string that looks like this

someline abc
someother line
name my_user_name is valid
some more lines

I want to extract the word “my_user_name”. I do something like

import re
s = #that big string
p = re.compile("name .* is valid", re.flags)
p.match(s) #this gives me <_sre.SRE_Match object at 0x026B6838>

How do I extract my_user_name now?


Answer 0

You need to capture from the regex: search for the pattern and, if found, retrieve the string using group(index). Assuming valid checks are performed:

>>> p = re.compile("name (.*) is valid")
>>> result = p.search(s)
>>> result
<_sre.SRE_Match object at 0x10555e738>
>>> result.group(1)     # group(1) will return the 1st capture (stuff within the brackets).
                        # group(0) will return the entire matched text.
'my_user_name'

Answer 1

You can use matching groups:

p = re.compile('name (.*) is valid')

e.g.

>>> import re
>>> p = re.compile('name (.*) is valid')
>>> s = """
... someline abc
... someother line
... name my_user_name is valid
... some more lines"""
>>> p.findall(s)
['my_user_name']

Here I use re.findall rather than re.search to get all instances of my_user_name. Using re.search, you’d need to get the data from the group on the match object:

>>> p.search(s)   #gives a match object or None if no match is found
<_sre.SRE_Match object at 0xf5c60>
>>> p.search(s).group() #entire string that matched
'name my_user_name is valid'
>>> p.search(s).group(1) #first group that match in the string that matched
'my_user_name'

As mentioned in the comments, you might want to make your regex non-greedy:

p = re.compile('name (.*?) is valid')

to only pick up the stuff between 'name ' and the next ' is valid' (rather than letting your regex pick up another ' is valid' into your group).
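
To see the difference, here is a small illustration on a made-up string that contains two possible matches:

import re

s = 'name alice is valid and name bob is valid'
re.search('name (.*) is valid', s).group(1)   # greedy: 'alice is valid and name bob'
re.search('name (.*?) is valid', s).group(1)  # non-greedy: 'alice'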


Answer 2

You could use something like this:

import re
s = #that big string
# the parenthesis create a group with what was matched
# and '\w' matches only alphanumeric charactes
p = re.compile("name +(\w+) +is valid", re.flags)
# use search(), so the match doesn't have to happen 
# at the beginning of "big string"
m = p.search(s)
# search() returns a Match object with information about what was matched
if m:
    name = m.group(1)
else:
    raise Exception('name not found')

Answer 3

Maybe that’s a bit shorter and easier to understand:

>>> import re
>>> text = '... someline abc... someother line... name my_user_name is valid.. some more lines'
>>> re.search('name (.*) is valid', text).group(1)
'my_user_name'

Answer 4

You want a capture group.

p = re.compile("name (.*) is valid", re.flags) # parentheses for capture groups
print p.match(s).groups() # This gives you a tuple of your matches.

Answer 5

You can use groups (indicated with '(' and ')') to capture parts of the string. The match object’s group() method then gives you the group’s contents:

>>> import re
>>> s = 'name my_user_name is valid'
>>> match = re.search('name (.*) is valid', s)
>>> match.group(0)  # the entire match
'name my_user_name is valid'
>>> match.group(1)  # the first parenthesized subgroup
'my_user_name'

In Python 3.6+ you can also index into a match object instead of using group():

>>> match[0]  # the entire match 
'name my_user_name is valid'
>>> match[1]  # the first parenthesized subgroup
'my_user_name'

Answer 6

Here’s a way to do it without using groups (Python 3.6 or above):

>>> re.search('2\d\d\d[01]\d[0-3]\d', 'report_20191207.xml')[0]
'20191207'

Answer 7

You can also use a capture group (?P<user>pattern) and access the group like a dictionary match['user'].

import re

string = '''someline abc\n
            someother line\n
            name my_user_name is valid\n
            some more lines\n'''

pattern = r'name (?P<user>.*) is valid'
matches = re.search(pattern, str(string), re.DOTALL)
print(matches['user'])

# my_user_name

Answer 8

It seems like you’re actually trying to extract a name rather than simply find a match. If this is the case, having span indexes for your match is helpful, and I’d recommend using re.finditer. As a shortcut, you know the name part of your regex is length 5 and the is valid part is length 9, so you can slice the matching text to extract the name.

Note – In your example, it looks like s is a string with line breaks, so that’s what’s assumed below.

import re

## convert s to a list of strings separated by line:
s2 = s.splitlines()

## find matches by line: 
for i, j in enumerate(s2):
    matches = re.finditer("name (.*) is valid", j)
    ## ignore lines without a match
    if matches:
        ## loop through match group elements
        for k in matches:
            ## get text
            match_txt = k.group(0)
            ## get line span
            match_span = k.span(0)
            ## extract username
            my_user_name = match_txt[5:-9]
            ## compare with original text
            print(f'Extracted Username: {my_user_name} - found on line {i}')
            print('Match Text:', match_txt)

Anaconda: export an environment file

Question: Anaconda: export an environment file

How can I make an anaconda environment file which could be used on other computers?

I exported my anaconda python environment to YML using conda env export > environment.yml. The exported environment.yml contains the line prefix: /home/superdev/miniconda3/envs/juicyenv, which maps to my anaconda’s location and will be different on other PCs.


Answer 0

I can’t find anything in the conda specs which allows you to export an environment file without the prefix: ... line. However, as Alex pointed out in the comments, conda doesn’t seem to care about the prefix line when creating an environment from a file.

With that in mind, if you want the other user to have no knowledge of your default install path, you can remove the prefix line with grep before writing to environment.yml.

conda env export | grep -v "^prefix: " > environment.yml

Either way, the other user then runs:

conda env create -f environment.yml

and the environment will get installed in their default conda environment path.

If you want to specify a different install path than the default for your system (not related to ‘prefix’ in the environment.yml), just use the -p flag followed by the required path.

conda env create -f environment.yml -p /home/user/anaconda3/envs/env_name

Note that Conda recommends creating the environment.yml by hand, which is especially important if you are wanting to share your environment across platforms (Windows/Linux/Mac). In this case, you can just leave out the prefix line.
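
For example, a minimal hand-written environment.yml might look like this (the name and version pins are illustrative assumptions):

name: shared_env
channels:
  - defaults
dependencies:
  - python=3.8
  - numpy
  - pip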


Answer 1

The easiest way to save the packages from an environment to be installed in another computer is:

$ conda list -e > req.txt

then you can install the environment using

$ conda create -n new_environment --file req.txt

If you use pip, use the following commands (reference: https://pip.pypa.io/en/stable/reference/pip_freeze/):

$ env1/bin/pip freeze > requirements.txt
$ env2/bin/pip install -r requirements.txt

Answer 2

  • Linux

    conda env export --no-builds | grep -v "prefix" > environment.yml

  • Windows

    conda env export --no-builds | findstr -v "prefix" > environment.yml


Rationale: By default, conda env export includes the build information:

$ conda env export
...
dependencies:
  - backcall=0.1.0=py37_0
  - blas=1.0=mkl
  - boto=2.49.0=py_0
...

You can instead export your environment without build info:

$ conda env export --no-builds
...
dependencies:
  - backcall=0.1.0
  - blas=1.0
  - boto=2.49.0
...

Which unties the environment from the Python version and OS.


Answer 3

I find that exporting the packages in plain string format is more portable than exporting the whole conda environment. As the previous answer already suggested:

$ conda list -e > requirements.txt

However, this requirements.txt contains build numbers, which are not portable between operating systems, e.g. between Mac and Ubuntu. conda env export offers the --no-builds option but conda list -e does not, so we can remove the build numbers by issuing the following command:

$ sed -i -E "s/^(.*\=.*)(\=.*)/\1/" requirements.txt 

And recreate the environment on another computer:

conda create -n recreated_env --file requirements.txt 
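
For illustration, the sed command turns build-pinned entries into portable ones (these package lines are made up):

numpy=1.16.4=py37h99e49ec_0    ->    numpy=1.16.4
pandas=0.24.2=py37he6710b0_0   ->    pandas=0.24.2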

Answer 4

  1. First activate your conda environment (the one you want to export/back up):

conda activate myEnv

  2. Export all packages to a file (myEnvBkp.txt):

conda list --explicit > myEnvBkp.txt

  3. Restore/import the environment:

conda create --name myEnvRestored --file myEnvBkp.txt