Filename and line number of a Python script

Question: Filename and line number of a Python script

How can I get the file name and line number in a Python script?

Exactly the file information we get from an exception traceback, but in this case without raising an exception.


Answer 0

Thanks to mcandre, the answer is:

#python3
from inspect import currentframe, getframeinfo

frameinfo = getframeinfo(currentframe())

print(frameinfo.filename, frameinfo.lineno)

回答 1

是否使用currentframe().f_back取决于是否使用功能。

直接调用检查:

from inspect import currentframe, getframeinfo

cf = currentframe()
filename = getframeinfo(cf).filename

print "This is line 5, python says line ", cf.f_lineno 
print "The filename is ", filename

调用为您执行此操作的函数:

from inspect import currentframe

def get_linenumber():
    cf = currentframe()
    return cf.f_back.f_lineno

print "This is line 7, python says line ", get_linenumber()

Whether you use currentframe().f_back depends on whether you are using a function or not.

Calling inspect directly:

from inspect import currentframe, getframeinfo

cf = currentframe()
filename = getframeinfo(cf).filename
print("This is line 5, python says line ", cf.f_lineno)
print("The filename is ", filename)

Calling a function that does it for you:

from inspect import currentframe

def get_linenumber():
    cf = currentframe()
    return cf.f_back.f_lineno

print("This is line 7, python says line ", get_linenumber())

Answer 2

Handy if used in a common file – prints file name, line number and function of the caller:

import inspect

def getLineInfo():
    caller = inspect.stack()[1]   # one frame up: the caller
    print(caller[1], ":", caller[2], ":", caller[3])  # filename : lineno : function

Answer 3

Filename:

__file__
# or
sys.argv[0]

Line:

inspect.currentframe().f_lineno

(not inspect.currentframe().f_back.f_lineno as mentioned above)
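Putting the two together, a minimal sketch:

```python
import inspect

filename = __file__                        # path of the current script
lineno = inspect.currentframe().f_lineno   # the line this expression sits on

print(filename, lineno)
```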


Answer 4

Using sys also works (note: this snippet and its dir() output are from Python 2):

import sys

print dir(sys._getframe())
print dir(sys._getframe().f_lineno)
print sys._getframe().f_lineno

The output is:

['__class__', '__delattr__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'f_back', 'f_builtins', 'f_code', 'f_exc_traceback', 'f_exc_type', 'f_exc_value', 'f_globals', 'f_lasti', 'f_lineno', 'f_locals', 'f_restricted', 'f_trace']
['__abs__', '__add__', '__and__', '__class__', '__cmp__', '__coerce__', '__delattr__', '__div__', '__divmod__', '__doc__', '__float__', '__floordiv__', '__format__', '__getattribute__', '__getnewargs__', '__hash__', '__hex__', '__index__', '__init__', '__int__', '__invert__', '__long__', '__lshift__', '__mod__', '__mul__', '__neg__', '__new__', '__nonzero__', '__oct__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdiv__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rlshift__', '__rmod__', '__rmul__', '__ror__', '__rpow__', '__rrshift__', '__rshift__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__trunc__', '__xor__', 'bit_length', 'conjugate', 'denominator', 'imag', 'numerator', 'real']
14

Answer 5

Just to contribute:

there is a linecache module in Python; here are two links that can help.

linecache module documentation
linecache source code

In a sense, you can “dump” a whole file into its cache and read it back via the linecache.cache data.

import linecache as allLines
# fileName behaves like any other open() call: you need a path to the file
# if it is not in the same directory as the script
linesList = allLines.updatecache(fileName, None)
for i, x in enumerate(linesList):
    print(i, x)  # prints the line number and content
# or, for more info
print(allLines.cache)
# or, if you need a specific line
specLine = allLines.getline(fileName, numbOfLine)
# returns the text of that line number

Additionally, for error handling you can simply use:

from sys import exc_info
try:
    raise YourError  # or some other error
except Exception:
    print(exc_info())

Answer 6

import inspect

file_name = __file__
current_line_no = inspect.stack()[0][2]
current_function_name = inspect.stack()[0][3]

# Try printing inspect.stack(); you can see the current stack and pick whatever you want

Answer 7

In Python 3 you can use a variation on:

import sys

def Deb(msg=None):
    print(f"Debug {sys._getframe().f_back.f_lineno}: {msg if msg is not None else ''}")

In code, you can then use:

Deb("Some useful information")
Deb()

To produce:

Debug 123: Some useful information
Debug 124:

Where the 123 and 124 are the lines that the calls are made from.


Answer 8

Here’s what works for me to get the line number in Python 3.7.3 in VSCode 1.39.2 (dmsg is my mnemonic for debug message):

import inspect

def dmsg(text_s):
    print(str(inspect.currentframe().f_back.f_lineno) + '| ' + text_s)

To call showing a variable name_s and its value:

name_s = put_code_here
dmsg('name_s: ' + name_s)

Output looks like this:

37| name_s: value_of_variable_at_line_37

How to delete all the data in a table using Django

Question: How to delete all the data in a table using Django

I have two questions:

  1. How do I delete a table in Django?
  2. How do I remove all the data in the table?

This is my code, which is not successful:

Reporter.objects.delete()

Answer 0

Inside a manager:

def delete_everything(self):
    Reporter.objects.all().delete()

def drop_table(self):
    # requires: from django.db import connection
    cursor = connection.cursor()
    table_name = self.model._meta.db_table
    sql = "DROP TABLE %s;" % (table_name,)
    cursor.execute(sql)

Answer 1

As per the latest documentation, the correct method to call would be:

Reporter.objects.all().delete()

Answer 2

If you want to remove all the data from all your tables, you might want to try the command python manage.py flush. This will delete all of the data in your tables, but the tables themselves will still exist.

See more here: https://docs.djangoproject.com/en/1.8/ref/django-admin/


Answer 3

Using the shell:

1) To drop the table:

python manage.py dbshell
>> DROP TABLE {app_name}_{model_name}

2) To remove all data from the table:

python manage.py shell
>> from {app_name}.models import {model_name}
>> {model_name}.objects.all().delete()

Answer 4

Django 1.11: delete all objects from a database table:

Entry.objects.all().delete()  ## Entry being Model Name. 

Refer to the official Django documentation, quoted below: https://docs.djangoproject.com/en/1.11/topics/db/queries/#deleting-objects

Note that delete() is the only QuerySet method that is not exposed on a Manager itself. This is a safety mechanism to prevent you from accidentally requesting Entry.objects.delete(), and deleting all the entries. If you do want to delete all the objects, then you have to explicitly request a complete query set:

I tried the code snippet seen below myself, within my somefilename.py:

    # for deleting model objects
    from django.db import connection
    def del_model_4(self):
        with connection.schema_editor() as schema_editor:
            schema_editor.delete_model(model_4)

and within my views.py I have a view that simply renders an html page…

  def data_del_4(request):
      obj = calc_2() ## 
      obj.del_model_4()
      return render(request, 'dc_dash/data_del_4.html') ## 

It ended up deleting all entries from model == model_4, but now I see an error screen in the Admin console when I try to ascertain that all objects of model_4 have been deleted…

ProgrammingError at /admin/dc_dash/model_4/
relation "dc_dash_model_4" does not exist
LINE 1: SELECT COUNT(*) AS "__count" FROM "dc_dash_model_4" 

Do consider that if we do not go to the ADMIN console and try to see objects of the model which have already been deleted, the Django app works just as intended.

django admin screencapture


Answer 5

There are a couple of ways:

To delete it directly:

SomeModel.objects.filter(id=id).delete()

To delete it from an instance:

instance1 = SomeModel.objects.get(id=id)
instance1.delete()

(don’t use the same name)


Is there a way to delete created variables, functions, etc. from the interpreter’s memory?

Question: Is there a way to delete created variables, functions, etc. from the interpreter’s memory?

I’ve been searching for the accurate answer to this question for a couple of days now but haven’t got anything good. I’m not a complete beginner in programming, but not yet even on the intermediate level.

When I’m in the shell of Python, I type: dir() and I can see all the names of all the objects in the current scope (main block), there are 6 of them:

['__builtins__', '__doc__', '__loader__', '__name__', '__package__', '__spec__']

Then, when I declare a variable, for example x = 10, it is automatically added to that list of objects shown by dir(), and when I type dir() again, it shows:

['__builtins__', '__doc__', '__loader__', '__name__', '__package__', '__spec__', 'x']

The same goes for functions, classes and so on.

How do I delete all those new objects without erasing the standard 6 which were available at the beginning?

I’ve read here about “memory cleaning” and “cleaning of the console”, which erases all the text from the command prompt window:

>>> import os
>>> clear = lambda: os.system('cls')
>>> clear()

But all this has nothing to do with what I’m trying to achieve, it doesn’t clean out all used objects.


Answer 0

You can delete individual names with del:

del x

or you can remove them from the globals() object:

for name in dir():
    if not name.startswith('_'):
        del globals()[name]

This is just an example loop; it defensively only deletes names that do not start with an underscore, making a (not unreasoned) assumption that you only used names without an underscore at the start in your interpreter. You could use a hard-coded list of names to keep instead (whitelisting) if you really wanted to be thorough. There is no built-in function to do the clearing for you, other than just exit and restart the interpreter.

Modules you’ve imported (import os) are going to remain imported because they are referenced by sys.modules; subsequent imports will reuse the already imported module object. You just won’t have a reference to them in your current global namespace.
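A quick way to see this, as a small sketch: deleting the name does not unload the module, and a later import simply re-binds the cached module object.

```python
import sys
import os

del os                         # unbinds the name in this namespace...
assert 'os' in sys.modules     # ...but the module object stays cached

import os                      # a re-import is just a cache lookup
assert os is sys.modules['os']
print("os survived in sys.modules")
```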


Answer 1

Yes. There is a simple way to remove everything in iPython. In iPython console, just type:

%reset

Then system will ask you to confirm. Press y. If you don’t want to see this prompt, simply type:

%reset -f

This should work.


Answer 2

You can use the Python garbage collector:

import gc
gc.collect()
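Note that gc.collect() does not remove names from your namespace; it reclaims unreachable objects, which matters mainly for reference cycles. A small sketch of what it actually does:

```python
import gc

class Node:
    def __init__(self):
        self.ref = self    # a reference cycle: refcounting alone cannot free this

n = Node()
del n                      # the name is gone, but the cycle keeps the object alive
collected = gc.collect()   # the cycle collector finds and frees it
print(collected)           # at least 1 unreachable object found
```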

Answer 3

If you are in an interactive environment like Jupyter or ipython, you might be interested in clearing unwanted variables if they are getting heavy.

The magic commands reset and reset_selective are available in interactive Python sessions like ipython and Jupyter.

1) reset

reset resets the namespace by removing all names defined by the user, if called without arguments.

The in and out parameters specify whether you want to flush the input/output caches. The directory history is flushed with the dhist parameter.

reset in out

Another interesting option is array, which removes only numpy arrays:

reset array

2) reset_selective

Resets the namespace by removing names defined by the user. Input/output history is left around in case you need it.

Clean Array Example:

In [1]: import numpy as np
In [2]: littleArray = np.array([1,2,3,4,5])
In [3]: who_ls
Out[3]: ['littleArray', 'np']
In [4]: reset_selective -f littleArray
In [5]: who_ls
Out[5]: ['np']

Source: http://ipython.readthedocs.io/en/stable/interactive/magics.html


Answer 4

This worked for me.

You need to run it twice: once for globals, then once for locals.

for name in dir():
    if not name.startswith('_'):
        del globals()[name]

for name in dir():
    if not name.startswith('_'):
        del locals()[name]

Answer 5

Actually, Python reclaims memory that is no longer in use. This is called garbage collection, and it is an automatic process in Python. But if you still want to do it, you can delete a variable with del variable_name. You can also do it by assigning the variable to None:

a = 10
print(a)

del a
print(a)  # throws a NameError here because it's been deleted already

The only way to truly reclaim memory from unreferenced Python objects is via the garbage collector. The del keyword simply unbinds a name from an object, but the object still needs to be garbage collected. You can force the garbage collector to run using the gc module, but that is almost certainly a premature optimization, and it carries its own risks. In most code, using del has no real effect, since those names would have been deleted as they went out of scope anyway.
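To illustrate the "del only unbinds a name" point, a small CPython-specific sketch using reference counts:

```python
import sys

data = [1, 2, 3]
alias = data                        # two names bound to the same list object
refs_before = sys.getrefcount(data)

del alias                           # unbinds one name; the object survives
refs_after = sys.getrefcount(data)

print(refs_before, refs_after)      # in CPython the count drops by exactly one
```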


Tool to convert Python code to be PEP8 compliant

Question: Tool to convert Python code to be PEP8 compliant

I know there are tools which validate whether your Python code is compliant with PEP8, for example there is both an online service and a python module.

However, I cannot find a service or module which can convert my Python file to a self-contained, PEP8 valid Python file. Does anyone know if there are any?
I assume it’s feasible since PEP8 is all about the appearance of the code, right?


Answer 0

Unfortunately “pep8 storming” (the entire project) has several negative side-effects:

  • lots of merge conflicts
  • breaks git blame
  • makes code review difficult

As an alternative (and thanks to @y-p for the idea), I wrote a small package which autopep8s only those lines which you have been working on since the last commit/branch:

Basically leaving the project a little better than you found it:

pip install pep8radius

Suppose you’ve done your work off of master and are ready to commit:

# be somewhere in your project directory
# see the diff with pep, see the changes you've made since master
pep8radius master --diff
# make those changes
pep8radius master --diff --in-place

Or to clean the new lines you’ve commited since the last commit:

pep8radius --diff
pep8radius --diff --in-place

# the lines which changed since a specific commit `git diff 98f51f`
pep8radius 98f51f --diff

Basically pep8radius is applying autopep8 to lines in the output of git/hg diff (from the last shared commit).

This script currently works with git and hg; if you're using something else and want this to work, please post a comment/issue/PR!


Answer 1

You can use autopep8! Whilst you make yourself a cup of coffee this tool happily removes all those pesky PEP8 violations which don’t change the meaning of the code.

Install it via pip:

pip install autopep8

Apply this to a specific file:

autopep8 py_file --in-place

or to your project (recursively), the verbose option gives you some feedback of how it’s going:

autopep8 project_dir --recursive --in-place --pep8-passes 2000 --verbose

Note: Sometimes the default of 100 passes isn't enough; I set it to 2000 since it's reasonably high and will catch all but the most troublesome files (it stops passing once it finds no resolvable pep8 infractions)…

At this point I suggest retesting and doing a commit!

If you want “full” PEP8 compliance: one tactic I’ve used is to run autopep8 as above, then run PEP8, which prints the remaining violations (file, line number, and what):

pep8 project_dir --ignore=E501

and manually change these individually (e.g. E712s – comparison with boolean).

Note: autopep8 offers an --aggressive argument (to ruthlessly “fix” these meaning-changing violations), but beware if you do use aggressive you may have to debug… (e.g. in numpy/pandas True == np.bool_(True) but not True is np.bool_(True)!)

You can check how many violations of each type (before and after):

pep8 --quiet --statistics .

Note: I consider E501s (line too long) are a special case as there will probably be a lot of these in your code and sometimes these are not corrected by autopep8.

As an example, I applied this technique to the pandas code base.


Answer 2

@Andy Hayden gave a good overview of autopep8. In addition, there is one more package called pep8ify which does the same thing.

However, both packages can only remove lint errors; they cannot reformat code.

little = more[3:   5]

The code above remains the same even after pep8ifying, but it still doesn't look good. You can use a formatter like yapf, which will format the code even if it is already PEP8 compliant. The code above will be formatted to

little = more[3:5]

Sometimes this even destroys your manual formatting. For example,

BAZ = {
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12]
}

will be converted to

BAZ = {[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]}

But you can tell it to ignore some parts:

BAZ = {
   [1, 2, 3, 4],
   [5, 6, 7, 8],
   [9, 10, 11, 12]
}  # yapf: disable

Taken from my old blog post: Automatically PEP8 & Format Your Python Code!


Answer 3

If you’re using eclipse + PyDev you can simply activate autopep8 from PyDev’s settings: Windows -> Preferences -> type ‘autopep8’ in the search filter.

Check the ‘use autopep8.py for code formatting?’ -> OK

Now eclipse’s CTRL-SHIFT-F code formatting should format your code using autopep8 :)


Answer 4

I did some broad research into the different tools for Python code style. There are two types of tools: linters, which analyze your code, warn about badly styled code and show advice on how to fix it, and code formatters, which re-format your document in PEP8 style when you save the file.

Because re-formatting must be more accurate (if it reformats something you don't want, it becomes useless), formatters cover a smaller part of PEP8; linters report much more.

They also differ in how configurable they are: for example, pylint is configurable in all its rules (you can turn every type of warning on or off), while black is not configurable at all.

Here are some useful links and tutorials:

Documentation:

Linters (in order of popularity):

Code formatters (in order of popularity):


Answer 5

There are many.

IDEs usually have some formatting capability built in. IntelliJ Idea / PyCharm does, same goes for the Python plugin for Eclipse, and so on.

There are formatters/linters that can target multiple languages. https://coala.io is a good example of those.

Then there are the single purpose tools, of which many are mentioned in other answers.

One specific method of automatic reformatting is to parse the file into AST tree (without dropping comments) and then dump it back to text (meaning nothing of the original formatting is preserved). Example of that would be https://github.com/python/black.
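As a rough sketch of that parse-and-dump idea, using only the standard ast module (which, unlike black's own parser, does drop comments; ast.unparse requires Python 3.9+):

```python
import ast

messy = "x=1+  2\ny =  [ 1,2 ,3 ]"
tree = ast.parse(messy)       # the original formatting is discarded here
clean = ast.unparse(tree)     # text is regenerated purely from the AST
print(clean)
```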


Multiprocessing: How do I share a dict among multiple processes?

Question: Multiprocessing: How do I share a dict among multiple processes?

A program creates several processes that work on a joinable queue, Q, and may eventually manipulate a global dictionary, D, to store results (so each child process can use D to store its own result and also see what results the other child processes are producing).

If I print the dictionary D in a child process, I see the modifications that have been done on it (i.e. on D). But after the main process joins Q, if I print D, it’s an empty dict!

I understand it is a synchronization/lock issue. Can someone tell me what is happening here, and how I can synchronize access to D?


Answer 0

A general answer involves using a Manager object. Adapted from the docs:

from multiprocessing import Process, Manager

def f(d):
    d[1] += '1'
    d['2'] += 2

if __name__ == '__main__':
    manager = Manager()

    d = manager.dict()
    d[1] = '1'
    d['2'] = 2

    p1 = Process(target=f, args=(d,))
    p2 = Process(target=f, args=(d,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()

    print(d)

Output:

$ python mul.py 
{1: '111', '2': 6}

Answer 1

multiprocessing is not like threading. Each child process will get a copy of the main process’s memory. Generally state is shared via communication (pipes/sockets), signals, or shared memory.

Multiprocessing makes some abstractions available for your use case – shared state that’s treated as local by use of proxies or shared memory: http://docs.python.org/library/multiprocessing.html#sharing-state-between-processes



Answer 2

I’d like to share my own work: it is faster than Manager’s dict, and simpler and more stable than the pyshmht library (which uses tons of memory and doesn’t work on Mac OS). Though my dict only works for plain strings and is immutable for now. I use a linear-probing implementation and store the key/value pairs in a separate memory block after the table. (Note: the code below is Python 2.)

from mmap import mmap
import struct
from timeit import default_timer
from multiprocessing import Manager
from pyshmht import HashTable


class shared_immutable_dict:
    def __init__(self, a):
        self.hs = 1 << (len(a) * 3).bit_length()
        kvp = self.hs * 4
        ht = [0xffffffff] * self.hs
        kvl = []
        for k, v in a.iteritems():
            h = self.hash(k)
            while ht[h] != 0xffffffff:
                h = (h + 1) & (self.hs - 1)
            ht[h] = kvp
            kvp += self.kvlen(k) + self.kvlen(v)
            kvl.append(k)
            kvl.append(v)

        self.m = mmap(-1, kvp)
        for p in ht:
            self.m.write(uint_format.pack(p))
        for x in kvl:
            if len(x) <= 0x7f:
                self.m.write_byte(chr(len(x)))
            else:
                self.m.write(uint_format.pack(0x80000000 + len(x)))
            self.m.write(x)

    def hash(self, k):
        h = hash(k)
        h = (h + (h >> 3) + (h >> 13) + (h >> 23)) * 1749375391 & (self.hs - 1)
        return h

    def get(self, k, d=None):
        h = self.hash(k)
        while True:
            x = uint_format.unpack(self.m[h * 4:h * 4 + 4])[0]
            if x == 0xffffffff:
                return d
            self.m.seek(x)
            if k == self.read_kv():
                return self.read_kv()
            h = (h + 1) & (self.hs - 1)

    def read_kv(self):
        sz = ord(self.m.read_byte())
        if sz & 0x80:
            sz = uint_format.unpack(chr(sz) + self.m.read(3))[0] - 0x80000000
        return self.m.read(sz)

    def kvlen(self, k):
        return len(k) + (1 if len(k) <= 0x7f else 4)

    def __contains__(self, k):
        return self.get(k, None) is not None

    def close(self):
        self.m.close()

uint_format = struct.Struct('>I')


def uget(a, k, d=None):
    return to_unicode(a.get(to_str(k), d))


def uin(a, k):
    return to_str(k) in a


def to_unicode(s):
    return s.decode('utf-8') if isinstance(s, str) else s


def to_str(s):
    return s.encode('utf-8') if isinstance(s, unicode) else s


def mmap_test():
    n = 1000000
    d = shared_immutable_dict({str(i * 2): '1' for i in xrange(n)})
    start_time = default_timer()
    for i in xrange(n):
        if bool(d.get(str(i))) != (i % 2 == 0):
            raise Exception(i)
    print 'mmap speed: %d gets per sec' % (n / (default_timer() - start_time))


def manager_test():
    n = 100000
    d = Manager().dict({str(i * 2): '1' for i in xrange(n)})
    start_time = default_timer()
    for i in xrange(n):
        if bool(d.get(str(i))) != (i % 2 == 0):
            raise Exception(i)
    print 'manager speed: %d gets per sec' % (n / (default_timer() - start_time))


def shm_test():
    n = 1000000
    d = HashTable('tmp', n)
    d.update({str(i * 2): '1' for i in xrange(n)})
    start_time = default_timer()
    for i in xrange(n):
        if bool(d.get(str(i))) != (i % 2 == 0):
            raise Exception(i)
    print 'shm speed: %d gets per sec' % (n / (default_timer() - start_time))


if __name__ == '__main__':
    mmap_test()
    manager_test()
    shm_test()

在我的笔记本电脑上,性能结果是:

mmap speed: 247288 gets per sec
manager speed: 33792 gets per sec
shm speed: 691332 gets per sec

简单用法示例:

ht = shared_immutable_dict({'a': '1', 'b': '2'})
print ht.get('a')

I’d like to share my own work that is faster than Manager’s dict and is simpler and more stable than pyshmht library that uses tons of memory and doesn’t work for Mac OS. Though my dict only works for plain strings and is immutable currently. I use linear probing implementation and store keys and values pairs in a separate memory block after the table.

from mmap import mmap
import struct
from timeit import default_timer
from multiprocessing import Manager
from pyshmht import HashTable


class shared_immutable_dict:
    def __init__(self, a):
        self.hs = 1 << (len(a) * 3).bit_length()
        kvp = self.hs * 4
        ht = [0xffffffff] * self.hs
        kvl = []
        for k, v in a.iteritems():
            h = self.hash(k)
            while ht[h] != 0xffffffff:
                h = (h + 1) & (self.hs - 1)
            ht[h] = kvp
            kvp += self.kvlen(k) + self.kvlen(v)
            kvl.append(k)
            kvl.append(v)

        self.m = mmap(-1, kvp)
        for p in ht:
            self.m.write(uint_format.pack(p))
        for x in kvl:
            if len(x) <= 0x7f:
                self.m.write_byte(chr(len(x)))
            else:
                self.m.write(uint_format.pack(0x80000000 + len(x)))
            self.m.write(x)

    def hash(self, k):
        h = hash(k)
        h = (h + (h >> 3) + (h >> 13) + (h >> 23)) * 1749375391 & (self.hs - 1)
        return h

    def get(self, k, d=None):
        h = self.hash(k)
        while True:
            x = uint_format.unpack(self.m[h * 4:h * 4 + 4])[0]
            if x == 0xffffffff:
                return d
            self.m.seek(x)
            if k == self.read_kv():
                return self.read_kv()
            h = (h + 1) & (self.hs - 1)

    def read_kv(self):
        sz = ord(self.m.read_byte())
        if sz & 0x80:
            sz = uint_format.unpack(chr(sz) + self.m.read(3))[0] - 0x80000000
        return self.m.read(sz)

    def kvlen(self, k):
        return len(k) + (1 if len(k) <= 0x7f else 4)

    def __contains__(self, k):
        return self.get(k, None) is not None

    def close(self):
        self.m.close()

uint_format = struct.Struct('>I')


def uget(a, k, d=None):
    return to_unicode(a.get(to_str(k), d))


def uin(a, k):
    return to_str(k) in a


def to_unicode(s):
    return s.decode('utf-8') if isinstance(s, str) else s


def to_str(s):
    return s.encode('utf-8') if isinstance(s, unicode) else s


def mmap_test():
    n = 1000000
    d = shared_immutable_dict({str(i * 2): '1' for i in xrange(n)})
    start_time = default_timer()
    for i in xrange(n):
        if bool(d.get(str(i))) != (i % 2 == 0):
            raise Exception(i)
    print 'mmap speed: %d gets per sec' % (n / (default_timer() - start_time))


def manager_test():
    n = 100000
    d = Manager().dict({str(i * 2): '1' for i in xrange(n)})
    start_time = default_timer()
    for i in xrange(n):
        if bool(d.get(str(i))) != (i % 2 == 0):
            raise Exception(i)
    print 'manager speed: %d gets per sec' % (n / (default_timer() - start_time))


def shm_test():
    n = 1000000
    d = HashTable('tmp', n)
    d.update({str(i * 2): '1' for i in xrange(n)})
    start_time = default_timer()
    for i in xrange(n):
        if bool(d.get(str(i))) != (i % 2 == 0):
            raise Exception(i)
    print 'shm speed: %d gets per sec' % (n / (default_timer() - start_time))


if __name__ == '__main__':
    mmap_test()
    manager_test()
    shm_test()

On my laptop performance results are:

mmap speed: 247288 gets per sec
manager speed: 33792 gets per sec
shm speed: 691332 gets per sec

simple usage example:

ht = shared_immutable_dict({'a': '1', 'b': '2'})
print ht.get('a')

回答 3

除了这里@senderle的回答之外,有些人可能还想知道如何配合使用multiprocessing.Pool的功能。

令人高兴的是,manager实例上有一个.Pool()方法,它模拟了顶层multiprocessing所有熟悉的API。

from itertools import repeat
import multiprocessing as mp
import os
import pprint

def f(d: dict) -> None:
    pid = os.getpid()
    d[pid] = "Hi, I was written by process %d" % pid

if __name__ == '__main__':
    with mp.Manager() as manager:
        d = manager.dict()
        with manager.Pool() as pool:
            pool.map(f, repeat(d, 10))
        # `d` is a DictProxy object that can be converted to dict
        pprint.pprint(dict(d))

输出:

$ python3 mul.py 
{22562: 'Hi, I was written by process 22562',
 22563: 'Hi, I was written by process 22563',
 22564: 'Hi, I was written by process 22564',
 22565: 'Hi, I was written by process 22565',
 22566: 'Hi, I was written by process 22566',
 22567: 'Hi, I was written by process 22567',
 22568: 'Hi, I was written by process 22568',
 22569: 'Hi, I was written by process 22569',
 22570: 'Hi, I was written by process 22570',
 22571: 'Hi, I was written by process 22571'}

这是一个稍有不同的示例,其中每个进程仅将其进程ID记录到全局DictProxy对象d中。

In addition to @senderle’s here, some might also be wondering how to use the functionality of multiprocessing.Pool.

The nice thing is that there is a .Pool() method to the manager instance that mimics all the familiar API of the top-level multiprocessing.

from itertools import repeat
import multiprocessing as mp
import os
import pprint

def f(d: dict) -> None:
    pid = os.getpid()
    d[pid] = "Hi, I was written by process %d" % pid

if __name__ == '__main__':
    with mp.Manager() as manager:
        d = manager.dict()
        with manager.Pool() as pool:
            pool.map(f, repeat(d, 10))
        # `d` is a DictProxy object that can be converted to dict
        pprint.pprint(dict(d))

Output:

$ python3 mul.py 
{22562: 'Hi, I was written by process 22562',
 22563: 'Hi, I was written by process 22563',
 22564: 'Hi, I was written by process 22564',
 22565: 'Hi, I was written by process 22565',
 22566: 'Hi, I was written by process 22566',
 22567: 'Hi, I was written by process 22567',
 22568: 'Hi, I was written by process 22568',
 22569: 'Hi, I was written by process 22569',
 22570: 'Hi, I was written by process 22570',
 22571: 'Hi, I was written by process 22571'}

This is a slightly different example where each process just logs its process ID to the global DictProxy object d.


回答 4

也许您可以尝试pyshmht,一个为Python设计的基于共享内存的哈希表扩展。

注意

  1. 尚未经过全面测试,仅供参考。

  2. 当前,它缺乏用于多处理的锁/信号量机制。

Maybe you can try pyshmht, sharing memory based hash table extension for Python.

Notice

  1. It’s not fully tested, just for your reference.

  2. It currently lacks lock/sem mechanisms for multiprocessing.
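Since the extension lacks locking, a common workaround is to pair whatever shared structure you use with an explicit `multiprocessing` lock. A sketch using a `Manager` dict (the key name and process count are made up for illustration; with pyshmht itself you would guard its calls the same way):

```python
from multiprocessing import Manager, Process

def bump(d, lock, key):
    # guard the read-modify-write so concurrent updates don't race
    with lock:
        d[key] = d.get(key, 0) + 1

if __name__ == '__main__':
    with Manager() as manager:
        d = manager.dict()
        lock = manager.Lock()
        procs = [Process(target=bump, args=(d, lock, 'hits')) for _ in range(8)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(dict(d))  # {'hits': 8}
```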


Python中是否有`string.split()`的生成器版本?

问题:Python中是否有`string.split()`的生成器版本?

string.split()返回一个列表实例。是否有返回生成器的版本?是否有任何理由不提供生成器版本?

string.split() returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?
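For reference, the eager behaviour the question describes is easy to observe: `str.split` builds the whole list before you consume anything, which is what the generator-based answers below avoid:

```python
s = "spam eggs ham"

parts = s.split()        # eager: the full list exists immediately
print(type(parts).__name__, parts)

lazy = (w for w in parts)  # a generator, by contrast, is consumed item by item
print(next(lazy))
```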


回答 0

re.finditer的内存开销很可能相当小。

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

演示:

>>> list( split_iter("A programmer's RegEx test.") )
['A', "programmer's", 'RegEx', 'test']

编辑:我刚刚确认,假设我的测试方法正确,这在python 3.2.1中占用恒定的内存。我创建了一个非常大的字符串(大约1GB),然后用for循环遍历该可迭代对象(不是列表推导式,那会产生额外的内存)。这没有导致内存的显著增长(也就是说,即便内存有所增长,也远远少于1GB的字符串)。

It is highly probable that re.finditer uses fairly minimal memory overhead.

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

Demo:

>>> list( split_iter("A programmer's RegEx test.") )
['A', "programmer's", 'RegEx', 'test']

edit: I have just confirmed that this takes constant memory in python 3.2.1, assuming my testing methodology was correct. I created a string of very large size (1GB or so), then iterated through the iterable with a for loop (NOT a list comprehension, which would have generated extra memory). This did not result in a noticeable growth of memory (that is, if there was a growth in memory, it was far far less than the 1GB string).


回答 1

我能想到的最高效的方法,是利用str.find()方法的offset参数来编写一个。这样可以避免大量内存占用,并在不需要时省去正则表达式的开销。

[编辑2016-8-2:已对此进行更新,可以选择支持正则表达式分隔符]

import re

def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()

    :param source:
        source string (unicode or bytes)

    :param sep:
        separator to split on.

    :param regex:
        if True, will treat sep as regular expression.

    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior
        source = source.strip()
        sep = "\\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize

可以根据需要使用…

>>> print list(isplit("abcb","b"))
['a','c','']

每次执行find()或切片时,在字符串中都需要花费一点成本,但这应该是最小的,因为字符串被表示为内存中的连续数组。

The most efficient way I can think of is to write one using the offset parameter of the str.find() method. This avoids lots of memory use, and relying on the overhead of a regexp when it’s not needed.

[edit 2016-8-2: updated this to optionally support regex separators]

import re

def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()

    :param source:
        source string (unicode or bytes)

    :param sep:
        separator to split on.

    :param regex:
        if True, will treat sep as regular expression.

    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior
        source = source.strip()
        sep = "\\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize

This can be used like you want…

>>> print list(isplit("abcb","b"))
['a','c','']

While there is a little bit of cost seeking within the string each time find() or slicing is performed, this should be minimal since strings are represented as contiguous arrays in memory.


回答 2

这是通过re.search()实现的split()生成器版本,它不存在分配太多子字符串的问题。

import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()


sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["

assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')

编辑:纠正了在未给出分隔符时对首尾空白的处理。

This is generator version of split() implemented via re.search() that does not have the problem of allocating too many substrings.

import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()


sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["

assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')

EDIT: Corrected handling of surrounding whitespace if no separator chars are given.


回答 3

对提出的各种方法进行了性能测试(我在这里不再重复)。一些结果:

  • str.split (默认)= 0.3461570239996945
  • 手动搜索(按字符)(Dave Webb的答案之一)= 0.8260340550004912
  • re.finditer (ninjagecko的答案)= 0.698872097000276
  • str.find (Eli Collins的答案之一)= 0.7230395330007013
  • itertools.takewhile (伊格纳西奥·巴斯克斯(Ignacio Vazquez-Abrams)的答案)= 2.023023967998597
  • str.split(..., maxsplit=1) 递归 = N/A†

† 递归答案(带maxsplit=1的string.split)未能在合理的时间内完成;鉴于string.split的速度,它们在较短的字符串上可能表现更好,但对于内存不成问题的短字符串,我本来也看不出有什么用例。

使用timeit在以下代码上测试:

the_text = "100 " * 9999 + "100"

def test_function( method ):
    def fn( ):
        total = 0

        for x in method( the_text ):
            total += int( x )

        return total

    return fn

这就提出了另一个问题:为什么string.split尽管占用更多内存,速度却如此之快。

Did some performance testing on the various methods proposed (I won’t repeat them here). Some results:

  • str.split (default) = 0.3461570239996945
  • manual search (by character) (one of Dave Webb’s answer’s) = 0.8260340550004912
  • re.finditer (ninjagecko’s answer) = 0.698872097000276
  • str.find (one of Eli Collins’s answers) = 0.7230395330007013
  • itertools.takewhile (Ignacio Vazquez-Abrams’s answer) = 2.023023967998597
  • str.split(..., maxsplit=1) recursion = N/A†

†The recursion answers (string.split with maxsplit=1) fail to complete in a reasonable time. Given string.split's speed they may work better on shorter strings, but then I can’t see the use-case for short strings where memory isn’t an issue anyway.

Tested using timeit on:

the_text = "100 " * 9999 + "100"

def test_function( method ):
    def fn( ):
        total = 0

        for x in method( the_text ):
            total += int( x )

        return total

    return fn

This raises another question as to why string.split is so much faster despite its memory usage.


回答 4

这是我的实现,它比这里的其他答案快得多,也更完整。它针对不同情况提供了4个单独的子函数。

我只在此复制主函数str_split的文档字符串:


str_split(s, *delims, empty=None)

用其余的参数分割字符串s,并可省略空的部分(由empty关键字参数控制)。这是一个生成器函数。

如果仅提供一个定界符,则字符串将直接按它分割。此时empty默认为True。

str_split('[]aaa[][]bb[c', '[]')
    -> '', 'aaa', '', 'bb[c'
str_split('[]aaa[][]bb[c', '[]', empty=False)
    -> 'aaa', 'bb[c'

如果提供了多个定界符,则默认情况下,字符串将按这些定界符的最长可能序列进行拆分;或者,如果将empty设置为True,则定界符之间的空字符串也会包含在内。注意,在这种情况下,定界符只能是单个字符。

str_split('aaa, bb : c;', ' ', ',', ':', ';')
    -> 'aaa', 'bb', 'c'
str_split('aaa, bb : c;', *' ,:;', empty=True)
    -> 'aaa', '', 'bb', '', '', 'c', ''

如果未提供定界符,则使用string.whitespace,因此效果与str.split()相同,区别在于此函数是一个生成器。

str_split('aaa\\t  bb c \\n')
    -> 'aaa', 'bb', 'c'

import string

def _str_split_chars(s, delims):
    "Split the string `s` by characters contained in `delims`, including the \
    empty parts between two consecutive delimiters"
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i+1
    yield s[start:]

def _str_split_chars_ne(s, delims):
    "Split the string `s` by longest possible sequences of characters \
    contained in `delims`"
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]


def _str_split_word(s, delim):
    "Split the string `s` by the string `delim`"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    yield s[start:]

def _str_split_word_ne(s, delim):
    "Split the string `s` by the string `delim`, not including empty parts \
    between two consecutive delimiters"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start!=i:
                yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    if start<len(s):
        yield s[start:]


def str_split(s, *delims, empty=None):
    """\
Split the string `s` by the rest of the arguments, possibly omitting
empty parts (`empty` keyword argument is responsible for that).
This is a generator function.

When only one delimiter is supplied, the string is simply split by it.
`empty` is then `True` by default.
    str_split('[]aaa[][]bb[c', '[]')
        -> '', 'aaa', '', 'bb[c'
    str_split('[]aaa[][]bb[c', '[]', empty=False)
        -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest
possible sequences of those delimiters by default, or, if `empty` is set to
`True`, empty strings between the delimiters are also included. Note that
the delimiters in this case may only be single characters.
    str_split('aaa, bb : c;', ' ', ',', ':', ';')
        -> 'aaa', 'bb', 'c'
    str_split('aaa, bb : c;', *' ,:;', empty=True)
        -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, `string.whitespace` is used, so the effect
is the same as `str.split()`, except this function is a generator.
    str_split('aaa\\t  bb c \\n')
        -> 'aaa', 'bb', 'c'
"""
    if len(delims)==1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
    if len(delims)==0:
        delims = string.whitespace
    delims = set(delims) if len(delims)>=4 else ''.join(delims)
    if any(len(d)>1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)

该函数可在Python 3中使用;通过一个简单(虽然相当难看)的修改,即可使其同时兼容Python 2和3。函数的开头几行应改为:

def str_split(s, *delims, **kwargs):
    """...docstring..."""
    empty = kwargs.get('empty')

Here is my implementation, which is much, much faster and more complete than the other answers here. It has 4 separate subfunctions for different cases.

I’ll just copy the docstring of the main str_split function:


str_split(s, *delims, empty=None)

Split the string s by the rest of the arguments, possibly omitting empty parts (empty keyword argument is responsible for that). This is a generator function.

When only one delimiter is supplied, the string is simply split by it. empty is then True by default.

str_split('[]aaa[][]bb[c', '[]')
    -> '', 'aaa', '', 'bb[c'
str_split('[]aaa[][]bb[c', '[]', empty=False)
    -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest possible sequences of those delimiters by default, or, if empty is set to True, empty strings between the delimiters are also included. Note that the delimiters in this case may only be single characters.

str_split('aaa, bb : c;', ' ', ',', ':', ';')
    -> 'aaa', 'bb', 'c'
str_split('aaa, bb : c;', *' ,:;', empty=True)
    -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, string.whitespace is used, so the effect is the same as str.split(), except this function is a generator.

str_split('aaa\\t  bb c \\n')
    -> 'aaa', 'bb', 'c'

import string

def _str_split_chars(s, delims):
    "Split the string `s` by characters contained in `delims`, including the \
    empty parts between two consecutive delimiters"
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i+1
    yield s[start:]

def _str_split_chars_ne(s, delims):
    "Split the string `s` by longest possible sequences of characters \
    contained in `delims`"
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]


def _str_split_word(s, delim):
    "Split the string `s` by the string `delim`"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    yield s[start:]

def _str_split_word_ne(s, delim):
    "Split the string `s` by the string `delim`, not including empty parts \
    between two consecutive delimiters"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start!=i:
                yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    if start<len(s):
        yield s[start:]


def str_split(s, *delims, empty=None):
    """\
Split the string `s` by the rest of the arguments, possibly omitting
empty parts (`empty` keyword argument is responsible for that).
This is a generator function.

When only one delimiter is supplied, the string is simply split by it.
`empty` is then `True` by default.
    str_split('[]aaa[][]bb[c', '[]')
        -> '', 'aaa', '', 'bb[c'
    str_split('[]aaa[][]bb[c', '[]', empty=False)
        -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest
possible sequences of those delimiters by default, or, if `empty` is set to
`True`, empty strings between the delimiters are also included. Note that
the delimiters in this case may only be single characters.
    str_split('aaa, bb : c;', ' ', ',', ':', ';')
        -> 'aaa', 'bb', 'c'
    str_split('aaa, bb : c;', *' ,:;', empty=True)
        -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, `string.whitespace` is used, so the effect
is the same as `str.split()`, except this function is a generator.
    str_split('aaa\\t  bb c \\n')
        -> 'aaa', 'bb', 'c'
"""
    if len(delims)==1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
    if len(delims)==0:
        delims = string.whitespace
    delims = set(delims) if len(delims)>=4 else ''.join(delims)
    if any(len(d)>1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)

This function works in Python 3, and an easy, though quite ugly, fix can be applied to make it work in both 2 and 3 versions. The first lines of the function should be changed to:

def str_split(s, *delims, **kwargs):
    """...docstring..."""
    empty = kwargs.get('empty')

回答 5

没有,但是用itertools.takewhile()编写一个应该足够容易。

编辑:

非常简单、不完善的实现:

import itertools
import string

def isplitwords(s):
  i = iter(s)
  while True:
    r = []
    for c in itertools.takewhile(lambda x: not x in string.whitespace, i):
      r.append(c)
    else:
      if r:
        yield ''.join(r)
        continue
      else:
        raise StopIteration()

No, but it should be easy enough to write one using itertools.takewhile().

EDIT:

Very simple, half-broken implementation:

import itertools
import string

def isplitwords(s):
  i = iter(s)
  while True:
    r = []
    for c in itertools.takewhile(lambda x: not x in string.whitespace, i):
      r.append(c)
    else:
      if r:
        yield ''.join(r)
        continue
      else:
        raise StopIteration()

回答 6

我看不出split()的生成器版本有任何明显的好处。生成器对象必须持有整个字符串才能迭代,因此使用生成器并不能节省任何内存。

如果您想编写一个,那将很容易:

import string

def gsplit(s,sep=string.whitespace):
    word = []

    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)

    if word:
        yield "".join(word)

I don’t see any obvious benefit to a generator version of split(). The generator object is going to have to contain the whole string to iterate over so you’re not going to save any memory by having a generator.

If you wanted to write one it would be fairly easy though:

import string

def gsplit(s,sep=string.whitespace):
    word = []

    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)

    if word:
        yield "".join(word)

回答 7

我写了一个@ninjagecko答案的版本,其行为更类似于string.split(即默认情况下用空格定界,您可以指定定界符)。

import re

def isplit(string, delimiter = None):
    """Like string.split but returns an iterator (lazy)

    Multiple character delimiters are not handled.
    """

    if delimiter is None:
        # Whitespace delimited by default
        delim = r"\s"

    elif len(delimiter) != 1:
        raise ValueError("Can only handle single character delimiters",
                        delimiter)

    else:
        # Escape, in case it's "\", "*" etc.
        delim = re.escape(delimiter)

    return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))

这是我使用的测试(在python 3和python 2中):

# Wrapper to make it a list
def helper(*args,  **kwargs):
    return list(isplit(*args, **kwargs))

# Normal delimiters
assert helper("1,2,3", ",") == ["1", "2", "3"]
assert helper("1;2;3,", ";") == ["1", "2", "3,"]
assert helper("1;2 ;3,  ", ";") == ["1", "2 ", "3,  "]

# Whitespace
assert helper("1 2 3") == ["1", "2", "3"]
assert helper("1\t2\t3") == ["1", "2", "3"]
assert helper("1\t2 \t3") == ["1", "2", "3"]
assert helper("1\n2\n3") == ["1", "2", "3"]

# Surrounding whitespace dropped
assert helper(" 1 2  3  ") == ["1", "2", "3"]

# Regex special characters
assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
assert helper(r"1*2*3", "*") == ["1", "2", "3"]

# No multi-char delimiters allowed
try:
    helper(r"1,.2,.3", ",.")
    assert False
except ValueError:
    pass

python的regex模块声称对unicode空格做了“正确的事”,但我实际上尚未测试过。

也可以作为gist获取。

I wrote a version of @ninjagecko’s answer that behaves more like string.split (i.e. whitespace delimited by default and you can specify a delimiter).

import re

def isplit(string, delimiter = None):
    """Like string.split but returns an iterator (lazy)

    Multiple character delimiters are not handled.
    """

    if delimiter is None:
        # Whitespace delimited by default
        delim = r"\s"

    elif len(delimiter) != 1:
        raise ValueError("Can only handle single character delimiters",
                        delimiter)

    else:
        # Escape, in case it's "\", "*" etc.
        delim = re.escape(delimiter)

    return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))

Here are the tests I used (in both python 3 and python 2):

# Wrapper to make it a list
def helper(*args,  **kwargs):
    return list(isplit(*args, **kwargs))

# Normal delimiters
assert helper("1,2,3", ",") == ["1", "2", "3"]
assert helper("1;2;3,", ";") == ["1", "2", "3,"]
assert helper("1;2 ;3,  ", ";") == ["1", "2 ", "3,  "]

# Whitespace
assert helper("1 2 3") == ["1", "2", "3"]
assert helper("1\t2\t3") == ["1", "2", "3"]
assert helper("1\t2 \t3") == ["1", "2", "3"]
assert helper("1\n2\n3") == ["1", "2", "3"]

# Surrounding whitespace dropped
assert helper(" 1 2  3  ") == ["1", "2", "3"]

# Regex special characters
assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
assert helper(r"1*2*3", "*") == ["1", "2", "3"]

# No multi-char delimiters allowed
try:
    helper(r"1,.2,.3", ",.")
    assert False
except ValueError:
    pass

python’s regex module says that it does “the right thing” for unicode whitespace, but I haven’t actually tested it.

Also available as a gist.


回答 8

如果您还希望能够读取迭代器(以及返回一个迭代器),请尝试以下操作:

import itertools as it

def iter_split(string, sep=None):
    sep = sep or ' '
    groups = it.groupby(string, lambda s: s != sep)
    return (''.join(g) for k, g in groups if k)

用法

>>> list(iter_split(iter("Good evening, world!")))
['Good', 'evening,', 'world!']

If you would also like to be able to read an iterator (as well as return one) try this:

import itertools as it

def iter_split(string, sep=None):
    sep = sep or ' '
    groups = it.groupby(string, lambda s: s != sep)
    return (''.join(g) for k, g in groups if k)

Usage

>>> list(iter_split(iter("Good evening, world!")))
['Good', 'evening,', 'world!']
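Because itertools.groupby consumes any iterable of characters, the same function also works on a character stream that never existed as a single string, for example chained chunks (the chunking below is illustrative):

```python
import itertools as it

def iter_split(string, sep=None):
    # group consecutive non-separator characters and join each run
    sep = sep or ' '
    groups = it.groupby(string, lambda s: s != sep)
    return (''.join(g) for k, g in groups if k)

# feed the splitter a lazy stream of characters built from chunks
chunks = ["Good eve", "ning, wo", "rld!"]
stream = it.chain.from_iterable(chunks)
print(list(iter_split(stream)))  # ['Good', 'evening,', 'world!']
```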

回答 9

more_itertools.split_at为迭代器提供了str.split的模拟。

>>> import more_itertools as mit


>>> list(mit.split_at("abcdcba", lambda x: x == "b"))
[['a'], ['c', 'd', 'c'], ['a']]

>>> "abcdcba".split("b")
['a', 'cdc', 'a']

more_itertools 是第三方软件包。

more_itertools.split_at offers an analog to str.split for iterators.

>>> import more_itertools as mit


>>> list(mit.split_at("abcdcba", lambda x: x == "b"))
[['a'], ['c', 'd', 'c'], ['a']]

>>> "abcdcba".split("b")
['a', 'cdc', 'a']

more_itertools is a third-party package.


回答 10

我想展示如何使用finditer方案为给定的定界符返回一个匹配生成器,然后使用itertools中的pairwise配方构建“前一个/当前”的成对迭代,从而得到与原始split方法相同的单词。


from more_itertools import pairwise
import re

string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
delimiter = " "
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer("^|[{0}]+|$".format(delimiter), string)):
    print(string[prev.end(): curr.start()])

注意:

  1. 我使用prev&curr而不是prev&next,因为在python中覆盖next是一个非常糟糕的主意
  2. 这相当高效

I wanted to show how to use the finditer solution to return a generator for given delimiters and then use the pairwise recipe from itertools to build a previous/next iteration which will get the actual words as in the original split method.


from more_itertools import pairwise
import re

string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
delimiter = " "
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer("^|[{0}]+|$".format(delimiter), string)):
    print(string[prev.end(): curr.start()])

note:

  1. I use prev & curr instead of prev & next because overriding next in python is a very bad idea
  2. This is quite efficient

回答 11

最笨的方法,不使用正则表达式/ itertools:

def isplit(text, split='\n'):
    while text != '':
        end = text.find(split)

        if end == -1:
            yield text
            text = ''
        else:
            yield text[:end]
            text = text[end + len(split):]  # advance past the whole separator

Dumbest method, without regex / itertools:

def isplit(text, split='\n'):
    while text != '':
        end = text.find(split)

        if end == -1:
            yield text
            text = ''
        else:
            yield text[:end]
            text = text[end + len(split):]  # advance past the whole separator

回答 12

def split_generator(f, s):
    """
    f is a string, s is the single-character separator we split on.
    This produces a generator rather than a possibly
    memory intensive list.
    """
    i = 0
    j = 0
    while j < len(f):
        if i >= len(f):
            yield f[j:]
            j = i
        elif f[i] != s:
            i = i + 1
        else:
            yield f[j:i]  # yield the segment itself, not a one-element list
            j = i + 1
            i = i + 1

def split_generator(f, s):
    """
    f is a string, s is the single-character separator we split on.
    This produces a generator rather than a possibly
    memory intensive list.
    """
    i = 0
    j = 0
    while j < len(f):
        if i >= len(f):
            yield f[j:]
            j = i
        elif f[i] != s:
            i = i + 1
        else:
            yield f[j:i]  # yield the segment itself, not a one-element list
            j = i + 1
            i = i + 1

回答 13

这是一个简单的回应

def gen_str(some_string, sep):
    j = 0
    guard = len(some_string) - 1
    for i, s in enumerate(some_string):
        if s == sep:
            yield some_string[j:i]
            j = i + 1
        elif i != guard:
            continue
        else:
            yield some_string[j:]

here is a simple response

def gen_str(some_string, sep):
    j = 0
    guard = len(some_string) - 1
    for i, s in enumerate(some_string):
        if s == sep:
            yield some_string[j:i]
            j = i + 1
        elif i != guard:
            continue
        else:
            yield some_string[j:]
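A self-contained check of gen_str; note that, unlike str.split, a trailing separator drops the final empty field:

```python
def gen_str(some_string, sep):
    # restated here so the example runs standalone
    j = 0
    guard = len(some_string) - 1
    for i, s in enumerate(some_string):
        if s == sep:
            yield some_string[j:i]
            j = i + 1
        elif i != guard:
            continue
        else:
            yield some_string[j:]

print(list(gen_str("a,b,c", ",")))  # ['a', 'b', 'c']
print(list(gen_str("a,b,", ",")))   # ['a', 'b'] -- str.split would give ['a', 'b', '']
```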

如何在python中获取当前日期时间的字符串格式?

问题:如何在python中获取当前日期时间的字符串格式?

例如,在2010年7月5日,我想计算字符串

 July 5, 2010

应该怎么做?

For example, on July 5, 2010, I would like to calculate the string

 July 5, 2010

How should this be done?


回答 0

您可以使用该datetime模块在Python中处理日期和时间。该strftime方法允许您使用指定的格式生成日期和时间的字符串表示形式。

>>> import datetime
>>> datetime.date.today().strftime("%B %d, %Y")
'July 23, 2010'
>>> datetime.datetime.now().strftime("%I:%M%p on %B %d, %Y")
'10:36AM on July 23, 2010'

You can use the datetime module for working with dates and times in Python. The strftime method allows you to produce string representation of dates and times with a format you specify.

>>> import datetime
>>> datetime.date.today().strftime("%B %d, %Y")
'July 23, 2010'
>>> datetime.datetime.now().strftime("%I:%M%p on %B %d, %Y")
'10:36AM on July 23, 2010'
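As a quick sketch (assuming an English/C locale, since %B is locale-dependent): strptime is the inverse of strftime, and the day can be interpolated without zero-padding via an f-string, which gets exactly the "July 5, 2010" form the question asks for:

```python
import datetime

d = datetime.date(2010, 7, 5)
s = d.strftime("%B %d, %Y")   # note: %d is zero-padded
print(s)                      # July 05, 2010

# strptime parses the string back into a date
assert datetime.datetime.strptime(s, "%B %d, %Y").date() == d

# for "July 5, 2010" without the leading zero, portably:
print(f"{d:%B} {d.day}, {d:%Y}")  # July 5, 2010
```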

回答 1

#python3

import datetime
print(
    '1: test-{date:%Y-%m-%d_%H:%M:%S}.txt'.format( date=datetime.datetime.now() )
    )

d = datetime.datetime.now()
print( "2a: {:%B %d, %Y}".format(d))

# the f" prefix marks this as an f-string; no .format needed
print(f"2b: {d:%B %d, %Y}")

print(f"3: Today is {datetime.datetime.now():%Y-%m-%d} yay")

1: test-2018-02-14_16:40:52.txt

2a: March 04, 2018

2b: March 04, 2018

3: Today is 2018-11-11 yay


描述:

使用新式字符串格式化,把值注入到字符串的占位符 {} 中,这里的值是当前时间。

然后,不只是把原始值按 {} 原样显示,而是使用格式说明符来获得正确的日期格式。

https://docs.python.org/3/library/string.html#formatexamples

#python3

import datetime
print(
    '1: test-{date:%Y-%m-%d_%H:%M:%S}.txt'.format( date=datetime.datetime.now() )
    )

d = datetime.datetime.now()
print( "2a: {:%B %d, %Y}".format(d))

# the f" prefix marks this as an f-string; no .format needed
print(f"2b: {d:%B %d, %Y}")

print(f"3: Today is {datetime.datetime.now():%Y-%m-%d} yay")

1: test-2018-02-14_16:40:52.txt

2a: March 04, 2018

2b: March 04, 2018

3: Today is 2018-11-11 yay


Description:

Using the new string format to inject value into a string at placeholder {}, value is the current time.

Then rather than just displaying the raw value as {}, use formatting to obtain the correct date format.

https://docs.python.org/3/library/string.html#formatexamples


回答 2

>>> import datetime
>>> now = datetime.datetime.now()
>>> now.strftime("%B %d, %Y")
'July 23, 2010'
>>> import datetime
>>> now = datetime.datetime.now()
>>> now.strftime("%B %d, %Y")
'July 23, 2010'

回答 3

如果您不关心格式,只需要一些快速日期,则可以使用以下方法:

import time
print(time.ctime())

If you don’t care about formatting and you just need some quick date, you can use this:

import time
print(time.ctime())

在python中处理list.index(可能不存在)的最佳方法?

问题:在python中处理list.index(可能不存在)的最佳方法?

我有看起来像这样的代码:

thing_index = thing_list.index(thing)
otherfunction(thing_list, thing_index)

好的,这是简化过的,但你明白意思。现在 thing 可能实际上并不在列表中,这种情况下我想把 -1 作为 thing_index 传过去。在其他语言中,如果找不到元素,这正是你期望 index() 返回的值。而实际上,它抛出一个 ValueError。

我可以这样做:

try:
    thing_index = thing_list.index(thing)
except ValueError:
    thing_index = -1
otherfunction(thing_list, thing_index)

但这感觉很脏,而且我不知道 ValueError 是否会因为其他原因被抛出。我想出了下面这个基于生成器的方案,但似乎有点复杂:

thing_index = ([i for i in xrange(len(thing_list)) if thing_list[i] == thing] or [-1])[0]

有没有一种更清洁的方法来实现同一目标?假设列表未排序。

I have code which looks something like this:

thing_index = thing_list.index(thing)
otherfunction(thing_list, thing_index)

ok so that’s simplified but you get the idea. Now thing might not actually be in the list, in which case I want to pass -1 as thing_index. In other languages this is what you’d expect index() to return if it couldn’t find the element. In fact it throws a ValueError.

I could do this:

try:
    thing_index = thing_list.index(thing)
except ValueError:
    thing_index = -1
otherfunction(thing_list, thing_index)

But this feels dirty, plus I don’t know if ValueError could be raised for some other reason. I came up with the following solution based on generator functions, but it seems a little complex:

thing_index = ([i for i in xrange(len(thing_list)) if thing_list[i] == thing] or [-1])[0]

Is there a cleaner way to achieve the same thing? Let’s assume the list isn’t sorted.


回答 0

使用 try-except 子句没有任何“肮脏”之处,这正是 Python 的方式。ValueError 只会由 .index 方法引发,因为那是你那里仅有的代码!

要回答评论:
在 Python 中,“请求原谅比获得许可更容易”(EAFP)的理念早已深入人心;index 不会因为任何其他问题抛出这种类型的错误,至少我想不出来。

There is nothing “dirty” about using a try-except clause. This is the pythonic way. ValueError will be raised by the .index method only, because it’s the only code you have there!

To answer the comment:
In Python, the “easier to ask forgiveness than permission” (EAFP) philosophy is well established, and no, index will not raise this type of error for any other issue. Not that I can think of any.


回答 1

thing_index = thing_list.index(elem) if elem in thing_list else -1

一行代码。简单。没有异常。

thing_index = thing_list.index(elem) if elem in thing_list else -1

One line. Simple. No exceptions.
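One caveat worth noting: this form scans the list twice in the worst case, once for the `in` test and once more for `.index`. A sketch, wrapped in a hypothetical helper:

```python
thing_list = ["a", "b", "c"]

def index_or_default(lst, elem, default=-1):
    # `in` does one linear scan, .index() then does a second one
    return lst.index(elem) if elem in lst else default

print(index_or_default(thing_list, "b"))  # 1
print(index_or_default(thing_list, "z"))  # -1
```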


回答 2

dict 类型有一个 get 方法:如果键不存在于字典中,get 的第二个参数就是它应返回的值。类似地还有 setdefault:如果键存在,则返回字典中的值;否则按你给的默认参数设置值,然后返回该默认参数。

你可以扩展 list 类型,为它添加一个 getindexdefault 方法。

class SuperDuperList(list):
    def getindexdefault(self, elem, default):
        try:
            thing_index = self.index(elem)
            return thing_index
        except ValueError:
            return default

然后可以这样使用:

mylist = SuperDuperList([0,1,2])
index = mylist.getindexdefault( 'asdf', -1 )

The dict type has a get function, where if the key doesn’t exist in the dictionary, the 2nd argument to get is the value that it should return. Similarly there is setdefault, which returns the value in the dict if the key exists, otherwise it sets the value according to your default parameter and then returns your default parameter.

You could extend the list type to have a getindexdefault method.

class SuperDuperList(list):
    def getindexdefault(self, elem, default):
        try:
            thing_index = self.index(elem)
            return thing_index
        except ValueError:
            return default

Which could then be used like:

mylist = SuperDuperList([0,1,2])
index = mylist.getindexdefault( 'asdf', -1 )

回答 3

你那段用到 ValueError 的代码没有任何问题。如果你想避免异常,这里还有另一个一行式写法:

thing_index = next((i for i, x in enumerate(thing_list) if x == thing), -1)

There is nothing wrong with your code that uses ValueError. Here’s yet another one-liner if you’d like to avoid exceptions:

thing_index = next((i for i, x in enumerate(thing_list) if x == thing), -1)
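With hypothetical sample data, this returns the first matching index in a single pass, and because it only uses enumerate it works on any iterable, not just lists:

```python
thing_list = [10, 20, 30]
thing = 20

# single pass; -1 is the default when nothing matches
thing_index = next((i for i, x in enumerate(thing_list) if x == thing), -1)
print(thing_index)  # 1
print(next((i for i, x in enumerate(thing_list) if x == 99), -1))  # -1
```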

回答 4

这个问题属于语言哲学层面。例如在 Java 中,一直有一种传统:异常只应在发生错误的“异常情况”下使用,而不是用于流程控制。最初这是出于性能原因,因为 Java 的异常很慢,但现在这已成为公认的风格。

相反,Python 一直使用异常来表达正常的程序流,比如抛出我们这里讨论的 ValueError。Python 风格对此没有任何“不洁”可言,类似的例子还有很多。一个更常见的例子是 StopIteration 异常,它由迭代器的 next() 方法引发,用来表示没有更多的值了。

This issue is one of language philosophy. In Java for example there has always been a tradition that exceptions should really only be used in “exceptional circumstances” that is when errors have happened, rather than for flow control. In the beginning this was for performance reasons as Java exceptions were slow but now this has become the accepted style.

In contrast, Python has always used exceptions to indicate normal program flow, like raising the ValueError we are discussing here. There is nothing “dirty” about this in Python style, and there are many more where that came from. An even more common example is the StopIteration exception, which is raised by an iterator’s next() method to signal that there are no further values.


回答 5

如果你经常这样做,那最好把它封装到一个辅助函数里:

def index_of(val, in_list):
    try:
        return in_list.index(val)
    except ValueError:
        return -1 

If you are doing this often then it is better to stow it away in a helper function:

def index_of(val, in_list):
    try:
        return in_list.index(val)
    except ValueError:
        return -1 

回答 6

这样如何 😃 :

li = [1,2,3,4,5] # create list 

li = dict(zip(li,range(len(li)))) # convert List To Dict 
print( li ) # {1: 0, 2: 1, 3: 2, 4: 3, 5: 4}
li.get(20) # None 
li.get(1)  # 0 

What about this 😃 :

li = [1,2,3,4,5] # create list 

li = dict(zip(li,range(len(li)))) # convert List To Dict 
print( li ) # {1: 0, 2: 1, 3: 2, 4: 3, 5: 4}
li.get(20) # None 
li.get(1)  # 0 
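The same lookup table can be built a bit more directly with a dict comprehension over enumerate; a sketch, not part of the original answer. Note that with duplicate values the last occurrence wins, and `.get` takes a default, so `-1` can be returned instead of `None`:

```python
li = [1, 2, 3, 4, 5]

# value -> index; for duplicate values, the last occurrence wins
pos = {v: i for i, v in enumerate(li)}

print(pos.get(20, -1))  # -1 instead of None
print(pos.get(1, -1))   # 0
```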

回答 7

那这个呢:

otherfunction(thing_collection, thing)

与其在函数接口中暴露像列表索引这样依赖实现的东西,不如把集合和对象一起传进去,让 otherfunction 去处理“成员资格测试”的问题。如果 otherfunction 写得与集合类型无关,那么它大概会以下面的代码开头:

if thing in thing_collection:
    ... proceed with operation on thing

如果thing_collection是一个列表,元组,集合或字典,它将起作用。

这可能比以下更清楚:

if thing_index != MAGIC_VALUE_INDICATING_NOT_A_MEMBER:

这就是你在 otherfunction 中已有的代码。

What about this:

otherfunction(thing_collection, thing)

Rather than expose something so implementation-dependent like a list index in a function interface, pass the collection and the thing and let otherfunction deal with the “test for membership” issues. If otherfunction is written to be collection-type-agnostic, then it would probably start with:

if thing in thing_collection:
    ... proceed with operation on thing

which will work if thing_collection is a list, tuple, set, or dict.

This is possibly clearer than:

if thing_index != MAGIC_VALUE_INDICATING_NOT_A_MEMBER:

which is the code you already have in otherfunction.


回答 8

像这样:

temp_inx = (L + [x]).index(x) 
inx = temp_inx if temp_inx < len(L) else -1

What about like this:

temp_inx = (L + [x]).index(x) 
inx = temp_inx if temp_inx < len(L) else -1
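Spelled out as a hypothetical helper: appending x guarantees that .index() succeeds, and a result at or past the original length means "not found". Note that it copies the list, so it costs O(n) extra memory per lookup.

```python
def index_or_neg1(L, x):
    temp_inx = (L + [x]).index(x)  # cannot raise: x is certainly present
    return temp_inx if temp_inx < len(L) else -1

print(index_or_neg1([1, 2, 3], 2))  # 1
print(index_or_neg1([1, 2, 3], 9))  # -1
```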

回答 9

列表的 .index() 方法也让我遇到了同样的问题。我对它抛出异常这一点没有意见,但我强烈不认同它抛出的是一个缺乏描述性的 ValueError;如果是 IndexError,我倒还能理解。

我也明白为什么返回 -1 会有问题,因为它在 Python 中是一个有效的索引。但实际上,我从不期望 .index() 方法返回负数。

下面是一行代码(好吧,这行相当长……),它只遍历列表一次,找不到该项时返回 None。如果你愿意,把它改成返回 -1 也很容易。

from functools import reduce  # needed on Python 3

indexOf = lambda lst, thing: \
            reduce(lambda acc, pair:
                   pair[0] if acc is None and pair[1] == thing else acc,
                   enumerate(lst), None)

如何使用:

>>> indexOf([1,2,3], 4)
>>>
>>> indexOf([1,2,3], 1)
0
>>>

I have the same issue with the “.index()” method on lists. I have no issue with the fact that it throws an exception but I strongly disagree with the fact that it’s a non-descriptive ValueError. I could understand if it would’ve been an IndexError, though.

I can see why returning “-1” would be an issue too because it’s a valid index in Python. But realistically, I never expect a “.index()” method to return a negative number.

Here goes a one liner (ok, it’s a rather long line …), goes through the list exactly once and returns “None” if the item isn’t found. It would be trivial to rewrite it to return -1, should you so desire.

from functools import reduce  # needed on Python 3

indexOf = lambda lst, thing: \
            reduce(lambda acc, pair:
                   pair[0] if acc is None and pair[1] == thing else acc,
                   enumerate(lst), None)

How to use:

>>> indexOf([1,2,3], 4)
>>>
>>> indexOf([1,2,3], 1)
0
>>>

回答 10

我不知道你为什么觉得它脏……是因为异常吗?如果你想要一行代码,这就是:

thing_index = thing_list.index(elem) if thing_list.count(elem) else -1

但是我建议不要使用它。我认为 Ross Rogers 的解决方案是最好的:用一个对象封装你想要的行为,不要为了把语言推向极限而牺牲可读性。

I don’t know why you should think it is dirty… because of the exception? if you want a oneliner, here it is:

thing_index = thing_list.index(elem) if thing_list.count(elem) else -1

But I would advise against using it; I think Ross Rogers’ solution is the best: use an object to encapsulate your desired behaviour, and don’t try pushing the language to its limits at the cost of readability.


回答 11

我建议:

if thing in thing_list:
  list_index = thing_list.index(thing)
else:
  list_index = -1

I’d suggest:

if thing in thing_list:
  list_index = thing_list.index(thing)
else:
  list_index = -1

从Python中的打开文件获取路径

问题:从Python中的打开文件获取路径

如果我有一个打开的文件,是否有os调用以字符串的形式获取完整路径?

f = open('/Users/Desktop/febROSTER2012.xls')

f,我将如何获得"/Users/Desktop/febROSTER2012.xls"

If I have an opened file, is there an os call to get the complete path as a string?

f = open('/Users/Desktop/febROSTER2012.xls')

From f, how would I get "/Users/Desktop/febROSTER2012.xls" ?


回答 0

此处的关键是表示已打开文件的 f 对象的 name 属性。你可以这样获取它:

>>> f = open('/Users/Desktop/febROSTER2012.xls')
>>> f.name
'/Users/Desktop/febROSTER2012.xls'

有帮助吗?

The key here is the name attribute of the f object representing the opened file. You get it like that:

>>> f = open('/Users/Desktop/febROSTER2012.xls')
>>> f.name
'/Users/Desktop/febROSTER2012.xls'

Does it help?


回答 1

我遇到过完全相同的问题。如果你用相对路径打开文件,os.path.dirname(path) 只会返回相对路径。os.path.realpath 可以解决这个问题:

>>> import os
>>> f = open('file.txt')
>>> os.path.realpath(f.name)

I had the exact same issue. If you are using a relative path os.path.dirname(path) will only return the relative path. os.path.realpath does the trick:

>>> import os
>>> f = open('file.txt')
>>> os.path.realpath(f.name)

回答 2

而且,如果你只想获得目录名而不需要随之而来的文件名,可以用 Python 的 os 模块按下面的常规方式来做。

>>> import os
>>> f = open('/Users/Desktop/febROSTER2012.xls')
>>> os.path.dirname(f.name)
'/Users/Desktop'

这样,您就可以掌握目录结构。

And if you just want to get the directory name and no need for the filename coming with it, then you can do that in the following conventional way using os Python module.

>>> import os
>>> f = open('/Users/Desktop/febROSTER2012.xls')
>>> os.path.dirname(f.name)
'/Users/Desktop'

This way you can get hold of the directory structure.


回答 3

您也可以这样获得它。

filepath = os.path.abspath(f.name)

You can get it like this also.

filepath = os.path.abspath(f.name)

pandas 可以使用列作为索引吗?

问题:pandas 可以使用列作为索引吗?

我有一个像这样的电子表格:

Locality    2005    2006    2007    2008    2009

ABBOTSFORD  427000  448000  602500  600000  638500
ABERFELDIE  534000  600000  735000  710000  775000
AIREYS INLET 459000 440000  430000  517500  512500

我不想手动把列与行交换。能否像这样用 pandas 把数据读取到列表中:

data['ABBOTSFORD']=[427000,448000,602500,600000,638500]
data['ABERFELDIE']=[534000,600000,735000,710000,775000]
data['AIREYS INLET']=[459000,440000,430000,517500,512500]

I have a spreadsheet like this:

Locality    2005    2006    2007    2008    2009

ABBOTSFORD  427000  448000  602500  600000  638500
ABERFELDIE  534000  600000  735000  710000  775000
AIREYS INLET 459000 440000  430000  517500  512500

I don’t want to manually swap the column with the row. Could it be possible to use pandas reading data to a list as this:

data['ABBOTSFORD']=[427000,448000,602500,600000,638500]
data['ABERFELDIE']=[534000,600000,735000,710000,775000]
data['AIREYS INLET']=[459000,440000,430000,517500,512500]

回答 0

是的,使用 set_index 可以把 Locality 变成行索引。

data.set_index('Locality', inplace=True)

如果未提供 inplace=True,set_index 会把修改后的数据帧作为结果返回。

例:

> import pandas as pd
> df = pd.DataFrame([['ABBOTSFORD', 427000, 448000],
                     ['ABERFELDIE', 534000, 600000]],
                    columns=['Locality', 2005, 2006])

> df
     Locality    2005    2006
0  ABBOTSFORD  427000  448000
1  ABERFELDIE  534000  600000

> df.set_index('Locality', inplace=True)
> df
              2005    2006
Locality                  
ABBOTSFORD  427000  448000
ABERFELDIE  534000  600000

> df.loc['ABBOTSFORD']
2005    427000
2006    448000
Name: ABBOTSFORD, dtype: int64

> df.loc['ABBOTSFORD'][2005]
427000

> df.loc['ABBOTSFORD'].values
array([427000, 448000])

> df.loc['ABBOTSFORD'].tolist()
[427000, 448000]

Yes, with set_index you can make Locality your row index.

data.set_index('Locality', inplace=True)

If inplace=True is not provided, set_index returns the modified dataframe as a result.

Example:

> import pandas as pd
> df = pd.DataFrame([['ABBOTSFORD', 427000, 448000],
                     ['ABERFELDIE', 534000, 600000]],
                    columns=['Locality', 2005, 2006])

> df
     Locality    2005    2006
0  ABBOTSFORD  427000  448000
1  ABERFELDIE  534000  600000

> df.set_index('Locality', inplace=True)
> df
              2005    2006
Locality                  
ABBOTSFORD  427000  448000
ABERFELDIE  534000  600000

> df.loc['ABBOTSFORD']
2005    427000
2006    448000
Name: ABBOTSFORD, dtype: int64

> df.loc['ABBOTSFORD'][2005]
427000

> df.loc['ABBOTSFORD'].values
array([427000, 448000])

> df.loc['ABBOTSFORD'].tolist()
[427000, 448000]

回答 1

如前所述,你可以用 set_index 更改索引。你也无需手动交换行与列:pandas 中有一个 transpose 方法(data.T)可以为你完成:

> df = pd.DataFrame([['ABBOTSFORD', 427000, 448000],
                    ['ABERFELDIE', 534000, 600000]],
                    columns=['Locality', 2005, 2006])

> newdf = df.set_index('Locality').T
> newdf

Locality    ABBOTSFORD  ABERFELDIE
2005        427000      534000
2006        448000      600000

然后您可以获取数据框列值并将其转换为列表:

> newdf['ABBOTSFORD'].values.tolist()

[427000, 448000]

You can change the index as explained already using set_index. You don’t need to manually swap rows with columns, there is a transpose (data.T) method in pandas that does it for you:

> df = pd.DataFrame([['ABBOTSFORD', 427000, 448000],
                    ['ABERFELDIE', 534000, 600000]],
                    columns=['Locality', 2005, 2006])

> newdf = df.set_index('Locality').T
> newdf

Locality    ABBOTSFORD  ABERFELDIE
2005        427000      534000
2006        448000      600000

then you can fetch the dataframe column values and transform them to a list:

> newdf['ABBOTSFORD'].values.tolist()

[427000, 448000]
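To get exactly the dict-of-lists the question asks for, the transposed frame can be converted in one step with to_dict('list'); a sketch based on the same sample data:

```python
import pandas as pd

df = pd.DataFrame([['ABBOTSFORD', 427000, 448000],
                   ['ABERFELDIE', 534000, 600000]],
                  columns=['Locality', 2005, 2006])

# columns of the transposed frame become the dict keys
data = df.set_index('Locality').T.to_dict('list')
print(data['ABBOTSFORD'])  # [427000, 448000]
print(data['ABERFELDIE'])  # [534000, 600000]
```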

回答 2

在 Pandas 中从电子表格读取数据时,你可以用 index_col 参数来设置索引列。

这是我的解决方案:

  1. 首先,导入 pandas 并命名为 pd: import pandas as pd

  2. 使用 pd.read_excel() 读入文件(如果你的数据在电子表格中),并通过指定 index_col 参数把索引设为 “Locality”。

    df = pd.read_excel('testexcel.xlsx', index_col=0)

    在此阶段,如果出现“没有名为 xlrd 的模块”的错误,请用 pip install xlrd 安装它。

  3. 为了直观检查,用 df.head() 查看数据框,它会打印前几行数据

  4. 现在,您可以获取数据框所需列的值并进行打印

You can set the column index using index_col parameter available while reading from spreadsheet in Pandas.

Here is my solution:

  1. Firstly, import pandas as pd: import pandas as pd

  2. Read in filename using pd.read_excel() (if you have your data in a spreadsheet) and set the index to ‘Locality’ by specifying the index_col parameter.

    df = pd.read_excel('testexcel.xlsx', index_col=0)

    At this stage if you get a ‘no module named xlrd’ error, install it using pip install xlrd.

  3. For visual inspection, read the dataframe using df.head(), which will print the first few rows

  4. Now you can fetch the values of the desired columns of the dataframe and print it
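The same index_col idea works with pd.read_csv as well; here is a self-contained sketch using an in-memory CSV instead of the Excel file (note that with read_csv the year headers come back as strings):

```python
import io
import pandas as pd

csv = "Locality,2005,2006\nABBOTSFORD,427000,448000\nABERFELDIE,534000,600000\n"
df = pd.read_csv(io.StringIO(csv), index_col=0)  # 'Locality' becomes the index

print(df.loc["ABBOTSFORD", "2005"])  # 427000
print(list(df.index))                # ['ABBOTSFORD', 'ABERFELDIE']
```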

