Tag archive: file

What is the perfect counterpart in Python for "while not EOF"?

Question: What is the perfect counterpart in Python for "while not EOF"?


To read some text file, in C or Pascal, I always use the following snippets to read the data until EOF:

while not eof do begin
  readline(a);
  do_something;
end;

Thus, I wonder how I can do this simply and quickly in Python?


Answer 0


Loop over the file to read lines:

with open('somefile') as openfileobject:
    for line in openfileobject:
        do_something()

File objects are iterable and yield lines until EOF. Using the file object as an iterable uses a buffer to ensure performant reads.

You can do the same with stdin (no need to use raw_input()):

import sys

for line in sys.stdin:
    do_something()

To complete the picture, binary reads can be done with:

from functools import partial

with open('somefile', 'rb') as openfileobject:
    for chunk in iter(partial(openfileobject.read, 1024), b''):
        do_something()

where chunk will contain up to 1024 bytes at a time from the file, and iteration stops when openfileobject.read(1024) starts returning empty byte strings.


Answer 1


You can imitate the C idiom in Python.

To read a buffer up to max_size number of bytes, you can do this:

with open(filename, 'rb') as f:
    while True:
        buf = f.read(max_size)
        if not buf:
            break
        process(buf)

Or, a text file line by line:

# warning -- not idiomatic Python! See below...
with open(filename, 'rb') as f:
    while True:
        line = f.readline()
        if not line:
            break
        process(line)

You need to use the while True / break construct, since there is no EOF test in Python other than the lack of bytes returned from a read.

In C, you might have:

while ((ch != '\n') && (ch != EOF)) {
   // read the next ch and add to a buffer
   // ..
}

However, you cannot have this in Python:

 while (line = f.readline()):
     # syntax error

because assignments are not allowed in expressions in Python (although recent versions of Python can mimic this using assignment expressions, see below).

It is certainly more idiomatic in Python to do this:

# THIS IS IDIOMATIC Python. Do this:
with open('somefile') as f:
    for line in f:
        process(line)

Update: Since Python 3.8 you may also use assignment expressions:

while line := f.readline():
    process(line)

Answer 2


The Python idiom for opening a file and reading it line-by-line is:

with open('filename') as f:
    for line in f:
        do_something(line)

The file will be automatically closed at the end of the above code (the with construct takes care of that).

Finally, it is worth noting that line will preserve the trailing newline. This can be easily removed using:

line = line.rstrip()
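
Note that rstrip() with no argument removes all trailing whitespace, not just the newline. A minimal sketch, in case trailing tabs or spaces in the line matter:

line = line.rstrip('\n')  # strip only the trailing newline, keep other trailing whitespace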

Answer 3


You can use the code snippet below to read line by line, until the end of the file:

line = obj.readline()
while line != '':
    # Do Something
    line = obj.readline()

Answer 4


While there are suggestions above for "doing it the Python way", if you really want logic based on EOF, then I suppose using exception handling is the way to do it:

try:
    line = raw_input()
    ... whatever needs to be done in case of no EOF ...
except EOFError:
    ... whatever needs to be done in case of EOF ...

Example:

$ echo test | python -c "while True: print raw_input()"
test
Traceback (most recent call last):
  File "<string>", line 1, in <module> 
EOFError: EOF when reading a line

Or press Ctrl-Z (Windows) or Ctrl-D (Linux) at a raw_input() prompt.


Answer 5


You can use the following code snippet. readlines() reads the whole file at once and splits it into a list of lines:

lines = obj.readlines()
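
A minimal usage sketch (assuming obj is a file object opened for reading, and do_something is whatever processing you need):

with open('somefile') as obj:
    lines = obj.readlines()  # list of lines, each keeping its trailing '\n'
    for line in lines:
        do_something(line)

Keep in mind that this loads the whole file into memory, so it only suits reasonably small files.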

Answer 6


In addition to @dawg's great answer, here is an equivalent solution using the walrus operator (Python >= 3.8):

with open(filename, 'rb') as f:
    while buf := f.read(max_size):
        process(buf)

File name and line number of a Python script

Question: File name and line number of a Python script


How can I get the file name and line number in a Python script?

Exactly the file information we get from an exception traceback, but in this case without raising an exception.


Answer 0


Thanks to mcandre, the answer is:

#python3
from inspect import currentframe, getframeinfo

frameinfo = getframeinfo(currentframe())

print(frameinfo.filename, frameinfo.lineno)

Answer 1


Whether you use currentframe().f_back depends on whether you are using a function or not.

Calling inspect directly:

from inspect import currentframe, getframeinfo

cf = currentframe()
filename = getframeinfo(cf).filename

print "This is line 5, python says line ", cf.f_lineno 
print "The filename is ", filename

Calling a function that does it for you:

from inspect import currentframe

def get_linenumber():
    cf = currentframe()
    return cf.f_back.f_lineno

print "This is line 7, python says line ", get_linenumber()

Answer 2


Handy if used in a common file; it prints the file name, line number, and function of the caller:

import inspect

def getLineInfo():
    caller = inspect.stack()[1]
    # caller[1] = file name, caller[2] = line number, caller[3] = function name
    print(caller[1], ":", caller[2], ":", caller[3])

Answer 3


Filename:

import sys

__file__
# or
sys.argv[0]

Line:

import inspect

inspect.currentframe().f_lineno

(not inspect.currentframe().f_back.f_lineno as mentioned above)


Answer 4


You can also use sys (Python 2 shown, matching the output below):

import sys

print dir(sys._getframe())
print dir(sys._getframe().f_lineno)
print sys._getframe().f_lineno

The output is:

['__class__', '__delattr__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'f_back', 'f_builtins', 'f_code', 'f_exc_traceback', 'f_exc_type', 'f_exc_value', 'f_globals', 'f_lasti', 'f_lineno', 'f_locals', 'f_restricted', 'f_trace']
['__abs__', '__add__', '__and__', '__class__', '__cmp__', '__coerce__', '__delattr__', '__div__', '__divmod__', '__doc__', '__float__', '__floordiv__', '__format__', '__getattribute__', '__getnewargs__', '__hash__', '__hex__', '__index__', '__init__', '__int__', '__invert__', '__long__', '__lshift__', '__mod__', '__mul__', '__neg__', '__new__', '__nonzero__', '__oct__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdiv__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rlshift__', '__rmod__', '__rmul__', '__ror__', '__rpow__', '__rrshift__', '__rshift__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__trunc__', '__xor__', 'bit_length', 'conjugate', 'denominator', 'imag', 'numerator', 'real']
14

Answer 5


Just to contribute, there is a linecache module in Python; here are two links that can help.

linecache module documentation
linecache source code

In a sense, you can "dump" a whole file into its cache, and read it back from the linecache.cache data.

import linecache as allLines
## fileName in linecache behaves as in any other open statement: you will need a path to the file if it is not in the same directory as the script
linesList = allLines.updatecache(fileName, None)
for i, x in enumerate(linesList):
    print(i, x)  # prints the line number and content
# or, for more info:
print(allLines.cache)
# or, if you need a specific line:
specLine = allLines.getline(fileName, numbOfLine)
# returns the text of the line at that line number

For error handling, you can simply use:

from sys import exc_info
try:
    raise YourError  # or some other error
except Exception:
    print(exc_info())

Answer 6

import inspect

file_name = __file__
current_line_no = inspect.stack()[0][2]
current_function_name = inspect.stack()[0][3]

# Try printing inspect.stack(); you can see the current stack and pick whatever you want

Answer 7


In Python 3 you can use a variation on:

import sys

def Deb(msg=None):
    print(f"Debug {sys._getframe().f_back.f_lineno}: {msg if msg is not None else ''}")

In code, you can then use:

Deb("Some useful information")
Deb()

To produce:

123: Some useful information
124:

Where the 123 and 124 are the lines that the calls are made from.


Answer 8


Here’s what works for me to get the line number in Python 3.7.3 in VSCode 1.39.2 (dmsg is my mnemonic for debug message):

import inspect

def dmsg(text_s):
    print(str(inspect.currentframe().f_back.f_lineno) + '| ' + text_s)

To call showing a variable name_s and its value:

name_s = put_code_here
dmsg('name_s: ' + name_s)

Output looks like this:

37| name_s: value_of_variable_at_line_37

Python 2.7: print to file

Question: Python 2.7: print to file


Why does trying to print directly to a file instead of sys.stdout produce the following syntax error:

Python 2.7.2+ (default, Oct  4 2011, 20:06:09)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f1=open('./testfile', 'w+')
>>> print('This is a test', file=f1)
  File "<stdin>", line 1
    print('This is a test', file=f1)
                            ^
SyntaxError: invalid syntax

From help(__builtins__) I have the following info:

print(...)
    print(value, ..., sep=' ', end='\n', file=sys.stdout)

    Prints the values to a stream, or to sys.stdout by default.
    Optional keyword arguments:
    file: a file-like object (stream); defaults to the current sys.stdout.
    sep:  string inserted between values, default a space.
    end:  string appended after the last value, default a newline.

So what would be the right syntax to change the standard stream print writes to?

I know that there are different, maybe better, ways to write to a file, but I really don't get why this should be a syntax error…

A nice explanation would be appreciated!


Answer 0


If you want to use the print function in Python 2, you have to import from __future__:

from __future__ import print_function

But you can have the same effect without using the function, too:

print >>f1, 'This is a test'

Answer 1


print is a keyword in Python 2.x. You should use the following:

f1=open('./testfile', 'w+')
f1.write('This is a test')
f1.close()

Answer 2


print(args, file=f1) is the python 3.x syntax. For python 2.x use print >> f1, args.


Answer 3


You can send your print output to a file without changing any code. Simply open a terminal window and run your code this way:

python yourcode.py >> log.txt

Note that >> appends to log.txt; a single > would overwrite it instead.

Answer 4


This will redirect your ‘print’ output to a file:

import sys
sys.stdout = open("file.txt", "w+")
print "this line will redirect to file.txt"

Answer 5


In Python 3.0+, print is a function, which you'd call with print(...). In earlier versions, print is a statement, which you'd write as print ....

To print to a file in Python earlier than 3.0, you'd do:

print >> f, 'what ever %d' % i

The >> operator directs print to the file f.


Copying multiple files in Python

Question: Copying multiple files in Python


How do I copy all of the files in one directory to another directory using Python? I have the source path and the destination path as strings.


Answer 0


You can use os.listdir() to get the files in the source directory, os.path.isfile() to see if they are regular files (including symbolic links on *nix systems), and shutil.copy to do the copying.

The following code copies only the regular files from the source directory into the destination directory (I’m assuming you don’t want any sub-directories copied).

import os
import shutil
src_files = os.listdir(src)
for file_name in src_files:
    full_file_name = os.path.join(src, file_name)
    if os.path.isfile(full_file_name):
        shutil.copy(full_file_name, dest)

Answer 1


If you don't want to copy the whole tree (with subdirs etc), use glob.glob("path/to/dir/*.*") to get a list of all the filenames, loop over the list and use shutil.copy to copy each file.

import glob
import os
import shutil

for filename in glob.glob(os.path.join(source_dir, '*.*')):
    shutil.copy(filename, dest_dir)

Answer 2


Look at shutil in the Python docs, specifically the copytree command.

If the destination directory already exists, try:

shutil.copytree(source, destination, dirs_exist_ok=True)  # dirs_exist_ok requires Python 3.8+

Answer 3

import glob
import os
import shutil

def recursive_copy_files(source_path, destination_path, override=False):
    """
    Recursively copies files from the source to the destination directory.
    :param source_path: source directory
    :param destination_path: destination directory
    :param override: if True all files will be overridden, otherwise skip if the file exists
    :return: count of copied files
    """
    files_count = 0
    if not os.path.exists(destination_path):
        os.mkdir(destination_path)
    items = glob.glob(source_path + '/*')
    for item in items:
        if os.path.isdir(item):
            path = os.path.join(destination_path, item.split('/')[-1])
            files_count += recursive_copy_files(source_path=item, destination_path=path, override=override)
        else:
            file = os.path.join(destination_path, item.split('/')[-1])
            if not os.path.exists(file) or override:
                shutil.copyfile(item, file)
                files_count += 1
    return files_count

Answer 4

import os
import shutil

os.chdir('C:\\')  # make sure you add your source and destination paths below

dir_src = "C:\\foooo\\"
dir_dst = "C:\\toooo\\"

for filename in os.listdir(dir_src):
    if filename.endswith('.txt'):
        shutil.copy(dir_src + filename, dir_dst)
    print(filename)

Answer 5


Here is another example of a recursive copy function that lets you copy the contents of the directory (including sub-directories) one file at a time, which I used to solve this problem.

import os
import shutil

def recursive_copy(src, dest):
    """
    Copy each file from src dir to dest dir, including sub-directories.
    """
    for item in os.listdir(src):
        file_path = os.path.join(src, item)

        # if item is a file, copy it
        if os.path.isfile(file_path):
            shutil.copy(file_path, dest)

        # else if item is a folder, recurse 
        elif os.path.isdir(file_path):
            new_dest = os.path.join(dest, item)
            os.mkdir(new_dest)
            recursive_copy(file_path, new_dest)

EDIT: If you can, definitely just use shutil.copytree(src, dest). This requires that the destination folder does not already exist, though. If you need to copy files into an existing folder, the above method works well!


Python's os.chmod(file, 664) does not change the permission to rw-rw-r-- but to -w--wx---

Question: Python's os.chmod(file, 664) does not change the permission to rw-rw-r-- but to -w--wx---


Recently I was using the Python os module. When I tried to change the permission of a file, I did not get the expected result. For example, I intended to change the permission to rw-rw-r--:

os.chmod("/tmp/test_file", 664)

The resulting permission is actually -w--wx--- (230):

--w--wx--- 1 ag ag 0 Mar 25 05:45 test_file

However, if I change 664 to 0664 in the code, the result is just what I need, e.g.

os.chmod("/tmp/test_file", 0664)

The result is:

-rw-rw-r-- 1 ag ag 0 Mar 25 05:55 test_file

Could anybody help explain why that leading 0 is so important for getting the correct result?


Answer 0

其他论坛上找到了这个

如果您想知道为什么前导零很重要,那是因为将权限设置为八进制整数,Python自动将任何带有前导零的整数视为八进制。因此os.chmod(“ file”,484)(十进制)将给出相同的结果。

您正在做的是通过664八进制的1230

在您的情况下,您将需要

os.chmod("/tmp/test_file", 436)

[更新]请注意,对于Python 3,您的前缀为0o(零哦)。例如,0o666

Found this on a different forum

If you’re wondering why that leading zero is important, it’s because permissions are set as an octal integer, and Python automagically treats any integer with a leading zero as octal. So os.chmod(“file”, 484) (in decimal) would give the same result.

What you are doing is passing 664, which in octal is 1230.

In your case you would need

os.chmod("/tmp/test_file", 436)

[Update] Note that for Python 3 you use the prefix 0o (zero-oh), e.g. 0o666.
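
To make the conversion concrete, here is a small sketch showing that the octal literal, the base-8 string conversion, and the decimal value all denote the same mode bits:

import os

assert 0o664 == 436           # the octal literal equals the decimal value
assert int('664', 8) == 436   # converting the string '664' from base 8
os.chmod("/tmp/test_file", 0o664)  # same effect as os.chmod("/tmp/test_file", 436)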


Answer 1


So for people who want semantics similar to:

$ chmod 755 somefile

Use:

$ python -c "import os; os.chmod('somefile', 0o755)"

If your Python is older than 2.6:

$ python -c "import os; os.chmod('somefile', 0755)"

Answer 2


A leading 0 means this is an octal constant, not a decimal one, and you need an octal value to change the file mode.

Permissions are a bit mask; for example, rwxrwx--- is 111111000 in binary, and it's much easier to group the bits by 3 and convert to octal than to calculate the decimal representation.

0644 (octal) is 0.110.100.100 in binary (I've added the dots for readability), or, as you may calculate, 420 in decimal.
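
A quick interpreter check of the arithmetic above (Python 3 shown):

>>> bin(0o644)  # the permission bit mask in binary
'0b110100100'
>>> oct(420)    # decimal 420 back to octal
'0o644'
>>> 0o644 == 420
True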


Answer 3


Use permission symbols instead of numbers

Your problem would have been avoided if you had used the more semantically named permission symbols rather than raw magic numbers, e.g. for 664:

#!/usr/bin/env python3

import os
import stat

os.chmod(
    'myfile',
    stat.S_IRUSR |
    stat.S_IWUSR |
    stat.S_IRGRP |
    stat.S_IWGRP |
    stat.S_IROTH
)

This is documented at https://docs.python.org/3/library/os.html#os.chmod and the names are the same as the POSIX C API values documented at man 2 stat.

Another advantage is the greater portability as mentioned in the docs:

Note: Although Windows supports chmod(), you can only set the file’s read-only flag with it (via the stat.S_IWRITE and stat.S_IREAD constants or a corresponding integer value). All other bits are ignored.

chmod +x is demonstrated at: How do you do a simple “chmod +x” from within python?

Tested in Ubuntu 16.04, Python 3.5.2.


Answer 4


If you have the desired permissions saved to a string, then do:

s = '660'
os.chmod(file_path, int(s, base=8))

Answer 5


Using the stat.* bit masks does seem to me the most portable and explicit way of doing this. But on the other hand, I often forget how best to handle that. So, here’s an example of masking out the ‘group’ and ‘other’ permissions and leaving ‘owner’ permissions untouched. Using bitmasks and subtraction is a useful pattern.

import os
import stat
def chmodme(pn):
    """Removes 'group' and 'other' perms. Doesn't touch 'owner' perms."""
    mode = os.stat(pn).st_mode
    mode -= (mode & (stat.S_IRWXG | stat.S_IRWXO))
    os.chmod(pn, mode)

Why do I get "Pickle - EOFError: Ran out of input" when reading an empty file?

Question: Why do I get "Pickle - EOFError: Ran out of input" when reading an empty file?


I am getting an interesting error while trying to use Unpickler.load(), here is the source code:

open(target, 'a').close()
scores = {};
with open(target, "rb") as file:
    unpickler = pickle.Unpickler(file);
    scores = unpickler.load();
    if not isinstance(scores, dict):
        scores = {};

Here is the traceback:

Traceback (most recent call last):
File "G:\python\pendu\user_test.py", line 3, in <module>:
    save_user_points("Magix", 30);
File "G:\python\pendu\user.py", line 22, in save_user_points:
    scores = unpickler.load();
EOFError: Ran out of input

The file I am trying to read is empty. How can I avoid getting this error, and get an empty variable instead?


Answer 0


I would check that the file is not empty first:

import os

scores = {} # scores is an empty dict already

if os.path.getsize(target) > 0:      
    with open(target, "rb") as f:
        unpickler = pickle.Unpickler(f)
        # if file is not empty scores will be equal
        # to the value unpickled
        scores = unpickler.load()

Also open(target, 'a').close() is doing nothing in your code and you don’t need to use ;.


Answer 1


Most of the answers here have dealt with how to manage EOFError exceptions, which is really handy if you're unsure about whether the pickled object is empty or not.

However, if you’re surprised that the pickle file is empty, it could be because you opened the filename through ‘wb’ or some other mode that could have over-written the file.

for example:

filename = 'cd.pkl'
with open(filename, 'wb') as f:
    classification_dict = pickle.load(f)

This will over-write the pickled file. You might have done this by mistake before using:

...
open(filename, 'rb') as f:

And then got the EOFError because the previous block of code over-wrote the cd.pkl file.

When working in Jupyter, or in the console (Spyder), I usually write a wrapper over the reading/writing code and call the wrapper subsequently. This avoids common read-write mistakes, and saves a bit of time if you're going to be reading the same file multiple times.
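
A minimal sketch of such a wrapper (the function names are my own, not from the original answer); keeping the modes in one place makes it hard to open a pickle file the wrong way:

import pickle

def save_pickle(obj, filename):
    # always 'wb' when writing a pickle
    with open(filename, 'wb') as f:
        pickle.dump(obj, f)

def load_pickle(filename):
    # always 'rb' when reading a pickle
    with open(filename, 'rb') as f:
        return pickle.load(f)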


Answer 2


As you see, that's actually a natural error.

A typical construct for reading from an Unpickler object would be like this:

try:
    data = unpickler.load()
except EOFError:
    data = list()  # or whatever you want

EOFError is simply raised because it was reading an empty file; it just means end of file.


Answer 3


It is very likely that the pickled file is empty.

It is surprisingly easy to overwrite a pickle file if you’re copying and pasting code.

For example the following writes a pickle file:

pickle.dump(df,open('df.p','wb'))

And if you copied this code to reopen it, but forgot to change 'wb' to 'rb' then you would overwrite the file:

df=pickle.load(open('df.p','wb'))

The correct syntax is

df = pickle.load(open('df.p', 'rb'))

Answer 4

if path.exists(Score_file):
    try:
        with open(Score_file, "rb") as prev_Scr:
            return Unpickler(prev_Scr).load()
    except EOFError:
        return dict()

Answer 5


You can catch that exception and return whatever you want from there.

open(target, 'a').close()
scores = {};
try:
    with open(target, "rb") as file:
        unpickler = pickle.Unpickler(file);
        scores = unpickler.load();
        if not isinstance(scores, dict):
            scores = {};
except EOFError:
    return {}

Answer 6


Note that opening the file in a mode containing 'a' (append) will also produce this error, because the initial file position is at the end of the file:

pointer = open('makeaafile.txt', 'ab+')
tes = pickle.load(pointer, encoding='utf-8')

Read and overwrite a file in Python

Question: Read and overwrite a file in Python


Currently I’m using this:

f = open(filename, 'r+')
text = f.read()
text = re.sub('foobar', 'bar', text)
f.seek(0)
f.write(text)
f.close()

But the problem is that the old file is larger than the new file. So I end up with a new file that has a part of the old file on the end of it.


Answer 0


If you don’t want to close and reopen the file, to avoid race conditions, you could truncate it:

f = open(filename, 'r+')
text = f.read()
text = re.sub('foobar', 'bar', text)
f.seek(0)
f.write(text)
f.truncate()
f.close()

The functionality will likely also be cleaner and safer using open as a context manager, which will close the file handler, even if an error occurs!

with open(filename, 'r+') as f:
    text = f.read()
    text = re.sub('foobar', 'bar', text)
    f.seek(0)
    f.write(text)
    f.truncate()

Answer 1


Probably it would be easier and neater to close the file after text = re.sub('foobar', 'bar', text), re-open it for writing (thus clearing old contents), and write your updated text to it.
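
A minimal sketch of that approach, reusing the names from the question:

import re

with open(filename) as f:
    text = f.read()

text = re.sub('foobar', 'bar', text)

# reopening with 'w' truncates the file, so no stale data is left at the end
with open(filename, 'w') as f:
    f.write(text)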


Answer 2

fileinput模块提供了一种inline模式,用于在不使用临时文件等的情况下将更改写入正在处理的文件。该模块很好地封装了通过对象透明地跟踪文件名来循环遍历文件列表中的行的常见操作,行号等,如果您想在循环中检查它们。

import fileinput
for line in fileinput.FileInput("file",inplace=1):
    if "foobar" in line:
         line=line.replace("foobar","bar")
    print line

The fileinput module has an inplace mode for writing changes to the file you are processing without using temporary files etc. The module nicely encapsulates the common operation of looping over the lines in a list of files, via an object which transparently keeps track of the file name, line number etc if you should want to inspect them inside the loop.

from fileinput import FileInput
for line in FileInput("file", inplace=1):
    line = line.replace("foobar", "bar")
    print(line, end='')  # line keeps its '\n'; end='' avoids writing a blank line after each one

Answer 3


Honestly, you can take a look at this class that I built; it does basic file operations. The write method overwrites, and append keeps the old data.

class IO:
    def read(self, filename):
        toRead = open(filename, "rb")

        out = toRead.read()
        toRead.close()
        
        return out
    
    def write(self, filename, data):
        toWrite = open(filename, "wb")

        out = toWrite.write(data)
        toWrite.close()

    def append(self, filename, data):
        append = self.read(filename)
        self.write(filename, append+data)
        

Answer 4


Try writing it to a new file:

f = open(filename, 'r+')
f2 = open(filename2, 'a+')
text = f.read()
text = re.sub('foobar', 'bar', text)
f.close()
f2.write(text)
f2.close()

Opening files in "rt" and "wt" modes

Question: Opening files in "rt" and "wt" modes


Several times here on SO I’ve seen people using rt and wt modes for reading and writing files.

For example:

with open('input.txt', 'rt') as input_file:
     with open('output.txt', 'wt') as output_file: 
         ...

I don't see the modes documented, but since open() doesn't throw an error, it looks like it's pretty much legal to use.

What is it for and is there any difference between using wt vs w and rt vs r?


Answer 0


t refers to the text mode. There is no difference between r and rt or w and wt since text mode is the default.

Documented here:

Character   Meaning
'r'     open for reading (default)
'w'     open for writing, truncating the file first
'x'     open for exclusive creation, failing if the file already exists
'a'     open for writing, appending to the end of the file if it exists
'b'     binary mode
't'     text mode (default)
'+'     open a disk file for updating (reading and writing)
'U'     universal newlines mode (deprecated)

The default mode is 'r' (open for reading text, synonym of 'rt').
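
In other words, these three calls open the file identically:

open('somefile')        # defaults: read, text mode
open('somefile', 'r')   # read explicit, text mode implied
open('somefile', 'rt')  # read and text mode both explicit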


Answer 1


The t indicates text mode, meaning that \n characters will be translated to the host OS line endings when writing to a file, and back again when reading. The flag is basically just noise, since text mode is the default.

Other than U, those mode flags come directly from the standard C library’s fopen() function, a fact that is documented in the sixth paragraph of the python2 documentation for open().

As far as I know, t is not and has never been part of the C standard, so although many implementations of the C library accept it anyway, there’s no guarantee that they all will, and therefore no guarantee that it will work on every build of python. That explains why the python2 docs didn’t list it, and why it generally worked anyway. The python3 docs make it official.


Answer 2


The ‘r’ is for reading, ‘w’ for writing and ‘a’ is for appending.

The 't' represents text mode, as opposed to binary mode.

Several times here on SO I’ve seen people using rt and wt modes for reading and writing files.

Edit: Are you sure you saw rt and not rb?

These functions generally wrap the fopen function which is described here:

http://www.cplusplus.com/reference/cstdio/fopen/

As you can see it mentions the use of b to open the file in binary mode.

The document link you provided also makes reference to this b mode:

Appending ‘b’ is useful even on systems that don’t treat binary and text files differently, where it serves as documentation.


Answer 3


t indicates text mode.

https://docs.python.org/release/3.1.5/library/functions.html#open

On Linux, there's no difference between text mode and binary mode; on Windows, however, text mode converts \n to \r\n.

http://www.cygwin.com/cygwin-ug-net/using-textbinary.html
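
A small sketch that makes the translation visible (the file name is illustrative). On Windows, a text-mode write stores \r\n on disk, which a binary-mode read then exposes:

with open('demo.txt', 'wt') as f:
    f.write('line1\n')   # text mode: '\n' is written as '\r\n' on Windows

with open('demo.txt', 'rb') as f:
    print(f.read())      # binary mode shows the raw bytes, e.g. b'line1\r\n' on Windows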


In Python, what does "wb" mean in this code?

Question: In Python, what does "wb" mean in this code?


Code:

file('pinax/media/a.jpg', 'wb')

Answer 0


File mode, write and binary. Since you are writing a .jpg file, it looks fine.

But if you are supposed to read that jpg file, you need to use 'rb'.

More info

On Windows, ‘b’ appended to the mode opens the file in binary mode, so there are also modes like ‘rb’, ‘wb’, and ‘r+b’. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files.
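
For example, here is a minimal sketch of copying an image byte for byte (the destination path is illustrative); reading with 'rb' and writing with 'wb' ensures no newline translation can corrupt the data:

with open('pinax/media/a.jpg', 'rb') as src, open('copy.jpg', 'wb') as dst:
    dst.write(src.read())  # raw bytes in, raw bytes out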


Answer 1


The wb indicates that the file is opened for writing in binary mode.

When writing in binary mode, Python makes no changes to data as it is written to the file. In text mode (when the b is excluded as in just w or when you specify text mode with wt), however, Python will encode the text based on the default text encoding. Additionally, Python will convert line endings (\n) to whatever the platform-specific line ending is, which would corrupt a binary file like an exe or png file.

Text mode should therefore be used when writing text files (whether using plain text or a text-based format like CSV), while binary mode must be used when writing non-text files like images.

References:

https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files https://docs.python.org/3/library/functions.html#open


Answer 2


That is the mode with which you are opening the file. “wb” means that you are writing to the file (w), and that you are writing in binary mode (b).

Check out the documentation for more: clicky


Reading huge .csv files

Question: Reading huge .csv files


I’m currently trying to read data from .csv files in Python 2.7 with up to 1 million rows, and 200 columns (files range from 100mb to 1.6gb). I can do this (very slowly) for the files with under 300,000 rows, but once I go above that I get memory errors. My code looks like this:

def getdata(filename, criteria):
    data=[]
    for criterion in criteria:
        data.append(getstuff(filename, criterion))
    return data

def getstuff(filename, criterion):
    import csv
    data=[]
    with open(filename, "rb") as csvfile:
        datareader=csv.reader(csvfile)
        for row in datareader: 
            if row[3]=="column header":
                data.append(row)
            elif len(data)<2 and row[3]!=criterion:
                pass
            elif row[3]==criterion:
                data.append(row)
            else:
                return data

The reason for the else clause in the getstuff function is that all the elements which fit the criterion will be listed together in the csv file, so I leave the loop when I get past them to save time.

My questions are:

  1. How can I manage to get this to work with the bigger files?

  2. Is there any way I can make it faster?

My computer has 8gb RAM, running 64bit Windows 7, and the processor is 3.40 GHz (not certain what information you need).


Answer 0


You are reading all rows into a list, then processing that list. Don’t do that.

Process your rows as you produce them. If you need to filter the data first, use a generator function:

import csv

def getstuff(filename, criterion):
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        count = 0
        for row in datareader:
            if row[3] == criterion:
                yield row
                count += 1
            elif count:
                # done when having read a consecutive series of rows 
                return

I also simplified your filter test; the logic is the same but more concise.

Because you are only matching a single sequence of rows matching the criterion, you could also use:

import csv
from itertools import dropwhile, takewhile

def getstuff(filename, criterion):
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        # first row, plus any subsequent rows that match, then stop
        # reading altogether
        # Python 2: use `for row in takewhile(...): yield row`
        # instead of `yield from takewhile(...)`.
        yield from takewhile(
            lambda r: r[3] == criterion,
            dropwhile(lambda r: r[3] != criterion, datareader))
        return

You can now loop over getstuff() directly. Do the same in getdata():

def getdata(filename, criteria):
    for criterion in criteria:
        for row in getstuff(filename, criterion):
            yield row

Now loop directly over getdata() in your code:

for row in getdata(somefilename, sequence_of_criteria):
    # process row

You now only hold one row in memory, instead of your thousands of lines per criterion.

yield makes a function a generator function, which means it won’t do any work until you start looping over it.
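
A tiny demonstration of that laziness:

def gen():
    print('starting')
    yield 1

g = gen()   # nothing is printed yet; the function body has not run
next(g)     # only now is 'starting' printed and 1 yielded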


Answer 1


Although Martijn's answer is probably best, here is a more intuitive way for beginners to process large csv files. This allows you to process groups of rows, or chunks, at a time.

import pandas as pd
chunksize = 10 ** 8
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)

Answer 2


I do a fair amount of vibration analysis and look at large data sets (tens and hundreds of millions of points). My testing showed the pandas.read_csv() function to be 20 times faster than numpy.genfromtxt(). And the genfromtxt() function is 3 times faster than the numpy.loadtxt(). It seems that you need pandas for large data sets.

I posted the code and data sets I used in this testing on a blog discussing MATLAB vs Python for vibration analysis.


Answer 3


What worked for me, and is super fast, is:

import pandas as pd
import dask.dataframe as dd
import time
t=time.clock()
df_train = dd.read_csv('../data/train.csv', usecols=[col1, col2])
df_train=df_train.compute()
print("load train: " , time.clock()-t)

Another working solution is:

import pandas as pd 
from tqdm import tqdm

PATH = '../data/train.csv'
chunksize = 500000 
traintypes = {
'col1':'category',
'col2':'str'}

cols = list(traintypes.keys())

df_list = [] # list to hold the batch dataframe

for df_chunk in tqdm(pd.read_csv(PATH, usecols=cols, dtype=traintypes, chunksize=chunksize)):
    # Can process each chunk of dataframe here
    # clean_data(), feature_engineer(),fit()

    # Alternatively, append the chunk to list and merge all
    df_list.append(df_chunk) 

# Merge all dataframes into one dataframe
X = pd.concat(df_list)

# Delete the dataframe list to release memory
del df_list
del df_chunk

Answer 4


For anyone who lands on this question: using pandas with 'chunksize' and 'usecols' helped me to read a huge zip file faster than the other proposed options.

import pandas as pd

sample_cols_to_keep =['col_1', 'col_2', 'col_3', 'col_4','col_5']

# First setup dataframe iterator, ‘usecols’ parameter filters the columns, and 'chunksize' sets the number of rows per chunk in the csv. (you can change these parameters as you wish)
df_iter = pd.read_csv('../data/huge_csv_file.csv.gz', compression='gzip', chunksize=20000, usecols=sample_cols_to_keep) 

# this list will store the filtered dataframes for later concatenation 
df_lst = [] 

# Iterate over the file based on the criteria and append to the list
for df_ in df_iter: 
        tmp_df = (df_.rename(columns={col: col.lower() for col in df_.columns})
                  .pipe(lambda x: x[x.col_1 > 0]))  # e.g. keep rows where 'col_1' is greater than zero
        df_lst += [tmp_df.copy()] 

# And finally combine the filtered df_lst into the final larger output, say a 'df_final' dataframe
df_final = pd.concat(df_lst)

Answer 5


Here's another solution for Python 3:

import csv
with open(filename, "r") as csvfile:
    datareader = csv.reader(csvfile)
    count = 0
    for row in datareader:
        if row[3] in ("column header", criterion):
            doSomething(row)
            count += 1
        elif count > 2:
            break

Here, datareader is an iterator, yielding one row at a time.


Answer 6


If you are using pandas and have lots of RAM (enough to read the whole file into memory) try using pd.read_csv with low_memory=False, e.g.:

import pandas as pd
data = pd.read_csv('file.csv', low_memory=False)