Tag Archives: csv

Skip the headers when editing a CSV file with Python

Question: Skip the headers when editing a CSV file with Python


I am using the code referred to below to edit a CSV using Python. The functions called in the code form the upper part of the script.

Problem: I want the code below to start editing the CSV from the 2nd row; I want it to exclude the 1st row, which contains the headers. Right now it is applying the functions to the 1st row as well, and my header row is getting changed.

in_file = open("tmob_notcleaned.csv", "rb")
reader = csv.reader(in_file)
out_file = open("tmob_cleaned.csv", "wb")
writer = csv.writer(out_file)
row = 1
for row in reader:
    row[13] = handle_color(row[10])[1].replace(" - ","").strip()
    row[10] = handle_color(row[10])[0].replace("-","").replace("(","").replace(")","").strip()
    row[14] = handle_gb(row[10])[1].replace("-","").replace(" ","").replace("GB","").strip()
    row[10] = handle_gb(row[10])[0].strip()
    row[9] = handle_oem(row[10])[1].replace("Blackberry","RIM").replace("TMobile","T-Mobile").strip()
    row[15] = handle_addon(row[10])[1].strip()
    row[10] = handle_addon(row[10])[0].replace(" by","").replace("FREE","").strip()
    writer.writerow(row)
in_file.close()    
out_file.close()

I tried to solve this problem by initializing the row variable to 1, but it didn’t work.

Please help me in solving this issue.


Answer 0


Your reader variable is an iterable; by looping over it you retrieve the rows.

To make it skip one item before your loop, simply call next(reader, None) and ignore the return value.

You can also simplify your code a little; use the opened files as context managers to have them closed automatically:

with open("tmob_notcleaned.csv", "rb") as infile, open("tmob_cleaned.csv", "wb") as outfile:
   reader = csv.reader(infile)
   next(reader, None)  # skip the headers
   writer = csv.writer(outfile)
   for row in reader:
       # process each row
       writer.writerow(row)

# no need to close, the files are closed automatically when you get to this point.

If you wanted to write the header to the output file unprocessed, that’s easy too: pass the output of next() to writer.writerow():

headers = next(reader, None)  # returns the headers or `None` if the input is empty
if headers:
    writer.writerow(headers)

Answer 1


Another way of solving this is to use the DictReader class, which “skips” the header row and uses it to allow named indexing.

Given “foo.csv” as follows:

FirstColumn,SecondColumn
asdf,1234
qwer,5678

Use DictReader like this:

import csv
with open('foo.csv') as f:
    reader = csv.DictReader(f, delimiter=',')
    for row in reader:
        print(row['FirstColumn'])  # Access by column header instead of column number
        print(row['SecondColumn'])

Answer 2


Doing row=1 won’t change anything, because you’ll just overwrite that with the results of the loop.

You want to do next(reader) to skip one row.
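
For completeness, a minimal sketch of that fix applied to the question’s setup (the file names come from the question; the cleaning calls are elided):

import csv

in_file = open("tmob_notcleaned.csv", "rb")  # Python 2; on Python 3 use open(..., "r", newline="")
reader = csv.reader(in_file)
out_file = open("tmob_cleaned.csv", "wb")
writer = csv.writer(out_file)

writer.writerow(next(reader))  # consume the header row before the loop and copy it through untouched
for row in reader:
    # ... apply the cleaning functions to row here ...
    writer.writerow(row)

in_file.close()
out_file.close()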


Append a new row to an old csv file in Python

Question: Append a new row to an old csv file in Python


I am trying to add a new row to my old csv file. Basically, it gets updated each time I run the Python script.

Right now I am storing the old csv rows values in a list and then deleting the csv file and creating it again with the new list value.

I wanted to know whether there are any better ways of doing this.


Answer 0


with open('document.csv','a') as fd:
    fd.write(myCsvRow)

Opening a file with the 'a' parameter allows you to append to the end of the file instead of simply overwriting the existing content. Try that.


Answer 1


I prefer this solution using the csv module from the standard library and the with statement to avoid leaving the file open.

The key point is using 'a' for appending when you open the file.

import csv   
fields=['first','second','third']
with open(r'name', 'a') as f:
    writer = csv.writer(f)
    writer.writerow(fields)

If you are using Python 2.7 you may experience superfluous new lines on Windows. You can try to avoid them by using 'ab' instead of 'a'; in Python 3.6, however, this will cause TypeError: a bytes-like object is required, not 'str'. Adding newline='', as Natacha suggests, will cause a backward incompatibility between Python 2 and 3.


Answer 2


Based on the answer of @G M and paying attention to @John La Rooy’s warning, I was able to append a new row by opening the file in 'a' mode.

Even on Windows, in order to avoid the newline problem, you must declare it as newline=''.

Now you can open the file in 'a' mode (without the b).

import csv

with open(r'names.csv', 'a', newline='') as csvfile:
    fieldnames = ['This','aNew']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writerow({'This':'is', 'aNew':'Row'})

I didn’t try with the regular writer (without the Dict), but I think that it’ll be ok too.


Answer 3


Are you opening the file with mode of ‘a’ instead of ‘w’?

See Reading and Writing Files in the python docs

7.2. Reading and Writing Files

open() returns a file object, and is most commonly used with two arguments: open(filename, mode).

>>> f = open('workfile', 'w')
>>> print f <open file 'workfile', mode 'w' at 80a0960>

The first argument is a string containing the filename. The second argument is another string containing a few characters describing the way in which the file will be used. mode can be ‘r’ when the file will only be read, ‘w’ for only writing (an existing file with the same name will be erased), and ‘a’ opens the file for appending; any data written to the file is automatically added to the end. ‘r+’ opens the file for both reading and writing. The mode argument is optional; ‘r’ will be assumed if it’s omitted.

On Windows, ‘b’ appended to the mode opens the file in binary mode, so there are also modes like ‘rb’, ‘wb’, and ‘r+b’. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to append a ‘b’ to the mode, so you can use it platform-independently for all binary files.


Answer 4


If the file exists and contains data, then it is possible to generate the fieldnames parameter for csv.DictWriter automatically:

# read header automatically
with open(myFile, "r") as f:
    reader = csv.reader(f)
    for header in reader:
        break

# add row to CSV file
with open(myFile, "a", newline='') as f:
    writer = csv.DictWriter(f, fieldnames=header)
    writer.writerow(myDict)

Answer 5

# I like using the codecs opening in a with
# (filename and aList are defined elsewhere by the author)
import csv
import codecs

field_names = ['latitude', 'longitude', 'date', 'user', 'text']
with codecs.open(filename, "ab", encoding='utf-8') as logfile:
    logger = csv.DictWriter(logfile, fieldnames=field_names)
    logger.writeheader()

    # some more code stuff

    for video in aList:
        video_result = {}
        video_result['date'] = video['snippet']['publishedAt']
        video_result['user'] = video['id']
        video_result['text'] = video['snippet']['description'].encode('utf8')
        logger.writerow(video_result)

Answer 6


I follow this approach to append a new line to a .csv file:

pose_x = 1 
pose_y = 2

with open('path-to-your-csv-file.csv', mode='a') as file_:
    file_.write("{},{}".format(pose_x, pose_y))
    file_.write("\n")

Keep only the date part when using pandas.to_datetime

Question: Keep only the date part when using pandas.to_datetime


I use pandas.to_datetime to parse the dates in my data. Pandas by default represents the dates with datetime64[ns] even though the dates are all daily only. I wonder whether there is an elegant/clever way to convert the dates to datetime.date or datetime64[D] so that, when I write the data to CSV, the dates are not appended with 00:00:00. I know I can convert the type manually element-by-element:

[dt.to_datetime().date() for dt in df.dates]

But this is really slow since I have many rows and it sort of defeats the purpose of using pandas.to_datetime. Is there a way to convert the dtype of the entire column at once? Or alternatively, does pandas.to_datetime support a precision specification so that I can get rid of the time part while working with daily data?


Answer 0


Since version 0.15.0 this can now be easily done using .dt to access just the date component:

df['just_date'] = df['dates'].dt.date

The above returns an object-dtype column of datetime.date values; if you want a datetime64 column, you can normalize the time component to midnight, which sets all the values to 00:00:00:

df['normalised_date'] = df['dates'].dt.normalize()

This keeps the dtype as datetime64, but the display shows just the date value.


Answer 1


Simple Solution:

df['date_only'] = df['date_time_column'].dt.date

Answer 2

虽然我赞成EdChum的答案,这是对OP提出的问题的最直接答案,但它并不能真正解决性能问题(它仍然依赖于python datetime对象,因此对它们的任何操作都不会被矢量化-即,它会很慢)。

性能更好的替代方法是使用df['dates'].dt.floor('d')。严格来说,它不会“仅保留日期部分”,因为它只是将时间设置为00:00:00。但是它确实可以按OP的要求运行,例如:

  • 打印到屏幕
  • 保存到csv
  • 使用列来 groupby

…并且效率更高,因为该操作已矢量化。

编辑:其实,在OP的宁愿答案很可能是“最近的版本pandas没有时间写为csv如果是00:00:00对所有的意见”。

While I upvoted EdChum’s answer, which is the most direct answer to the question the OP posed, it does not really solve the performance problem (it still relies on python datetime objects, and hence any operation on them will be not vectorized – that is, it will be slow).

A better performing alternative is to use df['dates'].dt.floor('d'). Strictly speaking, it does not “keep only date part”, since it just sets the time to 00:00:00. But it does work as desired by the OP when, for instance:

  • printing to screen
  • saving to csv
  • using the column to groupby

… and it is much more efficient, since the operation is vectorized.

EDIT: in fact, the answer the OP would have preferred is probably “recent versions of pandas do not write the time to csv if it is 00:00:00 for all observations”.


Answer 3


Pandas DatetimeIndex and Series have a method called normalize that does exactly what you want.

You can read more about it in this answer.

It can be used as ser.dt.normalize()
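
A minimal sketch of normalize on a Series (the sample timestamps are made up):

import pandas as pd

ser = pd.to_datetime(pd.Series(['2017-06-01 13:45:00', '2017-06-02 08:30:00']))
print(ser.dt.normalize())
# 0   2017-06-01
# 1   2017-06-02
# dtype: datetime64[ns]  <- times set to midnight; the dtype stays datetime64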


Answer 4


Pandas v0.13+: Use to_csv with date_format parameter

Avoid, where possible, converting your datetime64[ns] series to an object dtype series of datetime.date objects. The latter, often constructed using pd.Series.dt.date, is stored as an array of pointers and is inefficient relative to a pure NumPy-based series.

Since your concern is format when writing to CSV, just use the date_format parameter of to_csv. For example:

df.to_csv(filename, date_format='%Y-%m-%d')

See Python’s strftime directives for formatting conventions.


Answer 5


This is a simple way to extract the date:

import pandas as pd

d='2015-01-08 22:44:09' 
date=pd.to_datetime(d).date()
print(date)

Answer 6


Converting to datetime64[D]:

df.dates.values.astype('M8[D]')

Though re-assigning that to a DataFrame col will revert it back to [ns].

If you wanted actual datetime.date:

# assumes: import datetime, numpy as np, pandas as pd
dt = pd.DatetimeIndex(df.dates)
dates = np.array([datetime.date(*date_tuple) for date_tuple in zip(dt.year, dt.month, dt.day)])

Answer 7


Just giving a more up to date answer in case someone sees this old post.

Adding “utc=False” when converting to datetime will remove the timezone component and keep only the date in a datetime64[ns] data type.

pd.to_datetime(df['Date'], utc=False)

You will be able to save it in excel without getting the error “ValueError: Excel does not support datetimes with timezones. Please ensure that datetimes are timezone unaware before writing to Excel.”


Answer 8


I wanted to be able to change the type for a set of columns in a data frame and then remove the time while keeping the day; round(), floor() and ceil() all work:

df[date_columns] = df[date_columns].apply(pd.to_datetime)
df[date_columns] = df[date_columns].apply(lambda t: t.dt.floor('d'))

Python CSV string to array

Question: Python CSV string to array


Anyone know of a simple library or function to parse a csv encoded string and turn it into an array or dictionary?

I don’t think I want the built-in csv module because in all the examples I’ve seen, it takes file paths, not strings.


Answer 0


You can convert a string to a file object using io.StringIO and then pass that to the csv module:

from io import StringIO
import csv

scsv = """text,with,Polish,non-Latin,letters
1,2,3,4,5,6
a,b,c,d,e,f
gęś,zółty,wąż,idzie,wąską,dróżką,
"""

f = StringIO(scsv)
reader = csv.reader(f, delimiter=',')
for row in reader:
    print('\t'.join(row))

A simpler version, with split() on newlines:

reader = csv.reader(scsv.split('\n'), delimiter=',')
for row in reader:
    print('\t'.join(row))

Or you can simply split() this string into lines using \n as the separator, and then split() each line into values, but this way you must be aware of quoting, so using the csv module is preferred.

On Python 2 you have to import StringIO as

from StringIO import StringIO

instead.


Answer 1


Simple – the csv module works with lists, too:

>>> a=["1,2,3","4,5,6"]  # or a = "1,2,3\n4,5,6".split('\n')
>>> import csv
>>> x = csv.reader(a)
>>> list(x)
[['1', '2', '3'], ['4', '5', '6']]

Answer 2


The official doc for csv.reader() (https://docs.python.org/2/library/csv.html) is very helpful; it says that

file objects and list objects are both suitable

import csv

text = """1,2,3
a,b,c
d,e,f"""

lines = text.splitlines()
reader = csv.reader(lines, delimiter=',')
for row in reader:
    print('\t'.join(row))

Answer 3

>>> a = "1,2"
>>> a
'1,2'
>>> b = a.split(",")
>>> b
['1', '2']

解析CSV文件:

f = open(file.csv, "r")
lines = f.read().split("\n") # "\r\n" if needed

for line in lines:
    if line != "": # add other needed checks to skip titles
        cols = line.split(",")
        print cols
>>> a = "1,2"
>>> a
'1,2'
>>> b = a.split(",")
>>> b
['1', '2']

To parse a CSV file:

f = open(file.csv, "r")
lines = f.read().split("\n") # "\r\n" if needed

for line in lines:
    if line != "": # add other needed checks to skip titles
        cols = line.split(",")
        print cols

Answer 4


As others have already pointed out, Python includes a module to read and write CSV files. It works pretty well as long as the input characters stay within ASCII limits. In case you want to process other encodings, more work is needed.

The Python documentation for the csv module includes an extension of csv.reader, which uses the same interface but can handle other encodings and returns unicode strings. Just copy and paste the code from the documentation. After that, you can process a CSV file like this:

with open("some.csv", "rb") as csvFile: 
    for row in UnicodeReader(csvFile, encoding="iso-8859-15"):
        print row

Answer 5


Per the documentation:

And while the module doesn’t directly support parsing strings, it can easily be done:

import csv
for row in csv.reader(['one,two,three']):
    print row

Just turn your string into a single element list.

Importing StringIO seems a bit excessive to me when this example is explicitly in the docs.


Answer 6


https://docs.python.org/2/library/csv.html?highlight=csv#csv.reader

csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called

Thus, a StringIO.StringIO(), str.splitlines() or even a generator are all good.
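
For instance, a minimal sketch feeding a generator to csv.reader (the sample data is made up):

import csv

def lines():
    yield 'a,b,"c,d"'
    yield '1,2,3'

for row in csv.reader(lines()):
    print(row)
# ['a', 'b', 'c,d']
# ['1', '2', '3']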


Answer 7


Here’s an alternative solution:

>>> import pyexcel as pe
>>> text="""1,2,3
... a,b,c
... d,e,f"""
>>> s = pe.load_from_memory('csv', text)
>>> s
Sheet Name: csv
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
| a | b | c |
+---+---+---+
| d | e | f |
+---+---+---+
>>> s.to_array()
[[u'1', u'2', u'3'], [u'a', u'b', u'c'], [u'd', u'e', u'f']]

Here’s the documentation


Answer 8


Use this to have a csv loaded into a list

import csv

csvfile = open(myfile, 'r')
reader = csv.reader(csvfile, delimiter='\t')
my_list = list(reader)
print my_list
>>>[['1st_line', '0'],
    ['2nd_line', '0']]

Answer 9


Pandas is quite a powerful and smart library for reading CSV in Python.

A simple example here: I have an EXAMPLE.zip file with four files in it.

EXAMPLE.zip
 -- example1.csv
 -- example1.txt
 -- example2.csv
 -- example2.txt

from zipfile import ZipFile
import pandas as pd


filepath = 'EXAMPLE.zip'
file_prefix = filepath[:-4].lower()

zipfile = ZipFile(filepath)
target_file = ''.join([file_prefix, '/', file_prefix, '1', '.csv'])  # '1' must be a string for join

df = pd.read_csv(zipfile.open(target_file))

print(df.head()) # print first five row of csv
print(df[COL_NAME]) # fetch the col_name data

Once you have data you can manipulate to play with a list or other formats.


How do I read a large csv file with pandas?

Question: How do I read a large csv file with pandas?


I am trying to read a large csv file (approx. 6 GB) in pandas and I am getting a memory error:

MemoryError                               Traceback (most recent call last)
<ipython-input-58-67a72687871b> in <module>()
----> 1 data=pd.read_csv('aphro.csv',sep=';')

...

MemoryError: 

Any help on this?


Answer 0


The error shows that the machine does not have enough memory to read the entire CSV into a DataFrame at one time. Assuming you do not need the entire dataset in memory all at one time, one way to avoid the problem would be to process the CSV in chunks (by specifying the chunksize parameter):

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)

The chunksize parameter specifies the number of rows per chunk. (The last chunk may contain fewer than chunksize rows, of course.)
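
As a minimal sketch of what process(chunk) might look like (the filename matches the question; the 'value' column and the filter condition are made up), each chunk can be reduced and the pieces combined afterwards:

import pandas as pd

chunksize = 10 ** 6
pieces = []
for chunk in pd.read_csv('aphro.csv', sep=';', chunksize=chunksize):
    # keep only the rows of interest, so just the reduced data stays in memory
    pieces.append(chunk[chunk['value'] > 0])

data = pd.concat(pieces, ignore_index=True)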


Answer 1


Chunking shouldn’t always be the first port of call for this problem.

  1. Is the file large due to repeated non-numeric data or unwanted columns?

    If so, you can sometimes see massive memory savings by reading in columns as categories and selecting required columns via the pd.read_csv usecols parameter (see the sketch after this list).

  2. Does your workflow require slicing, manipulating, exporting?

    If so, you can use dask.dataframe to slice, perform your calculations and export iteratively. Chunking is performed silently by dask, which also supports a subset of pandas API.

  3. If all else fails, read line by line via chunks.

    Chunk via pandas or via csv library as a last resort.
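
A minimal sketch of the first point (file and column names are hypothetical):

import pandas as pd

# load only the columns that are needed, and store a repetitive string column as a category
df = pd.read_csv('big.csv',
                 usecols=['id', 'state', 'value'],
                 dtype={'state': 'category'})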


Answer 2


I proceeded like this:

chunks=pd.read_table('aphro.csv',chunksize=1000000,sep=';',\
       names=['lat','long','rf','date','slno'],index_col='slno',\
       header=None,parse_dates=['date'])

df=pd.DataFrame()
%time df=pd.concat(chunk.groupby(['lat','long',chunk['date'].map(lambda x: x.year)])['rf'].agg(['sum']) for chunk in chunks)

Answer 3


For large data I recommend you use the library “dask”, e.g.:

# Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv('s3://.../2018-*-*.csv')

You can read more from the documentation here.

Another great alternative would be to use modin because all the functionality is identical to pandas yet it leverages on distributed dataframe libraries such as dask.


Answer 4


The above answer already addresses the topic. Anyway, if you need all the data in memory, have a look at bcolz. It compresses the data in memory. I have had really good experience with it. But it is missing a lot of pandas features.

Edit: I got compression rates of around 1/10 of the original size, I think; of course this depends on the kind of data. Important missing features were aggregates.
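
A minimal sketch, assuming the ctable.fromdataframe constructor from the bcolz package (check the bcolz docs for your version):

import bcolz
import pandas as pd

df = pd.read_csv('aphro.csv', sep=';')
ct = bcolz.ctable.fromdataframe(df)  # compressed in-memory columnar container
print(ct.nbytes, ct.cbytes)          # uncompressed vs. compressed size in bytes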


Answer 5


You can read the data in chunks and save each chunk as a pickle.

import pandas as pd 
import pickle

in_path = "" #Path where the large file is
out_path = "" #Path to save the pickle files to
chunk_size = 400000 #size of chunks relies on your available memory
separator = "~"

reader = pd.read_csv(in_path,sep=separator,chunksize=chunk_size, 
                    low_memory=False)    


for i, chunk in enumerate(reader):
    out_file = out_path + "/data_{}.pkl".format(i+1)
    with open(out_file, "wb") as f:
        pickle.dump(chunk,f,pickle.HIGHEST_PROTOCOL)

In the next step you read in the pickles and append each pickle to your desired dataframe.

import glob
pickle_path = "" #Same Path as out_path i.e. where the pickle files are

data_p_files=[]
for name in glob.glob(pickle_path + "/data_*.pkl"):
   data_p_files.append(name)


df = pd.DataFrame([])
for i in range(len(data_p_files)):
    df = df.append(pd.read_pickle(data_p_files[i]),ignore_index=True)

Answer 6


The functions read_csv and read_table are almost the same. But you must assign the delimiter “,” when you use the function read_table in your program.

def get_from_action_data(fname, chunk_size=100000):
    reader = pd.read_csv(fname, header=0, iterator=True)
    chunks = []
    loop = True
    while loop:
        try:
            chunk = reader.get_chunk(chunk_size)[["user_id", "type"]]
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped")

    df_ac = pd.concat(chunks, ignore_index=True)

Answer 7


Solution 1:

Using pandas with large data

Solution 2:

TextFileReader = pd.read_csv(path, chunksize=1000)  # the number of rows per chunk

dfList = []
for df in TextFileReader:
    dfList.append(df)

df = pd.concat(dfList,sort=False)

Answer 8


Here follows an example:

chunkTemp = []
queryTemp = []
query = pd.DataFrame()

for chunk in pd.read_csv(file, header=0, chunksize=<your_chunksize>, iterator=True, low_memory=False):

    #REPLACING BLANK SPACES AT COLUMNS' NAMES FOR SQL OPTIMIZATION
    chunk = chunk.rename(columns = {c: c.replace(' ', '') for c in chunk.columns})

    #YOU CAN EITHER: 
    #1)BUFFER THE CHUNKS IN ORDER TO LOAD YOUR WHOLE DATASET 
    chunkTemp.append(chunk)

    #2)DO YOUR PROCESSING OVER A CHUNK AND STORE THE RESULT OF IT
    query = chunk[chunk[<column_name>].str.startswith(<some_pattern>)]   
    #BUFFERING PROCESSED DATA
    queryTemp.append(query)

#!  NEVER DO pd.concat OR pd.DataFrame() INSIDE A LOOP
print("Database: CONCATENATING CHUNKS INTO A SINGLE DATAFRAME")
chunk = pd.concat(chunkTemp)
print("Database: LOADED")

#CONCATENATING PROCESSED DATA
query = pd.concat(queryTemp)
print(query)

Answer 9


You can try sframe, which has the same syntax as pandas but allows you to manipulate files that are bigger than your RAM.
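
A minimal sketch, assuming the SFrame.read_csv constructor from the sframe package (the project is no longer maintained, so check the docs of your version):

import sframe

sf = sframe.SFrame.read_csv('aphro.csv', delimiter=';')  # backed by disk rather than RAM
print(sf.head())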


Answer 10


If you use pandas to read a large file in chunks and then yield them one by one, here is what I have done:

import pandas as pd

def chunck_generator(filename, header=False, chunk_size=10 ** 5):
    for chunk in pd.read_csv(filename, delimiter=',', iterator=True, chunksize=chunk_size, parse_dates=[1]):
        yield chunk

def _generator(filename, header=False, chunk_size=10 ** 5):
    # pass the requested chunk size through instead of hard-coding it
    chunks = chunck_generator(filename, header=header, chunk_size=chunk_size)
    for chunk in chunks:
        yield chunk

if __name__ == "__main__":
    filename = r'file.csv'
    generator = _generator(filename=filename)
    for chunk in generator:
        print(chunk)

Answer 11


I want to make a more comprehensive answer based on most of the potential solutions that are already provided. I also want to point out one more potential aid that may help the reading process.

Option 1: dtypes

“dtypes” is a pretty powerful parameter that you can use to reduce the memory pressure of read methods. See this and this answer. Pandas, by default, tries to infer the dtypes of the data.

Whenever data is stored in a data structure, a memory allocation takes place. At a basic level, refer to the values below (the table illustrates values for the C programming language):

The maximum value of UNSIGNED CHAR = 255                                    
The minimum value of SHORT INT = -32768                                     
The maximum value of SHORT INT = 32767                                      
The minimum value of INT = -2147483648                                      
The maximum value of INT = 2147483647                                       
The minimum value of CHAR = -128                                            
The maximum value of CHAR = 127                                             
The minimum value of LONG = -9223372036854775808                            
The maximum value of LONG = 9223372036854775807

Refer to this page to see the matching between NumPy and C types.

Let’s say you have an array of integers made up of single digits. You could, both theoretically and practically, store it as, say, an array of 16-bit integers, but you would then allocate more memory than you actually need to store that array. To prevent this, you can set the dtype option on read_csv. You do not want to store the array items as long integers where you can actually fit them into 8-bit integers (np.int8 or np.uint8).

[Figure: dtype map — source: https://pbpython.com/pandas_dtypes.html]

You can pass the dtype parameter to pandas read methods as a dict, in the form {column: type}:

import numpy as np
import pandas as pd

df_dtype = {
        "column_1": int,
        "column_2": str,
        "column_3": np.int16,
        "column_4": np.uint8,
        ...
        "column_n": np.float32
}

df = pd.read_csv('path/to/file', dtype=df_dtype)

Option 2: Read by Chunks

Reading the data in chunks allows you to access a part of the data in-memory, and you can apply preprocessing on your data and preserve the processed data rather than raw data. It’d be much better if you combine this option with the first one, dtypes.

I want to point out the pandas cookbook sections for that process, where you can find it here. Note those two sections there;

Option 3: Dask

Dask is a framework that is defined in Dask’s website as:

Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love

It was born to cover the necessary parts that pandas cannot reach. Dask is a powerful framework that gives you access to much more data by processing it in a distributed way.

You can use dask to preprocess your data as a whole; Dask takes care of the chunking part, so unlike pandas you can just define your processing steps and let Dask do the work. Dask does not apply the computations until they are explicitly pushed by compute and/or persist (see the answer here for the difference).
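
A minimal sketch of that lazy-then-compute flow (file and column names are hypothetical):

import dask.dataframe as dd

ddf = dd.read_csv('big-*.csv')                 # lazy: nothing is read yet
result = ddf.groupby('state')['value'].mean()  # still lazy
print(result.compute())                        # the chunked work happens here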

Other Aids (Ideas)

  • ETL flow designed for the data. Keeping only what is needed from the raw data.
    • First, apply ETL to whole data with frameworks like Dask or PySpark, and export the processed data.
    • Then see if the processed data can be fit in the memory as a whole.
  • Consider increasing your RAM.
  • Consider working with that data on a cloud platform.

Answer 12


In addition to the answers above, for those who want to process a CSV and then export to csv, parquet or SQL, d6tstack is another good option. You can load multiple files, and it deals with data schema changes (added/removed columns). Chunked out-of-core support is already built in.

import glob
import d6tstack.combine_csv

def apply(dfg):
    # do stuff
    return dfg

c = d6tstack.combine_csv.CombinerCSV(['bigfile.csv'], apply_after_read=apply, sep=',', chunksize=1e6)

# or
c = d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), apply_after_read=apply, chunksize=1e6)

# output to various formats, automatically chunked to reduce memory consumption
c.to_csv_combine(filename='out.csv')
c.to_parquet_combine(filename='out.pq')
c.to_psql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # fast for postgres
c.to_mysql_combine('mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename') # fast for mysql
c.to_sql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # slow but flexible

Answer 13


In case someone is still looking for something like this, I found that this new library called modin can help. It uses distributed computing that can help with the read. Here’s a nice article comparing its functionality with pandas. It essentially uses the same functions as pandas.

import modin.pandas as pd
pd.read_csv(CSV_FILE_NAME)

Answer 14


Before using the chunksize option, if you want to be sure about the process function that you want to write inside the chunking for-loop mentioned by @unutbu, you can simply use the nrows option.

small_df = pd.read_csv(filename, nrows=100)

Once you are sure that the process block is ready, you can put that in the chunking for loop for the entire dataframe.


Python import CSV to list

Question: Python import CSV to list


I have a CSV file with about 2000 records.

Each record has a string and a category:

This is the first line,Line1
This is the second line,Line2
This is the third line,Line3

I need to read this file into a list that looks like this:

data = [('This is the first line', 'Line1'),
        ('This is the second line', 'Line2'),
        ('This is the third line', 'Line3')]

How can I import this CSV into the list I need using Python?


Answer 0


Using the csv module:

import csv

with open('file.csv', newline='') as f:
    reader = csv.reader(f)
    data = list(reader)

print(data)

Output:

[['This is the first line', 'Line1'], ['This is the second line', 'Line2'], ['This is the third line', 'Line3']]

If you need tuples:

import csv

with open('file.csv', newline='') as f:
    reader = csv.reader(f)
    data = [tuple(row) for row in reader]

print(data)

Output:

[('This is the first line', 'Line1'), ('This is the second line', 'Line2'), ('This is the third line', 'Line3')]

Old Python 2 answer, also using the csv module:

import csv
with open('file.csv', 'rb') as f:
    reader = csv.reader(f)
    your_list = list(reader)

print your_list
# [['This is the first line', 'Line1'],
#  ['This is the second line', 'Line2'],
#  ['This is the third line', 'Line3']]

Answer 1


Updated for Python 3:

import csv

with open('file.csv', newline='') as f:
    reader = csv.reader(f)
    your_list = list(reader)

print(your_list)

Output:

[['This is the first line', 'Line1'], ['This is the second line', 'Line2'], ['This is the third line', 'Line3']]

Answer 2


Pandas is pretty good at dealing with data. Here is one example of how to use it:

import pandas as pd

# Read the CSV into a pandas data frame (df)
#   With a df you can do many things
#   most important: visualize data with Seaborn
df = pd.read_csv('filename.csv', delimiter=',')

# Or export it in many ways, e.g. a list of tuples
tuples = [tuple(x) for x in df.values]

# or export it as a list of dicts
dicts = df.to_dict().values()

One big advantage is that pandas deals automatically with header rows.

If you haven’t heard of Seaborn, I recommend having a look at it.

See also: How do I read and write CSV files with Python?

Pandas #2

import pandas as pd

# Get data - reading the CSV file
import mpu.pd
df = mpu.pd.example_df()

# Convert
dicts = df.to_dict('records')

The content of df is:

     country   population population_time    EUR
0    Germany   82521653.0      2016-12-01   True
1     France   66991000.0      2017-01-01   True
2  Indonesia  255461700.0      2017-01-01  False
3    Ireland    4761865.0             NaT   True
4      Spain   46549045.0      2017-06-01   True
5    Vatican          NaN             NaT   True

The content of dicts is

[{'country': 'Germany', 'population': 82521653.0, 'population_time': Timestamp('2016-12-01 00:00:00'), 'EUR': True},
 {'country': 'France', 'population': 66991000.0, 'population_time': Timestamp('2017-01-01 00:00:00'), 'EUR': True},
 {'country': 'Indonesia', 'population': 255461700.0, 'population_time': Timestamp('2017-01-01 00:00:00'), 'EUR': False},
 {'country': 'Ireland', 'population': 4761865.0, 'population_time': NaT, 'EUR': True},
 {'country': 'Spain', 'population': 46549045.0, 'population_time': Timestamp('2017-06-01 00:00:00'), 'EUR': True},
 {'country': 'Vatican', 'population': nan, 'population_time': NaT, 'EUR': True}]

Pandas #3

import pandas as pd

# Get data - reading the CSV file
import mpu.pd
df = mpu.pd.example_df()

# Convert
lists = [[row[col] for col in df.columns] for row in df.to_dict('records')]

The content of lists is:

[['Germany', 82521653.0, Timestamp('2016-12-01 00:00:00'), True],
 ['France', 66991000.0, Timestamp('2017-01-01 00:00:00'), True],
 ['Indonesia', 255461700.0, Timestamp('2017-01-01 00:00:00'), False],
 ['Ireland', 4761865.0, NaT, True],
 ['Spain', 46549045.0, Timestamp('2017-06-01 00:00:00'), True],
 ['Vatican', nan, NaT, True]]

Answer 3


Update for Python 3:

import csv
from pprint import pprint

with open('text.csv', newline='') as file:
    reader = csv.reader(file)
    res = list(map(tuple, reader))

pprint(res)

Output:

[('This is the first line', ' Line1'),
 ('This is the second line', ' Line2'),
 ('This is the third line', ' Line3')]

If csvfile is a file object, it should be opened with newline='' (see the csv module documentation).


Answer 4


If you are sure there are no commas in your input other than the one separating the category, you can read the file line by line, split on ',', and push the result to a list, as sketched below.

That said, it looks like you are looking at a CSV file, so you might consider using the csv module for it.
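
A minimal sketch of that naive approach (the filename is hypothetical):

data = []
with open('data.csv') as f:
    for line in f:
        text, cat = line.rstrip('\n').split(',', 1)
        data.append((text, cat))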


Answer 5

result = []
for line in text.splitlines():
    result.append(tuple(line.split(",")))

Answer 6


As said already in the comments, you can use the csv library in Python. csv means comma-separated values, which seems to be exactly your case: a label and a value separated by a comma.

Given that you have a category and a value, I would rather use a dictionary type instead of a list of tuples.

Anyway in the code below I show both ways: d is the dictionary and l is the list of tuples.

import csv

file_name = "test.txt"
try:
    csvfile = open(file_name, 'rt')
except IOError:
    # stop here rather than continue with an undefined csvfile
    raise SystemExit("File not found")
csvReader = csv.reader(csvfile, delimiter=",")
d = dict()
l = list()
for row in csvReader:
    d[row[1]] = row[0]
    l.append((row[0], row[1]))
print(d)
print(l)

Answer 7


A simple loop would suffice:

lines = []
with open('test.txt', 'r') as f:
    for line in f.readlines():
        l,name = line.strip().split(',')
        lines.append((l,name))

print lines

Answer 8


Unfortunately I find none of the existing answers particularly satisfying.

Here is a straightforward and complete Python 3 solution, using the csv module.

import csv

with open('../resources/temp_in.csv', newline='') as f:
    reader = csv.reader(f, skipinitialspace=True)
    rows = list(reader)

print(rows)

Notice the skipinitialspace=True argument. This is necessary since, unfortunately, OP’s CSV contains whitespace after each comma.

Output:

[['This is the first line', 'Line1'], ['This is the second line', 'Line2'], ['This is the third line', 'Line3']]

Answer 9


Extending your requirements a bit and assuming you do not care about the order of lines and want to get them grouped under categories, the following solution may work for you:

>>> fname = "lines.txt"
>>> from collections import defaultdict
>>> dct = defaultdict(list)
>>> with open(fname) as f:
...     for line in f:
...         text, cat = line.rstrip("\n").split(",", 1)
...         dct[cat].append(text)
...
>>> dct
defaultdict(<type 'list'>, {' CatA': ['This is the first line', 'This is the another line'], ' CatC': ['This is the third line'], ' CatB': ['This is the second line', 'This is the last line']})

This way you get all the relevant lines available in the dictionary, with the category as the key.


Answer 10


Here is the easiest way in Python 3.x to import a CSV to a multidimensional array, and it's only 4 lines of code without importing anything!

#pull a CSV into a multidimensional array in 4 lines!

L=[]                            #Create an empty list for the main array
for line in open('log.txt'):    #Open the file and read all the lines
    x=line.rstrip()             #Strip the \n from each line
    L.append(x.split(','))      #Split each line into a list and add it to the
                                #Multidimensional array
print(L)

Answer 11

Next is a piece of code which uses the csv module but extracts the file.csv contents to a list of dicts, using the first line, which is the header of the csv table.

import csv
def csv2dicts(filename):
  with open(filename, newline='') as f:
    reader = csv.reader(f)
    lines = list(reader)
    if len(lines) < 2: return None
    names = lines[0]
    if len(names) < 1: return None
    dicts = []
    for values in lines[1:]:
      if len(values) != len(names): return None
      d = {}
      for i, name in enumerate(names):
        d[name] = values[i]
      dicts.append(d)
    return dicts

if __name__ == '__main__':
  your_list = csv2dicts('file.csv')
  print(your_list)

How do I convert JSON to CSV?

Question: How do I convert JSON to CSV?

I have a JSON file I want to convert to a CSV file. How can I do this with Python?

I tried:

import json
import csv

f = open('data.json')
data = json.load(f)
f.close()

f = open('data.csv')
csv_file = csv.writer(f)
for item in data:
    csv_file.writerow(item)

f.close()

However, it did not work. I am using Django and the error I received is:

'file' object has no attribute 'writerow'

I then tried the following:

import json
import csv

f = open('data.json')
data = json.load(f)
f.close()

f = open('data.csv')
csv_file = csv.writer(f)
for item in data:
    f.writerow(item)  # ← changed

f.close()

I then get the error:

sequence expected

Sample json file:

[{
        "pk": 22,
        "model": "auth.permission",
        "fields": {
            "codename": "add_logentry",
            "name": "Can add log entry",
            "content_type": 8
        }
    }, {
        "pk": 23,
        "model": "auth.permission",
        "fields": {
            "codename": "change_logentry",
            "name": "Can change log entry",
            "content_type": 8
        }
    }, {
        "pk": 24,
        "model": "auth.permission",
        "fields": {
            "codename": "delete_logentry",
            "name": "Can delete log entry",
            "content_type": 8
        }
    }, {
        "pk": 4,
        "model": "auth.permission",
        "fields": {
            "codename": "add_group",
            "name": "Can add group",
            "content_type": 2
        }
    }, {
        "pk": 10,
        "model": "auth.permission",
        "fields": {
            "codename": "add_message",
            "name": "Can add message",
            "content_type": 4
        }
    }
]

Answer 0

First, your JSON has nested objects, so it normally cannot be directly converted to CSV. You need to change that to something like this:

{
    "pk": 22,
    "model": "auth.permission",
    "codename": "add_logentry",
    "content_type": 8,
    "name": "Can add log entry"
},
......]

Here is my code to generate CSV from that:

import csv
import json

x = """[
    {
        "pk": 22,
        "model": "auth.permission",
        "fields": {
            "codename": "add_logentry",
            "name": "Can add log entry",
            "content_type": 8
        }
    },
    {
        "pk": 23,
        "model": "auth.permission",
        "fields": {
            "codename": "change_logentry",
            "name": "Can change log entry",
            "content_type": 8
        }
    },
    {
        "pk": 24,
        "model": "auth.permission",
        "fields": {
            "codename": "delete_logentry",
            "name": "Can delete log entry",
            "content_type": 8
        }
    }
]"""

x = json.loads(x)

with open("test.csv", "w", newline="") as out_file:
    f = csv.writer(out_file)

    # Write the CSV header; remove this line if you don't need it
    f.writerow(["pk", "model", "codename", "name", "content_type"])

    for item in x:
        f.writerow([item["pk"],
                    item["model"],
                    item["fields"]["codename"],
                    item["fields"]["name"],
                    item["fields"]["content_type"]])

You will get output as:

pk,model,codename,name,content_type
22,auth.permission,add_logentry,Can add log entry,8
23,auth.permission,change_logentry,Can change log entry,8
24,auth.permission,delete_logentry,Can delete log entry,8

Answer 1

With the pandas library, this is as easy as using two commands!

pandas.read_json()

To convert a JSON string to a pandas object (either a series or dataframe). Then, assuming the results were stored as df:

df.to_csv()

Which can either return a string or write directly to a csv-file.

Based on the verbosity of previous answers, we should all thank pandas for the shortcut.
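
For example, a minimal sketch against the sample data.json from the question (pd.json_normalize flattens the nested "fields" object into fields.* columns):

import json
import pandas as pd

with open("data.json") as f:
    data = json.load(f)

df = pd.json_normalize(data)       # list of nested objects -> flat DataFrame
df.to_csv("data.csv", index=False)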


Answer 2

I am assuming that your JSON file will decode into a list of dictionaries. First we need a function which will flatten the JSON objects:

def flattenjson( b, delim ):
    val = {}
    for i in b.keys():
        if isinstance( b[i], dict ):
            get = flattenjson( b[i], delim )
            for j in get.keys():
                val[ i + delim + j ] = get[j]
        else:
            val[i] = b[i]

    return val

The result of running this snippet on your JSON object:

flattenjson( {
    "pk": 22, 
    "model": "auth.permission", 
    "fields": {
      "codename": "add_message", 
      "name": "Can add message", 
      "content_type": 8
    }
  }, "__" )

is

{
    "pk": 22, 
    "model": "auth.permission', 
    "fields__codename": "add_message", 
    "fields__name": "Can add message", 
    "fields__content_type": 8
}

After applying this function to each dict in the input array of JSON objects:

input = list(map(lambda x: flattenjson(x, "__"), input))  # list() so the result can be reused on Python 3

and finding the relevant column names:

columns = [ x for row in input for x in row.keys() ]
columns = list( set( columns ) )

it’s not hard to run this through the csv module:

with open(fname, 'w', newline='') as out_file:
    csv_w = csv.writer(out_file)
    csv_w.writerow(columns)

    for i_r in input:
        csv_w.writerow(map(lambda x: i_r.get(x, ""), columns))

I hope this helps!


Answer 3

JSON can represent a wide variety of data structures — a JS “object” is roughly like a Python dict (with string keys), a JS “array” roughly like a Python list, and you can nest them as long as the final “leaf” elements are numbers or strings.

CSV can essentially represent only a 2-D table — optionally with a first row of “headers”, i.e., “column names”, which can make the table interpretable as a list of dicts, instead of the normal interpretation, a list of lists (again, “leaf” elements can be numbers or strings).

So, in the general case, you can't translate an arbitrary JSON structure to a CSV. In a few special cases you can (an array of arrays with no further nesting; an array of objects which all have exactly the same keys). Which special case, if any, applies to your problem? The details of the solution depend on which special case you do have. Given the astonishing fact that you don't even mention which one applies, I suspect you may not have considered the constraint, that neither usable case in fact applies, and that your problem is impossible to solve. But please do clarify!
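
To illustrate the second special case (an array of objects which all have exactly the same keys), a minimal sketch with made-up data:

import csv
import json

rows = json.loads('[{"a": 1, "b": 2}, {"a": 3, "b": 4}]')  # uniform keys -> maps cleanly to a 2-D table

with open("out.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)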


Answer 4

A generic solution which translates any json list of flat objects to csv.

Pass the input.json file as the first argument on the command line.

import csv, json, sys

input = open(sys.argv[1])
data = json.load(input)
input.close()

output = csv.writer(sys.stdout)

output.writerow(data[0].keys())  # header row

for row in data:
    output.writerow(row.values())
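
For example, if the script is saved as json2csv.py (a filename chosen here for illustration), the CSV can be captured from stdout with a redirect:

python json2csv.py input.json > output.csv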

Answer 5

This code should work for you, assuming that your JSON data is in a file called data.json.

import json
import csv

with open("data.json") as file:
    data = json.load(file)

with open("data.csv", "w") as file:
    csv_file = csv.writer(file)
    for item in data:
        fields = list(item['fields'].values())
        csv_file.writerow([item['pk'], item['model']] + fields)

Answer 6

It'll be easy to use csv.DictWriter(); the detailed implementation can look like this:

import csv
import json

def read_json(filename):
    with open(filename) as f:
        return json.load(f)

def write_csv(data, filename):
    with open(filename, 'w', newline='') as outf:
        writer = csv.DictWriter(outf, data[0].keys())
        writer.writeheader()
        for row in data:
            writer.writerow(row)

# implement
write_csv(read_json('test.json'), 'output.csv')

Note that this assumes that all of your JSON objects have the same fields.



Answer 7

I was having trouble with Dan’s proposed solution, but this worked for me:

import json
import csv 

f = open('test.json')
data = json.load(f)
f.close()

f = csv.writer(open('test.csv', 'w', newline=''))

for item in data:
  f.writerow([item['pk'], item['model']] + list(item['fields'].values()))

Where “test.json” contained the following:

[ 
{"pk": 22, "model": "auth.permission", "fields": 
  {"codename": "add_logentry", "name": "Can add log entry", "content_type": 8 } }, 
{"pk": 23, "model": "auth.permission", "fields": 
  {"codename": "change_logentry", "name": "Can change log entry", "content_type": 8 } }, {"pk": 24, "model": "auth.permission", "fields": 
  {"codename": "delete_logentry", "name": "Can delete log entry", "content_type": 8 } }
]

Answer 8

Use json_normalize from pandas:

  • Given the data provided, in a file named test.json.
  • encoding='utf-8' may not be necessary.
  • The following code takes advantage of the pathlib library.
  • .open is a method of pathlib.
  • Works with non-Windows paths too.
import pandas as pd
# As of pandas 1.0, json_normalize is exposed in the top-level namespace;
# pandas.io.json.json_normalize is deprecated.
from pathlib import Path
import json

# set path to file
p = Path(r'c:\some_path_to_file\test.json')

# read json
with p.open('r', encoding='utf-8') as f:
    data = json.loads(f.read())

# create dataframe
df = pd.json_normalize(data)

# dataframe view:
#  pk            model  fields.codename           fields.name  fields.content_type
#  22  auth.permission     add_logentry     Can add log entry                    8
#  23  auth.permission  change_logentry  Can change log entry                    8
#  24  auth.permission  delete_logentry  Can delete log entry                    8
#   4  auth.permission        add_group         Can add group                    2
#  10  auth.permission      add_message       Can add message                    4

# save to csv
df.to_csv('test.csv', index=False, encoding='utf-8')

CSV Output:

pk,model,fields.codename,fields.name,fields.content_type
22,auth.permission,add_logentry,Can add log entry,8
23,auth.permission,change_logentry,Can change log entry,8
24,auth.permission,delete_logentry,Can delete log entry,8
4,auth.permission,add_group,Can add group,2
10,auth.permission,add_message,Can add message,4


Answer 9

As mentioned in the previous answers, the difficulty in converting json to csv is that a json file can contain nested dictionaries and therefore be a multidimensional data structure, versus a csv which is a 2D data structure. However, a good way to turn a multidimensional structure into csv is to have multiple csvs that tie together with primary keys.

In your example, the first csv output has the columns "pk", "model", "fields". Values for "pk" and "model" are easy to get, but because the "fields" column contains a dictionary, it should be its own csv, and because "codename" appears to be the primary key, you can use it as the input for "fields" to complete the first csv. The second csv contains the dictionary from the "fields" column, with codename as the primary key that can be used to tie the 2 csvs together.

Here is a solution for your json file which converts the nested dictionaries into 2 csvs.

import csv
import json

def readAndWrite(inputFileName, primaryKey=""):
    with open(inputFileName + ".json") as inp:
        data = json.load(inp)

    header = set()

    if primaryKey != "":
        outputFileName = inputFileName + "-" + primaryKey
        if inputFileName == "data":
            for i in data:
                for j in i["fields"].keys():
                    header.add(j)
    else:
        outputFileName = inputFileName
        for i in data:
            for j in i.keys():
                header.add(j)

    with open(outputFileName + ".csv", 'w', newline='') as output_file:
        fieldnames = list(header)
        writer = csv.DictWriter(output_file, fieldnames, delimiter=',', quotechar='"')
        writer.writeheader()
        for x in data:
            row_value = {}
            if primaryKey == "":
                for y in x.keys():
                    yValue = x.get(y)
                    if not isinstance(yValue, dict):
                        row_value[y] = yValue
                    elif inputFileName == "data":
                        # keep only the nested dict's primary key in the parent csv,
                        # then emit the nested dicts as their own csv
                        row_value[y] = yValue["codename"]
                        readAndWrite(inputFileName, primaryKey="codename")
                writer.writerow(row_value)
            elif primaryKey == "codename":
                for y in x["fields"].keys():
                    yValue = x["fields"].get(y)
                    if not isinstance(yValue, dict):
                        row_value[y] = yValue
                writer.writerow(row_value)

readAndWrite("data")

Answer 10

I know it has been a long time since this question was asked, but I thought I might add to everyone else's answers and share a blog post that I think explains the solution in a very concise way.

Open a file for writing

employ_data = open('/tmp/EmployData.csv', 'w')

Create the csv writer object

csvwriter = csv.writer(employ_data)
count = 0
for emp in emp_data:   # emp_data: the list of dicts loaded from your JSON
    if count == 0:
        header = emp.keys()
        csvwriter.writerow(header)
        count += 1
    csvwriter.writerow(emp.values())

Make sure to close the file in order to save the contents

employ_data.close()
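
Assembled into one runnable sketch (the JSON-loading step and the EmployData.json filename are assumptions added here for completeness):

import csv
import json

with open('/tmp/EmployData.json') as f:   # hypothetical input file
    emp_data = json.load(f)

with open('/tmp/EmployData.csv', 'w', newline='') as employ_data:
    csvwriter = csv.writer(employ_data)
    csvwriter.writerow(emp_data[0].keys())   # header from the first record
    for emp in emp_data:
        csvwriter.writerow(emp.values())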

Answer 11

It is not a very smart way to do it, but I have had the same problem and this worked for me:

import csv
import json

with open('data.json') as f:
    data = json.load(f)

new_data = []

for i in data:
    flat = {}
    for n in i.keys():
        try:
            if len(i[n].keys()) > 0:
                for ii in i[n].keys():
                    flat[n + "_" + ii] = i[n][ii]
        except AttributeError:   # i[n] is not a dict
            flat[n] = i[n]
    new_data.append(flat)

with open('data.csv', 'w', newline='') as f:   # note: write mode, not "r"
    writer = csv.DictWriter(f, new_data[0].keys())
    writer.writeheader()
    for row in new_data:
        writer.writerow(row)

Answer 12

Alec’s answer is great, but it doesn’t work in the case where there are multiple levels of nesting. Here’s a modified version that supports multiple levels of nesting. It also makes the header names a bit nicer if the nested object already specifies its own key (e.g. Firebase Analytics / BigTable / BigQuery data):

"""Converts JSON with nested fields into a flattened CSV file.
"""

import sys
import json
import csv
import os

import jsonlines

from orderedset import OrderedSet

# from https://stackoverflow.com/a/28246154/473201
def flattenjson( b, prefix='', delim='/', val=None ):
  if val is None:
    val = {}

  if isinstance( b, dict ):
    for j in b.keys():
      flattenjson(b[j], prefix + delim + j, delim, val)
  elif isinstance( b, list ):
    get = b
    for j in range(len(get)):
      key = str(j)

      # If the nested data contains its own key, use that as the header instead.
      if isinstance( get[j], dict ):
        if 'key' in get[j]:
          key = get[j]['key']

      flattenjson(get[j], prefix + delim + key, delim, val)
  else:
    val[prefix] = b

  return val

def main(argv):
  if len(argv) < 2:
    raise Error('Please specify a JSON file to parse')

  print "Loading and Flattening..."
  filename = argv[1]
  allRows = []
  fieldnames = OrderedSet()
  with jsonlines.open(filename) as reader:
    for obj in reader:
      # print 'orig:\n'
      # print obj
      flattened = flattenjson(obj)
      #print 'keys: %s' % flattened.keys()
      # print 'flattened:\n'
      # print flattened
      fieldnames.update(flattened.keys())
      allRows.append(flattened)

  print "Exporting to CSV..."
  outfilename = filename + '.csv'
  count = 0
  with open(outfilename, 'w') as file:
    csvwriter = csv.DictWriter(file, fieldnames=fieldnames)
    csvwriter.writeheader()
    for obj in allRows:
      # print 'allRows:\n'
      # print obj
      csvwriter.writerow(obj)
      count += 1

  print "Wrote %d rows" % count



if __name__ == '__main__':
  main(sys.argv)

Answer 13

This works relatively well. It flattens the json in order to write it to a csv file; nested elements are managed :)

This is for Python 3:

import json

o = json.loads('your json string') # Be careful, o must be a list; each of its objects will make a line of the csv.

def flatten(o, k='/'):
    global l, c_line
    if isinstance(o, dict):
        for key, value in o.items():
            flatten(value, k + '/' + key)
    elif isinstance(o, list):
        for ov in o:
            flatten(ov, '')
    elif isinstance(o, str):
        o = o.replace('\r', ' ').replace('\n', ' ').replace(';', ',')
        if k not in l:
            l[k] = {}
        l[k][c_line] = o

def render_csv(l):
    # header row
    for k in l:
        print('%s;' % k, end='')
    print()
    # data rows
    for i in range(c_line):
        for k in l:
            try:
                print('%s;' % l[k][i], end='')
            except KeyError:   # this column has no value on this line
                print(';', end='')
        print()

def json_to_csv(object_list):
    global l, c_line
    l = {}
    c_line = 0
    for ov in object_list:  # Assumes json is a list of objects
        flatten(ov)
        c_line += 1
    render_csv(l)

json_to_csv(o)

Enjoy.


Answer 14

My simple way to solve this:

Create a new Python file like: json_to_csv.py

Add this code:

import csv, json, sys

# check that both the input file and the output file were passed
if len(sys.argv) > 2:

    fileInput = sys.argv[1]
    fileOutput = sys.argv[2]

    with open(fileInput) as inputFile:
        data = json.load(inputFile)

    with open(fileOutput, 'w', newline='') as outputFile:
        output = csv.writer(outputFile)

        output.writerow(data[0].keys())  # header row

        for row in data:
            output.writerow(row.values())

After adding this code, save the file and run it at the terminal:

python json_to_csv.py input.txt output.csv

I hope this helps you.

SEEYA!


Answer 15

Surprisingly, I found that none of the answers posted here so far correctly deal with all possible scenarios (e.g., nested dicts, nested lists, None values, etc).

This solution should work across all scenarios:

def flatten_json(json):
    def process_value(keys, value, flattened):
        if isinstance(value, dict):
            for key in value.keys():
                process_value(keys + [key], value[key], flattened)
        elif isinstance(value, list):
            for idx, v in enumerate(value):
                process_value(keys + [str(idx)], v, flattened)
        else:
            flattened['__'.join(keys)] = value

    flattened = {}
    for key in json.keys():
        process_value([key], json[key], flattened)
    return flattened
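
For example, applied to one object from the question's sample (the double-underscore keys come from the '__'.join above):

record = {
    "pk": 22,
    "model": "auth.permission",
    "fields": {"codename": "add_logentry", "name": "Can add log entry", "content_type": 8},
}
print(flatten_json(record))
# {'pk': 22, 'model': 'auth.permission', 'fields__codename': 'add_logentry',
#  'fields__name': 'Can add log entry', 'fields__content_type': 8}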

Answer 16

Try this

import csv, json, sys

input = open(sys.argv[1])
data = json.load(input)
input.close()

output = csv.writer(sys.stdout)

output.writerow(data[0].keys())  # header row

for item in data:
    output.writerow(item.values())

Answer 17

This code works for any json file that decodes to a flat list of objects:

# -*- coding: utf-8 -*-
"""
Created on Mon Jun 17 20:35:35 2019
author: Ram
"""

import json
import csv

with open("file1.json") as file:
    data = json.load(file)

# create the csv writer object
pt_data1 = open('pt_data1.csv', 'w', newline='')
csvwriter = csv.writer(pt_data1)

count = 0
for pt in data:
    if count == 0:
        header = pt.keys()
        csvwriter.writerow(header)
        count += 1
    csvwriter.writerow(pt.values())

pt_data1.close()

Answer 18

Modified Alec McGail’s answer to support JSON with lists inside

def flattenjson(mp, delim="|"):
    ret = []
    if isinstance(mp, dict):
        for k in mp.keys():
            csvs = flattenjson(mp[k], delim)
            for csv in csvs:
                ret.append(k + delim + str(csv))  # str() so non-string leaves concatenate
    elif isinstance(mp, list):
        for k in mp:
            csvs = flattenjson(k, delim)
            for csv in csvs:
                ret.append(csv)
    else:
        ret.append(mp)

    return ret

Thanks!
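
A quick check of what the de-methodized version above produces (output order follows dict insertion order on Python 3.7+):

print(flattenjson({"pk": 22, "tags": ["a", "b"], "fields": {"name": "x"}}))
# ['pk|22', 'tags|a', 'tags|b', 'fields|name|x']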


Answer 19

import json, csv

data = None
write_header = True
item_keys = []

try:
    with open('kk.json') as json_file:
        data = json.load(json_file)
except Exception as e:
    print(e)

with open('bar.csv', 'at', newline='') as csv_file:
    writer = csv.writer(csv_file)  # , quoting=csv.QUOTE_MINIMAL
    for item in data:
        item_values = []
        for key in item:
            if write_header:
                item_keys.append(key)
            item_values.append(item.get(key, ''))
        if write_header:
            writer.writerow(item_keys)
            write_header = False
        writer.writerow(item_values)

Answer 20

Consider the below example for converting a json format file to a csv formatted file.

{
 "item_data" : [
      {
        "item": "10023456",
        "class": "100",
        "subclass": "123"
      }
      ]
}

The below code will convert the json file (data3.json) to a csv file (data3.csv).

import json
import csv
with open("/Users/Desktop/json/data3.json") as file:
    data = json.load(file)
    print(data)

fname = "/Users/Desktop/json/data3.csv"

with open(fname, "w", newline='') as file:
    csv_file = csv.writer(file)
    # note: the keys below match the sample JSON ("item", not "dept")
    csv_file.writerow(['item',
                       'class',
                       'subclass'])
    for item in data["item_data"]:
         csv_file.writerow([item.get('item'),
                            item.get('class'),
                            item.get('subclass')])

The above code was executed in a locally installed PyCharm and successfully converted the json file to a csv file. Hope this helps to convert the files.
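
With the corrected keys, the resulting data3.csv would contain:

item,class,subclass
10023456,100,123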


Answer 21

Since the data appears to be in a dictionary format, it would appear that you should actually use csv.DictWriter() to output the lines with the appropriate header information. This should allow the conversion to be handled somewhat more easily. The fieldnames parameter would then set up the order properly, while the output of the first line as the headers would allow it to be read and processed later by csv.DictReader().

For example, Mike Repass used

output = csv.writer(sys.stdout)

output.writerow(data[0].keys())  # header row

for row in data:
  output.writerow(row.values())

However, just change the initial setup to output = csv.DictWriter(filesetting, fieldnames=data[0].keys())

Note that since the order of elements in a dictionary is not defined, you might have to create the fieldnames entries explicitly. Once you do that, writerow will work, and the writes then work as originally shown.
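
A sketch of that change on flat records (the nested "fields" object from the question would need flattening first):

import csv
import json
import sys

data = json.loads('[{"pk": 22, "model": "auth.permission"}, {"pk": 23, "model": "auth.permission"}]')

output = csv.DictWriter(sys.stdout, fieldnames=data[0].keys())
output.writeheader()        # header row, readable later by csv.DictReader
for row in data:
    output.writerow(row)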


Answer 22

Unfortunately I don't have enough reputation to make a small contribution to the amazing @Alec McGail answer. I was using Python 3 and needed to convert the map to a list, following the @Alexis R comment.

Additionally, I found the csv writer was adding an extra CR to the file (I had an empty line for each line with data inside the csv file). The solution was very easy, following the @Jason R. Coombs answer to this thread: CSV in Python adding an extra carriage return

You simply need to add the lineterminator='\n' parameter to the csv.writer. It will be: csv_w = csv.writer(out_file, lineterminator='\n')


Answer 23

You can use this code to convert a json file to a csv file. After reading the file, I convert the object to a pandas dataframe and then save it to a CSV file.

import os
import pandas as pd
import json

data = []
os.chdir('D:\\Your_directory\\folder')
with open('file_name.json', encoding="utf8") as data_file:
    for line in data_file:            # one JSON object per line (JSON Lines format)
        data.append(json.loads(line))

dataframe = pd.DataFrame(data)
## Saving the dataframe to a csv file
dataframe.to_csv("filename.csv", encoding='utf-8', index=False)

Answer 24

I might be late to the party, but I think I have dealt with a similar problem. I had json files (shown as an image in the original post) from which I only wanted to extract a few keys/values, so I wrote the following code to do exactly that.

    """json_to_csv.py
    This script reads n numbers of json files present in a folder and then extract certain data from each file and write in a csv file.
    The folder contains the python script i.e. json_to_csv.py, output.csv and another folder descriptions containing all the json files.
"""

import os
import json
import csv


def get_list_of_json_files():
    """Returns the list of filenames of all the Json files present in the folder
    Parameter
    ---------
    directory : str
        'descriptions' in this case
    Returns
    -------
    list_of_files: list
        List of the filenames of all the json files
    """

    list_of_files = os.listdir('descriptions')  # creates list of all the files in the folder

    return list_of_files


def create_list_from_json(jsonfile):
    """Returns a list of the extracted items from json file in the same order we need it.
    Parameter
    _________
    jsonfile : json
        The json file containing the data
    Returns
    -------
    one_sample_list : list
        The list of the extracted items needed for the final csv
    """

    with open(jsonfile) as f:
        data = json.load(f)

    data_list = []  # create an empty list

    # append the items to the list in the same order.
    data_list.append(data['_id'])
    data_list.append(data['_modelType'])
    data_list.append(data['creator']['_id'])
    data_list.append(data['creator']['name'])
    data_list.append(data['dataset']['_accessLevel'])
    data_list.append(data['dataset']['_id'])
    data_list.append(data['dataset']['description'])
    data_list.append(data['dataset']['name'])
    data_list.append(data['meta']['acquisition']['image_type'])
    data_list.append(data['meta']['acquisition']['pixelsX'])
    data_list.append(data['meta']['acquisition']['pixelsY'])
    data_list.append(data['meta']['clinical']['age_approx'])
    data_list.append(data['meta']['clinical']['benign_malignant'])
    data_list.append(data['meta']['clinical']['diagnosis'])
    data_list.append(data['meta']['clinical']['diagnosis_confirm_type'])
    data_list.append(data['meta']['clinical']['melanocytic'])
    data_list.append(data['meta']['clinical']['sex'])
    data_list.append(data['meta']['unstructured']['diagnosis'])
    # In few json files, the race was not there so using KeyError exception to add '' at the place
    try:
        data_list.append(data['meta']['unstructured']['race'])
    except KeyError:
        data_list.append("")  # will add an empty string in case race is not there.
    data_list.append(data['name'])

    return data_list


def write_csv():
    """Creates the desired csv file
    Parameters
    __________
    list_of_files : file
        The list created by get_list_of_json_files() method
    result.csv : csv
        The csv file containing the header only
    Returns
    _______
    result.csv : csv
        The desired csv file
    """

    list_of_files = get_list_of_json_files()
    for file in list_of_files:
        row = create_list_from_json(f'descriptions/{file}')  # create the row to be added to csv for each file (json-file)
        with open('output.csv', 'a', newline='') as c:
            writer = csv.writer(c)
            writer.writerow(row)


if __name__ == '__main__':
    write_csv()

I hope this will help.


Answer 25

This is a modification of @MikeRepass’s answer. This version writes the CSV to a file, and works for both Python 2 and Python 3.

import csv,json
input_file="data.json"
output_file="data.csv"
with open(input_file) as f:
    content=json.load(f)
try:
    context=open(output_file,'w',newline='') # Python 3
except TypeError:
    context=open(output_file,'wb') # Python 2
with context as file:
    writer=csv.writer(file)
    writer.writerow(content[0].keys()) # header row
    for row in content:
        writer.writerow(row.values())

Create a .csv file with values from a Python list

Question: Create a .csv file with values from a Python list

I am trying to create a .csv file with the values from a Python list. When I print the values in the list they are all unicode (?), i.e. they look something like this

[u'value 1', u'value 2', ...]

If I iterate through the values in the list i.e. for v in mylist: print v they appear to be plain text.

And I can put a , between each with print ','.join(mylist)

And I can output to a file, i.e.

myfile = open(...)
print >>myfile, ','.join(mylist)

But I want to output to a CSV and have delimiters around the values in the list e.g.

"value 1", "value 2", ... 

I can’t find an easy way to include the delimiters in the formatting, e.g. I have tried through the join statement. How can I do this?


Answer 0

import csv

with open(..., 'wb') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    wr.writerow(mylist)

Edit: this only works with python 2.x.

To make it work with python 3.x replace wb with w (see this SO answer)

with open(..., 'w', newline='') as myfile:
     wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
     wr.writerow(mylist)
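
With mylist = [u'value 1', u'value 2'], the file then contains exactly the delimited form asked for:

"value 1","value 2"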

Answer 1

Here is a secure version of Alex Martelli’s:

import csv

with open('filename', 'wb') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    wr.writerow(mylist)

Answer 2

For another approach, you can use DataFrame in pandas: it can easily dump the data to csv, just like the code below:

import pandas
# list_1 and list_2 are your Python lists of values
df = pandas.DataFrame(data={"col1": list_1, "col2": list_2})
df.to_csv("./file.csv", sep=',', index=False)

Answer 3

The best option I've found is using savetxt from the numpy module:

import numpy as np
np.savetxt("file_name.csv", data1, delimiter=",", fmt='%s', header=header)

In case you have multiple lists that need to be stacked

np.savetxt("file_name.csv", np.column_stack((data1, data2)), delimiter=",", fmt='%s', header=header)
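
For completeness, a self-contained sketch (data1, data2, and header are hypothetical sample values; note that by default savetxt prefixes the header line with '# ', which comments='' suppresses):

import numpy as np

data1 = ['value 1', 'value 2', 'value 3']  # hypothetical sample data
data2 = [1, 2, 3]                          # hypothetical sample data
header = 'col1,col2'                       # hypothetical header string

# comments='' stops savetxt from prefixing the header line with '# '
np.savetxt("file_name.csv", np.column_stack((data1, data2)),
           delimiter=",", fmt='%s', header=header, comments='')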

Answer 4

Use Python’s csv module for reading and writing comma- or tab-delimited files. The csv module is preferred because it gives you good control over quoting.

Here is a worked example:

import csv
data = ["value %d" % i for i in range(1,4)]

out = csv.writer(open("myfile.csv","w"), delimiter=',',quoting=csv.QUOTE_ALL)
out.writerow(data)

Produces:

"value 1","value 2","value 3"

Answer 5

You could use the string.join method in this case.

Split over a few of lines for clarity – here’s an interactive session

>>> a = ['a','b','c']
>>> first = '", "'.join(a)
>>> second = '"%s"' % first
>>> print second
"a", "b", "c"

Or as a single line

>>> print ('"%s"') % '", "'.join(a)
"a", "b", "c"

However, you may have a problem if your strings have embedded quotes. If this is the case you’ll need to decide how to escape them.

The CSV module can take care of all of this for you, allowing you to choose between various quoting options (all fields, only fields with quotes and separators, only non-numeric fields, etc.) and how to escape control characters (double quotes, or escaped strings). If your values are simple, string.join will probably be OK, but if you have to manage lots of edge cases, use the module available.


Answer 6

This solution sounds crazy, but works as smooth as honey:

import csv

with open('filename', 'wb') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL,delimiter='\n')
    wr.writerow(mylist)

The file is still written by the csv writer, so csv properties such as quoting are maintained. The '\n' delimiter does the main work by moving each list item to the next line.
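
A hedged sketch of the trick on Python 3 (the file name and values are hypothetical; the answer's 'wb' mode is replaced with 'w' plus newline='', as in the earlier answers):

import csv

mylist = ['value 1', 'value 2']  # hypothetical sample data

with open('filename', 'w', newline='') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL, delimiter='\n')
    wr.writerow(mylist)

# filename should now contain each quoted value on its own line:
# "value 1"
# "value 2"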


Answer 7

To create and write into a csv file

The example below demonstrates creating and writing a csv file. To make a dynamic file writer we need to import the csv package, then create a file handle, e.g. with open("D:\\sample.csv", "w", newline="") as file_writer.

If the file does not exist at the mentioned directory, Python will create it there. "w" stands for write; to read a file instead, replace "w" with "r", or use "a" to append to an existing file. newline="" prevents an extra empty row from being written after each row. Create some field names (column names) using a list, like fields=["Names","Age","Class"], then build a dictionary writer and assign the column names with writer=csv.DictWriter(file_writer, fieldnames=fields). To write the column names to the csv, use writer.writeheader(); to write values, use writer.writerow({"Names":"John","Age":21,"Class":"12A"}). Values must be passed as a dictionary, where each key is a column name and the value is that column's field value.

import csv

with open("D:\\sample.csv", "w", newline="") as file_writer:
    fields = ["Names", "Age", "Class"]
    writer = csv.DictWriter(file_writer, fieldnames=fields)
    writer.writeheader()
    writer.writerow({"Names": "John", "Age": 21, "Class": "12A"})

Answer 8

Jupyter notebook

Let’s say that your list is A.

Then you can run the following and you will have it as a csv file (columns only!):

R="\n".join(A)
f = open('Columns.csv','w')
f.write(R)
f.close()

Answer 9

You should use the CSV module for sure, but chances are you need to write unicode. For those who need to write unicode, these are the classes from the examples page that you can use as a util module:

import csv, codecs, cStringIO

class UTF8Recoder:
    """
    Iterator that reads an encoded stream and reencodes the input to UTF-8
    """
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
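
A brief usage sketch of the UnicodeWriter defined above (Python 2 only, since cStringIO and unicode are Python 2 constructs; the file name and row values are hypothetical):

with open('unicode_out.csv', 'wb') as f:
    writer = UnicodeWriter(f, encoding='utf-8')
    writer.writerows([[u'caf\xe9', u'sm\xf6rg\xe5sbord'], [u'plain', u'ascii']])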

Answer 10

Here is another solution that does not require the csv module.

print ', '.join(['"'+i+'"' for i in myList])

Example :

>>> myList = [u'value 1', u'value 2', u'value 3']
>>> print ', '.join(['"'+i+'"' for i in myList])
"value 1", "value 2", "value 3"

However, if the initial list contains some " characters, they will not be escaped. If that is required, it is possible to call a function to escape them, like this:

print ', '.join(['"'+myFunction(i)+'"' for i in myList])
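
A minimal sketch of such a function (myFunction is the hypothetical helper from the line above; doubling embedded quotes is the same convention the csv module uses):

def myFunction(value):
    # double any embedded double quotes, as the csv module would
    return value.replace('"', '""')

myList = [u'plain', u'has "quotes" inside']
print ', '.join(['"'+myFunction(i)+'"' for i in myList])
# "plain", "has ""quotes"" inside"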

_csv.Error: field larger than field limit (131072)

Question: _csv.Error: field larger than field limit (131072)

I have a script reading in a csv file with very huge fields:

# example from http://docs.python.org/3.3/library/csv.html?highlight=csv%20dictreader#examples
import csv
with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

However, this throws the following error on some csv files:

_csv.Error: field larger than field limit (131072)

How can I analyze csv files with huge fields? Skipping the lines with huge fields is not an option as the data needs to be analyzed in subsequent steps.


Answer 0

The csv file might contain very huge fields, therefore increase the field_size_limit:

import sys
import csv

csv.field_size_limit(sys.maxsize)

sys.maxsize works for Python 2.x and 3.x. sys.maxint would only work with Python 2.x (SO: what-is-sys-maxint-in-python-3)

Update

As Geoff pointed out, the code above might result in the following error: OverflowError: Python int too large to convert to C long. To circumvent this, you could use the following quick and dirty code (which should work on every system with Python 2 and Python 3):

import sys
import csv
maxInt = sys.maxsize

while True:
    # decrease the maxInt value by factor 10 
    # as long as the OverflowError occurs.

    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt/10)

Answer 1

This could be because your CSV file has embedded single or double quotes. If your CSV file is tab-delimited try opening it as:

c = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)

Answer 2

Below is how to check the current limit:

csv.field_size_limit()

Out[20]: 131072

Below is how to increase the limit. Add it to your code:

csv.field_size_limit(100000000)

Try checking the limit again

csv.field_size_limit()

Out[22]: 100000000

Now you won’t get the error “_csv.Error: field larger than field limit (131072)”


Answer 3

csv field sizes are controlled via [Python 3.Docs]: csv.field_size_limit([new_limit]):

Returns the current maximum field size allowed by the parser. If new_limit is given, this becomes the new limit.

It is set by default to 128k or 0x20000 (131072), which should be enough for any decent .csv:

>>> import csv
>>>
>>> limit0 = csv.field_size_limit()
>>> limit0
131072
>>> "0x{0:016X}".format(limit0)
'0x0000000000020000'

However, when dealing with a .csv file (with the correct quoting and delimiter) having (at least) one field longer than this size, the error pops up.
To get rid of the error, the size limit should be increased (to avoid any worries, the maximum possible value is attempted).

Behind the scenes (check [GitHub]: python/cpython – (master) cpython/Modules/_csv.c for implementation details), the variable that holds this value is a C long ([Wikipedia]: C data types), whose size varies depending on CPU architecture and OS (ILP). The classical difference: for a 64bit OS (Python build), the long type size (in bits) is:

  • Nix: 64
  • Win: 32

When attempting to set it, the new value is checked to be in the long boundaries, that’s why in some cases another exception pops up (this case is common on Win):

>>> import sys
>>>
>>> sys.platform, sys.maxsize
('win32', 9223372036854775807)
>>>
>>> csv.field_size_limit(sys.maxsize)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: Python int too large to convert to C long

To avoid running into this problem, set the (maximum possible) limit (LONG_MAX) using an artifice (thanks to [Python 3.Docs]: ctypes – A foreign function library for Python). It should work on Python 3 and Python 2, on any CPU / OS.

>>> import ctypes as ct
>>>
>>> csv.field_size_limit(int(ct.c_ulong(-1).value // 2))
131072
>>> limit1 = csv.field_size_limit()
>>> limit1
2147483647
>>> "0x{0:016X}".format(limit1)
'0x000000007FFFFFFF'

64bit Python on a Nix like OS:

>>> import sys, csv, ctypes as ct
>>>
>>> sys.platform, sys.maxsize
('linux', 9223372036854775807)
>>>
>>> csv.field_size_limit()
131072
>>>
>>> csv.field_size_limit(int(ct.c_ulong(-1).value // 2))
131072
>>> limit1 = csv.field_size_limit()
>>> limit1
9223372036854775807
>>> "0x{0:016X}".format(limit1)
'0x7FFFFFFFFFFFFFFF'

For 32bit Python, things are uniform: it’s the behavior encountered on Win.

Check the resources linked above for more details.


Answer 4

I just had this happen to me on a ‘plain’ CSV file. Some people might call it an invalid formatted file. No escape characters, no double quotes and delimiter was a semicolon.

A sample line from this file would look like this:

First cell; Second " Cell with one double quote and leading space;'Partially quoted' cell;Last cell

The stray quote in the second cell would throw the parser off its rails. What worked was:

csv.reader(inputfile, delimiter=';', doublequote=False, quotechar=None, quoting=csv.QUOTE_NONE)
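
A short usage sketch under the same assumptions (a semicolon-delimited file, hypothetically named messy.csv, containing stray quotes):

import csv

with open('messy.csv', newline='') as inputfile:
    # QUOTE_NONE makes the reader treat quote characters as ordinary data
    reader = csv.reader(inputfile, delimiter=';', doublequote=False,
                        quotechar=None, quoting=csv.QUOTE_NONE)
    for row in reader:
        print(row)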

Answer 5

Sometimes a row contains a column with a double quote in it. When the csv reader tries to read such a row, it cannot find the end of the column and raises this error. The solution is below:

reader = csv.reader(cf, quoting=csv.QUOTE_MINIMAL)

Answer 6

You can use read_csv from pandas to skip these lines.

import pandas as pd

data_df = pd.read_csv('data.csv', error_bad_lines=False)
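
Note that in newer pandas releases (1.3 and later), error_bad_lines is deprecated in favor of on_bad_lines, so the equivalent call there would be:

import pandas as pd

data_df = pd.read_csv('data.csv', on_bad_lines='skip')  # pandas >= 1.3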

Answer 7

Find the cqlshrc file, usually placed in the .cassandra directory (this applies when the error comes from Cassandra's cqlsh, which uses Python's csv module internally).

In that file, append:

[csv]
field_size_limit = 1000000000

Create a Pandas DataFrame from a string

Question: Create a Pandas DataFrame from a string

In order to test some functionality I would like to create a DataFrame from a string. Let’s say my test data looks like:

TESTDATA="""col1;col2;col3
1;4.4;99
2;4.5;200
3;4.7;65
4;3.2;140
"""

What is the simplest way to read that data into a Pandas DataFrame?


Answer 0

A simple way to do this is to use StringIO.StringIO (python2) or io.StringIO (python3) and pass that to the pandas.read_csv function. E.g:

import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd

TESTDATA = StringIO("""col1;col2;col3
    1;4.4;99
    2;4.5;200
    3;4.7;65
    4;3.2;140
    """)

df = pd.read_csv(TESTDATA, sep=";")

Answer 1

Split Method

data = input_string
df = pd.DataFrame([x.split(';') for x in data.split('\n')])
print(df)
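
A self-contained version (input_string is a hypothetical stand-in for the question's TESTDATA, without the trailing newline) that also promotes the first row to the header:

import pandas as pd

input_string = "col1;col2;col3\n1;4.4;99\n2;4.5;200"  # hypothetical sample data
rows = [x.split(';') for x in input_string.split('\n')]
df = pd.DataFrame(rows[1:], columns=rows[0])  # the first row becomes the header
print(df)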

Answer 2

A quick and easy solution for interactive work is to copy-and-paste the text by loading the data from the clipboard.

Select the content of the string with your mouse:

In the Python shell use read_clipboard()

>>> pd.read_clipboard()
  col1;col2;col3
0       1;4.4;99
1      2;4.5;200
2       3;4.7;65
3      4;3.2;140

Use the appropriate separator:

>>> pd.read_clipboard(sep=';')
   col1  col2  col3
0     1   4.4    99
1     2   4.5   200
2     3   4.7    65
3     4   3.2   140

>>> df = pd.read_clipboard(sep=';') # save to dataframe

Answer 3

This answer applies when a string is manually entered, not when it’s read from somewhere.

A traditional variable-width CSV is unreadable for storing data as a string variable. Especially for use inside a .py file, consider fixed-width pipe-separated data instead. Various IDEs and editors may have a plugin to format pipe-separated text into a neat table.

Using read_csv

Store the following in a utility module, e.g. util/pandas.py. An example is included in the function’s docstring.

import io
import re

import pandas as pd


def read_psv(str_input: str, **kwargs) -> pd.DataFrame:
    """Read a Pandas object from a pipe-separated table contained within a string.

    Input example:
        | int_score | ext_score | eligible |
        |           | 701       | True     |
        | 221.3     | 0         | False    |
        |           | 576       | True     |
        | 300       | 600       | True     |

    The leading and trailing pipes are optional, but if one is present,
    so must be the other.

    `kwargs` are passed to `read_csv`. They must not include `sep`.

    In PyCharm, the "Pipe Table Formatter" plugin has a "Format" feature that can 
    be used to neatly format a table.

    Ref: https://stackoverflow.com/a/46471952/
    """

    substitutions = [
        ('^ *', ''),  # Remove leading spaces
        (' *$', ''),  # Remove trailing spaces
        (r' *\| *', '|'),  # Remove spaces between columns
    ]
    if all(line.lstrip().startswith('|') and line.rstrip().endswith('|') for line in str_input.strip().split('\n')):
        substitutions.extend([
            (r'^\|', ''),  # Remove redundant leading delimiter
            (r'\|$', ''),  # Remove redundant trailing delimiter
        ])
    for pattern, replacement in substitutions:
        str_input = re.sub(pattern, replacement, str_input, flags=re.MULTILINE)
    return pd.read_csv(io.StringIO(str_input), sep='|', **kwargs)
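
A quick usage sketch of the read_psv helper above (the table contents are hypothetical):

table = """
| a | b |
| 1 | 2 |
| 3 | 4 |
"""
df = read_psv(table.strip())
print(df)
#    a  b
# 0  1  2
# 1  3  4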

Non-working alternatives

The code below doesn’t work properly because it adds an empty column on both the left and right sides.

df = pd.read_csv(io.StringIO(df_str), sep=r'\s*\|\s*', engine='python')

As for read_fwf, it doesn’t actually use so many of the optional kwargs that read_csv accepts and uses. As such, it shouldn’t be used at all for pipe-separated data.


Answer 4

The simplest way is to save it to a temp file and then read it back:

import pandas as pd

CSV_FILE_NAME = 'temp_file.csv'  # Consider creating temp file, look URL below
with open(CSV_FILE_NAME, 'w') as outfile:
    outfile.write(TESTDATA)
df = pd.read_csv(CSV_FILE_NAME, sep=';')

The right way of creating a temp file: How can I create a tmp file in Python?