I am using the code referred to below to edit a csv using Python. The functions called in the code form the upper part of the code.
Problem: I want the code below to start editing the csv from the 2nd row; I want it to exclude the 1st row, which contains headers. Right now it is applying the functions to the 1st row only, and my header row is getting changed.
with open("tmob_notcleaned.csv","rb")as infile, open("tmob_cleaned.csv","wb")as outfile:
reader = csv.reader(infile)
next(reader,None)# skip the headers
writer = csv.writer(outfile)for row in reader:# process each row
writer.writerow(row)# no need to close, the files are closed automatically when you get to this point.
Your reader variable is an iterable; by looping over it you retrieve the rows.
To make it skip one item before your loop, simply call next(reader, None) and ignore the return value.
You can also simplify your code a little; use the opened files as context managers to have them closed automatically:
with open("tmob_notcleaned.csv", "rb") as infile, open("tmob_cleaned.csv", "wb") as outfile:
reader = csv.reader(infile)
next(reader, None) # skip the headers
writer = csv.writer(outfile)
for row in reader:
# process each row
writer.writerow(row)
# no need to close, the files are closed automatically when you get to this point.
If you wanted to write the header to the output file unprocessed, that’s easy too: pass the output of next() to writer.writerow():

headers = next(reader, None)  # returns the headers or `None` if the input is empty
if headers:
    writer.writerow(headers)
Answer 1
Another way of solving this is to use the DictReader class, which “skips” the header row and uses it to allow named indexing.
Given “foo.csv” as follows:
FirstColumn,SecondColumn
asdf,1234
qwer,5678
Use DictReader like this:
import csv

with open('foo.csv') as f:
    reader = csv.DictReader(f, delimiter=',')
    for row in reader:
        print(row['FirstColumn'])   # Access by column header instead of column number
        print(row['SecondColumn'])
open() returns a file object, and is most commonly used with two arguments: open(filename, mode).
>>> f = open('workfile', 'w')
>>> print f
<open file 'workfile', mode 'w' at 80a0960>
The first argument is a string containing the filename. The second argument is
another string containing a few characters describing the way in which
the file will be used. mode can be ‘r’ when the file will only be
read, ‘w’ for only writing (an existing file with the same name will
be erased), and ‘a’ opens the file for appending; any data written to
the file is automatically added to the end. ‘r+’ opens the file for
both reading and writing. The mode argument is optional; ‘r’ will be
assumed if it’s omitted.
On Windows, ‘b’ appended to the mode opens the file in binary mode, so
there are also modes like ‘rb’, ‘wb’, and ‘r+b’. Python on Windows
makes a distinction between text and binary files; the end-of-line
characters in text files are automatically altered slightly when data
is read or written. This behind-the-scenes modification to file data
is fine for ASCII text files, but it’ll corrupt binary data like that
in JPEG or EXE files. Be very careful to use binary mode when reading
and writing such files. On Unix, it doesn’t hurt to append a ‘b’ to
the mode, so you can use it platform-independently for all binary
files.
If the file exists and contains data, then it is possible to generate the fieldnames parameter for csv.DictWriter automatically:

# read header automatically
with open(myFile, "r") as f:
    reader = csv.reader(f)
    for header in reader:
        break

# add row to CSV file
with open(myFile, "a", newline='') as f:
    writer = csv.DictWriter(f, fieldnames=header)
    writer.writerow(myDict)
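As a side note, the same header grab can be written a little more directly with next(); a small sketch under the same assumptions (myFile is whatever path you are using):

# read header automatically
with open(myFile, "r") as f:
    reader = csv.reader(f)
    header = next(reader)  # same result as the for/break loop above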
Answer 5
# I like using the codecs opening in a with
field_names = ['latitude', 'longitude', 'date', 'user', 'text']
with codecs.open(filename, "ab", encoding='utf-8') as logfile:
    logger = csv.DictWriter(logfile, fieldnames=field_names)
    logger.writeheader()
    # some more code stuff
    for video in aList:
        video_result = {}
        video_result['date'] = video['snippet']['publishedAt']
        video_result['user'] = video['id']
        video_result['text'] = video['snippet']['description'].encode('utf8')
        logger.writerow(video_result)
I use pandas.to_datetime to parse the dates in my data. Pandas by default represents the dates with datetime64[ns] even though the dates are all daily only.
I wonder whether there is an elegant/clever way to convert the dates to datetime.date or datetime64[D] so that, when I write the data to CSV, the dates are not appended with 00:00:00. I know I can convert the type manually element-by-element:
[dt.to_datetime().date() for dt in df.dates]
But this is really slow since I have many rows and it sort of defeats the purpose of using pandas.to_datetime. Is there a way to convert the dtype of the entire column at once? Or alternatively, does pandas.to_datetime support a precision specification so that I can get rid of the time part while working with daily data?
Since version 0.15.0 this can now be easily done using .dt to access just the date component:
df['just_date'] = df['dates'].dt.date
The above returns a datetime.date dtype; if you want to have a datetime64 then you can just normalize the time component to midnight, which sets all the values to 00:00:00:
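A minimal sketch of that normalization, assuming the same 'dates' column as above:

df['normalised_date'] = df['dates'].dt.normalize()

The dtype stays datetime64[ns]; only the time-of-day is reset to midnight.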
While I upvoted EdChum’s answer, which is the most direct answer to the question the OP posed, it does not really solve the performance problem (it still relies on Python datetime objects, and hence any operation on them will not be vectorized – that is, it will be slow).
A better performing alternative is to use df['dates'].dt.floor('d'). Strictly speaking, it does not “keep only date part”, since it just sets the time to 00:00:00. But it does work as desired by the OP when, for instance:
printing to screen
saving to csv
using the column to groupby
… and it is much more efficient, since the operation is vectorized.
EDIT: in fact, the answer the OP would have preferred is probably “recent versions of pandas do not write the time to csv if it is 00:00:00 for all observations”.
Pandas v0.13+: Use to_csv with date_format parameter
Avoid, where possible, converting your datetime64[ns] series to an object dtype series of datetime.date objects. The latter, often constructed using pd.Series.dt.date, is stored as an array of pointers and is inefficient relative to a pure NumPy-based series.
Since your concern is format when writing to CSV, just use the date_format parameter of to_csv. For example:
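For instance, a small sketch (the file name is made up):

df.to_csv('out.csv', date_format='%Y-%m-%d')

This leaves the in-memory series as datetime64[ns] and only changes how the dates are rendered on disk.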
Just giving a more up-to-date answer in case someone sees this old post.
Adding utc=False when converting to datetime will remove the timezone component and keep only the date in a datetime64[ns] data type.
pd.to_datetime(df['Date'], utc=False)
You will be able to save it in excel without getting the error “ValueError: Excel does not support datetimes with timezones. Please ensure that datetimes are timezone unaware before writing to Excel.”
I wanted to be able to change the type for a set of columns in a data frame and then remove the time while keeping the day; round(), floor(), and ceil() all work.
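As a quick sketch of that approach (the column names are assumptions):

df['dates'] = pd.to_datetime(df['dates'])
df['day'] = df['dates'].dt.floor('d')  # .dt.round('d') and .dt.ceil('d') work the same way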
You can convert a string to a file object using io.StringIO and then pass that to the csv module:
from io import StringIO
import csv
scsv = """text,with,Polish,non-Latin,letters
1,2,3,4,5,6
a,b,c,d,e,f
gęś,zółty,wąż,idzie,wąską,dróżką,
"""
f = StringIO(scsv)
reader = csv.reader(f, delimiter=',')
for row in reader:
    print('\t'.join(row))
A simpler version, with split() on newlines:
reader = csv.reader(scsv.split('\n'), delimiter=',')
for row in reader:
    print('\t'.join(row))
Or you can simply split() this string into lines using \n as the separator and then split() each line into values, but this way you must be aware of quoting, so using the csv module is preferred.
On Python 2 you have to import StringIO as
from StringIO import StringIO
instead.
Answer 1

Simple - the csv module also works with lists:

>>> a = ["1,2,3", "4,5,6"]  # or a = "1,2,3\n4,5,6".split('\n')
>>> import csv
>>> x = csv.reader(a)
>>> list(x)
[['1', '2', '3'], ['4', '5', '6']]
import csv
text = """1,2,3
a,b,c
d,e,f"""
lines = text.splitlines()
reader = csv.reader(lines, delimiter=',')
for row in reader:
    print('\t'.join(row))
Answer 3
>>> a = "1,2"
>>> a
'1,2'
>>> b = a.split(",")
>>> b
['1', '2']
To parse a CSV file:
f = open("file.csv", "r")
lines = f.read().split("\n")  # "\r\n" if needed

for line in lines:
    if line != "":  # add other needed checks to skip titles
        cols = line.split(",")
        print cols
As others have already pointed out, Python includes a module to read and write CSV files. It works pretty well as long as the input characters stay within ASCII limits. In case you want to process other encodings, more work is needed.
The Python documentation for the csv module implements an extension of csv.reader, which uses the same interface but can handle other encodings and returns unicode strings. Just copy and paste the code from the documentation. After that, you can process a CSV file like this:
with open("some.csv", "rb") as csvFile:
for row in UnicodeReader(csvFile, encoding="iso-8859-15"):
print row
The error shows that the machine does not have enough memory to read the entire
CSV into a DataFrame at one time. Assuming you do not need the entire dataset in
memory all at one time, one way to avoid the problem would be to process the CSV in
chunks (by specifying the chunksize parameter):
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
The chunksize parameter specifies the number of rows per chunk.
(The last chunk may contain fewer than chunksize rows, of course.)
Does your workflow require slicing, manipulating, exporting?
If so, you can use dask.dataframe to slice, perform your calculations and export iteratively. Chunking is performed silently by dask, which also supports a subset of pandas API.
Another great alternative would be to use modin, because all the functionality is identical to pandas yet it leverages distributed dataframe libraries such as dask.
The above answer is already satisfying the topic. Anyway, if you need all the data in memory, have a look at bcolz. It compresses the data in memory. I have had a really good experience with it, but it’s missing a lot of pandas features.
Edit: I got compression rates of around 1/10 of the original size, I think, of course depending on the kind of data. Important features missing were aggregates.
Answer 5
You can read in the data as chunks and save each chunk as pickle.
import pandas as pd
import pickle

in_path = ""          # Path where the large file is
out_path = ""         # Path to save the pickle files to
chunk_size = 400000   # size of chunks relies on your available memory
separator = "~"

reader = pd.read_csv(in_path, sep=separator, chunksize=chunk_size,
                     low_memory=False)

for i, chunk in enumerate(reader):
    out_file = out_path + "/data_{}.pkl".format(i+1)
    with open(out_file, "wb") as f:
        pickle.dump(chunk, f, pickle.HIGHEST_PROTOCOL)
In the next step you read in the pickles and append each pickle to your desired dataframe.
import glob

pickle_path = ""  # Same Path as out_path i.e. where the pickle files are

data_p_files = []
for name in glob.glob(pickle_path + "/data_*.pkl"):
    data_p_files.append(name)

df = pd.DataFrame([])
for i in range(len(data_p_files)):
    df = df.append(pd.read_pickle(data_p_files[i]), ignore_index=True)
TextFileReader = pd.read_csv(path, chunksize=1000)  # the number of rows per chunk

dfList = []
for df in TextFileReader:
    dfList.append(df)

df = pd.concat(dfList, sort=False)
Answer 8
chunkTemp = []
queryTemp = []
query = pd.DataFrame()

for chunk in pd.read_csv(file, header=0, chunksize=<your_chunksize>, iterator=True, low_memory=False):

    #REPLACING BLANK SPACES AT COLUMNS' NAMES FOR SQL OPTIMIZATION
    chunk = chunk.rename(columns = {c: c.replace(' ', '') for c in chunk.columns})

    #YOU CAN EITHER:
    #1)BUFFER THE CHUNKS IN ORDER TO LOAD YOUR WHOLE DATASET
    chunkTemp.append(chunk)

    #2)DO YOUR PROCESSING OVER A CHUNK AND STORE THE RESULT OF IT
    query = chunk[chunk[<column_name>].str.startswith(<some_pattern>)]

    #BUFFERING PROCESSED DATA
    queryTemp.append(query)

#! NEVER DO pd.concat OR pd.DataFrame() INSIDE A LOOP
print("Database: CONCATENATING CHUNKS INTO A SINGLE DATAFRAME")
chunk = pd.concat(chunkTemp)
print("Database: LOADED")

#CONCATENATING PROCESSED DATA
query = pd.concat(queryTemp)
print(query)
I want to give a more comprehensive answer based on most of the potential solutions already provided. I also want to point out one more potential aid that may help the reading process.
Option 1: dtypes
“dtypes” is a pretty powerful parameter that you can use to reduce the memory pressure of read methods. See this and this answer. By default, pandas tries to infer the dtypes of the data.
For every piece of data stored, a memory allocation takes place. At a basic level, refer to the values below (the table below illustrates values for the C programming language):
The maximum value of UNSIGNED CHAR = 255
The minimum value of SHORT INT = -32768
The maximum value of SHORT INT = 32767
The minimum value of INT = -2147483648
The maximum value of INT = 2147483647
The minimum value of CHAR = -128
The maximum value of CHAR = 127
The minimum value of LONG = -9223372036854775808
The maximum value of LONG = 9223372036854775807
Refer to this page to see the matching between NumPy and C types.
Let’s say you have an array of integers made up of digits. You could both theoretically and practically assign it, say, a 16-bit integer type, but you would then allocate more memory than you actually need to store that array. To prevent this, you can set the dtype option on read_csv: you do not want to store the array items as long integers when they actually fit in an 8-bit integer (np.int8 or np.uint8).
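As a small sketch of the idea (the file and column names are made up):

import numpy as np
import pandas as pd

# tell pandas up front that this column fits in an unsigned 8-bit integer
df = pd.read_csv('large.csv', dtype={'digit_column': np.uint8})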
Option 2: chunks
Reading the data in chunks allows you to access a part of the data in memory, and you can apply preprocessing to your data and preserve the processed data rather than the raw data. It would be much better if you combine this option with the first one, dtypes.
I want to point out the pandas cookbook sections covering that process; see in particular the chunking examples there.
Option 3: Dask
Dask is a framework that is defined on Dask’s website as:
Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love
It was born to cover the necessary parts where pandas cannot reach. Dask is a powerful framework that allows you much more data access by processing it in a distributed way.
You can use dask to preprocess your data as a whole; Dask takes care of the chunking part, so unlike pandas you can just define your processing steps and let Dask do the work. Dask does not apply the computations before they are explicitly pushed by compute and/or persist (see the answer here for the difference).
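A minimal sketch of that workflow (the file and column names are made up):

import dask.dataframe as dd

df = dd.read_csv('large.csv')      # lazy; nothing is loaded yet
result = df.groupby('key').size()  # still lazy
print(result.compute())            # the computation happens here, chunk by chunk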
Other Aids (Ideas)
An ETL flow designed for the data, keeping only what is needed from the raw data:
First, apply ETL to the whole dataset with frameworks like Dask or PySpark, and export the processed data.
Then see if the processed data fits in memory as a whole.
Consider increasing your RAM.
Consider working with that data on a cloud platform.
In addition to the answers above, for those who want to process CSV and then export to csv, parquet or SQL, d6tstack is another good option. You can load multiple files, and it deals with data schema changes (added/removed columns). Chunked out-of-core support is already built in.
def apply(dfg):
    # do stuff
    return dfg

c = d6tstack.combine_csv.CombinerCSV(['bigfile.csv'], apply_after_read=apply, sep=',', chunksize=1e6)

# or
c = d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), apply_after_read=apply, chunksize=1e6)

# output to various formats, automatically chunked to reduce memory consumption
c.to_csv_combine(filename='out.csv')
c.to_parquet_combine(filename='out.pq')
c.to_psql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')  # fast for postgres
c.to_mysql_combine('mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename')  # fast for mysql
c.to_sql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')  # slow but flexible
In case someone is still looking for something like this, I found that this new library called modin can help. It uses distributed computing that can help with the read. Here’s a nice article comparing its functionality with pandas. It essentially uses the same functions as pandas.
import modin.pandas as pd
pd.read_csv(CSV_FILE_NAME)
Before using the chunksize option, if you want to be sure about the process function that you will write inside the chunking for-loop (as mentioned by @unutbu), you can simply use the nrows option:

small_df = pd.read_csv(filename, nrows=100)

Once you are sure that the process block is ready, you can put it in the chunking for-loop for the entire dataframe.
import csv

with open('file.csv', 'rb') as f:
    reader = csv.reader(f)
    your_list = list(reader)

print your_list
# [['This is the first line', 'Line1'],
#  ['This is the second line', 'Line2'],
#  ['This is the third line', 'Line3']]
Pandas can also read the CSV into a list of dicts:

import pandas as pd

# Get data - reading the CSV file
import mpu.pd
df = mpu.pd.example_df()

# Convert
dicts = df.to_dict('records')

The content of df is:

     country   population population_time    EUR
0    Germany   82521653.0      2016-12-01   True
1     France   66991000.0      2017-01-01   True
2  Indonesia  255461700.0      2017-01-01  False
3    Ireland    4761865.0             NaT   True
4      Spain   46549045.0      2017-06-01   True
5    Vatican          NaN             NaT   True
Pandas is pretty good at dealing with data. Here is one example of how to use it:
import pandas as pd
# Read the CSV into a pandas data frame (df)
# With a df you can do many things
# most important: visualize data with Seaborn
df = pd.read_csv('filename.csv', delimiter=',')
# Or export it in many ways, e.g. a list of tuples
tuples = [tuple(x) for x in df.values]
# or export it as a list of dicts
dicts = df.to_dict().values()
One big advantage is that pandas deals automatically with header rows.
If you haven’t heard of Seaborn, I recommend having a look at it.
import pandas as pd
# Get data - reading the CSV file
import mpu.pd
df = mpu.pd.example_df()
# Convert
lists = [[row[col] for col in df.columns] for row in df.to_dict('records')]
If you are sure there are no commas in your input other than the one separating the category, you can read the file line by line, split on ',', and append the result to a list.
That said, it looks like you are looking at a CSV file, so you might consider using the csv module for it.
Answer 5

result = []
for line in text.splitlines():
    result.append(tuple(line.split(",")))
As said already in the comments, you can use the csv library in Python. csv means comma-separated values, which seems to be exactly your case: a label and a value separated by a comma.
Being a category-and-value type of data, I would rather use a dictionary than a list of tuples.
Anyway, in the code below I show both ways: d is the dictionary and l is the list of tuples.
import csv

file_name = "test.txt"
try:
    csvfile = open(file_name, 'rt')
except OSError:
    print("File not found")
    raise

csvReader = csv.reader(csvfile, delimiter=",")

d = dict()
l = list()
for row in csvReader:
    d[row[1]] = row[0]
    l.append((row[0], row[1]))

print(d)
print(l)
Answer 7

A simple loop is enough:

lines = []
with open('test.txt', 'r') as f:
    for line in f.readlines():
        l, name = line.strip().split(',')
        lines.append((l, name))

print lines
Unfortunately I find none of the existing answers particularly satisfying.
Here is a straightforward and complete Python 3 solution, using the csv module.
import csv
with open('../resources/temp_in.csv', newline='') as f:
    reader = csv.reader(f, skipinitialspace=True)
    rows = list(reader)

print(rows)
Notice the skipinitialspace=True argument. This is necessary since, unfortunately, OP’s CSV contains whitespace after each comma.
Output:
[['This is the first line', 'Line1'], ['This is the second line', 'Line2'], ['This is the third line', 'Line3']]
Answer 9
Extending your requirements a bit and assuming you do not care about the order of lines and want to get them grouped under categories, the following solution may work for you:
>>> fname = "lines.txt"
>>> from collections import defaultdict
>>> dct = defaultdict(list)
>>> with open(fname) as f:
...     for line in f:
...         text, cat = line.rstrip("\n").split(",", 1)
...         dct[cat].append(text)
...
>>> dct
defaultdict(<type 'list'>, {' CatA': ['This is the first line', 'This is the another line'], ' CatC': ['This is the third line'], ' CatB': ['This is the second line', 'This is the last line']})
This way you get all relevant lines available in the dictionary under key being the category.
Answer 10
Here is the easiest way in Python 3.x to import a CSV to a multidimensional array, and it’s only 4 lines of code without importing anything!

#pull a CSV into a multidimensional array in 4 lines!
L = []                        #Create an empty list for the main array
for line in open('log.txt'):  #Open the file and read all the lines
    x = line.rstrip()         #Strip the \n from each line
    L.append(x.split(','))    #Split each line into a list and add it to the
                              #Multidimensional array
print(L)
I have a JSON file I want to convert to a CSV file. How can I do this with Python?
I tried:
import json
import csv
f = open('data.json')
data = json.load(f)
f.close()
f = open('data.csv')
csv_file = csv.writer(f)
for item in data:
    csv_file.writerow(item)
f.close()
However, it did not work. I am using Django and the error I received is:
`file' object has no attribute 'writerow'`
I then tried the following:
import json
import csv
f = open('data.json')
data = json.load(f)
f.close()
f = open('data.csv')
csv_file = csv.writer(f)
for item in data:
    f.writerow(item) # ← changed
f.close()
With the pandas library, this is as easy as using two commands!

pandas.read_json()

converts a JSON string to a pandas object (either a series or a dataframe). Then, assuming the results were stored as df:

df.to_csv()

which can either return a string or write directly to a csv file.
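Put together, a minimal sketch (the file names are made up, and the JSON is assumed to decode into an array of flat records):

import pandas as pd

df = pd.read_json('data.json')
df.to_csv('data.csv', index=False)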
Based on the verbosity of previous answers, we should all thank pandas for the shortcut.
Answer 2
I am assuming that your JSON file will decode into a list of dictionaries. First we need a function which will flatten the JSON objects:
def flattenjson( b, delim ):
    val = {}
    for i in b.keys():
        if isinstance( b[i], dict ):
            get = flattenjson( b[i], delim )
            for j in get.keys():
                val[ i + delim + j ] = get[j]
        else:
            val[i] = b[i]
    return val
The result of running this snippet on your JSON object:
JSON can represent a wide variety of data structures — a JS “object” is roughly like a Python dict (with string keys), a JS “array” roughly like a Python list, and you can nest them as long as the final “leaf” elements are numbers or strings.
CSV can essentially represent only a 2-D table — optionally with a first row of “headers”, i.e., “column names”, which can make the table interpretable as a list of dicts, instead of the normal interpretation, a list of lists (again, “leaf” elements can be numbers or strings).
So, in the general case, you can’t translate an arbitrary JSON structure to a CSV. In a few special cases you can (array of arrays with no further nesting; arrays of objects which all have exactly the same keys). Which special case, if any, applies to your problem? The details of the solution depend on which special case you do have. Given the astonishing fact that you don’t even mention which one applies, I suspect you may not have considered the constraint, neither usable case in fact applies, and your problem is impossible to solve. But please do clarify!
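For illustration, if the second special case applies (an array of objects that all share the same keys), a sketch with csv.DictWriter might look like this (the sample data is made up):

import csv
import json

rows = json.loads('[{"a": 1, "b": 2}, {"a": 3, "b": 4}]')
with open('out.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)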
This code should work for you, assuming that your JSON data is in a file called data.json.
import json
import csv
with open("data.json") as file:
data = json.load(file)
with open("data.csv", "w") as file:
csv_file = csv.writer(file)
for item in data:
fields = list(item['fields'].values())
csv_file.writerow([item['pk'], item['model']] + fields)
import json
import csv
f = open('test.json')
data = json.load(f)
f.close()
f=csv.writer(open('test.csv','wb+'))
for item in data:
    f.writerow([item['pk'], item['model']] + item['fields'].values())
This assumes the data provided is in a file named test.json.
encoding='utf-8' may not be necessary.
The following code takes advantage of the pathlib library.
.open is a method of pathlib.
Works with non-Windows paths too.
import pandas as pd
# As of Pandas 1.01, json_normalize as pandas.io.json.json_normalize is deprecated and is now exposed in the top-level namespace.
# from pandas.io.json import json_normalize
from pathlib import Path
import json
# set path to file
p = Path(r'c:\some_path_to_file\test.json')
# read json
with p.open('r', encoding='utf-8') as f:
    data = json.loads(f.read())
# create dataframe
df = pd.json_normalize(data)
# dataframe view
pk model fields.codename fields.name fields.content_type
22 auth.permission add_logentry Can add log entry 8
23 auth.permission change_logentry Can change log entry 8
24 auth.permission delete_logentry Can delete log entry 8
4 auth.permission add_group Can add group 2
10 auth.permission add_message Can add message 4
# save to csv
df.to_csv('test.csv', index=False, encoding='utf-8')
As mentioned in the previous answers, the difficulty in converting json to csv is that a json file can contain nested dictionaries and therefore be a multidimensional data structure, versus a csv, which is a 2D data structure. However, a good way to turn a multidimensional structure into a csv is to have multiple csvs that tie together with primary keys.
In your example, the first csv output has the columns "pk", "model", "fields". Values for "pk" and "model" are easy to get, but because the "fields" column contains a dictionary, it should be its own csv, and because "codename" appears to be the primary key, you can use it as the input for "fields" to complete the first csv. The second csv contains the dictionary from the "fields" column, with codename as the primary key that can be used to tie the 2 csvs together.
Here is a solution for your json file which converts the nested dictionaries to 2 csvs.
import csv
import json

def readAndWrite(inputFileName, primaryKey=""):
    input = open(inputFileName+".json")
    data = json.load(input)
    input.close()

    header = set()

    if primaryKey != "":
        outputFileName = inputFileName+"-"+primaryKey
        if inputFileName == "data":
            for i in data:
                for j in i["fields"].keys():
                    if j not in header:
                        header.add(j)
    else:
        outputFileName = inputFileName
        for i in data:
            for j in i.keys():
                if j not in header:
                    header.add(j)

    with open(outputFileName+".csv", 'wb') as output_file:
        fieldnames = list(header)
        writer = csv.DictWriter(output_file, fieldnames, delimiter=',', quotechar='"')
        writer.writeheader()
        for x in data:
            row_value = {}
            if primaryKey == "":
                for y in x.keys():
                    yValue = x.get(y)
                    if type(yValue) == int or type(yValue) == bool or type(yValue) == float or type(yValue) == list:
                        row_value[y] = str(yValue).encode('utf8')
                    elif type(yValue) != dict:
                        row_value[y] = yValue.encode('utf8')
                    else:
                        if inputFileName == "data":
                            row_value[y] = yValue["codename"].encode('utf8')
                            readAndWrite(inputFileName, primaryKey="codename")
                writer.writerow(row_value)
            elif primaryKey == "codename":
                for y in x["fields"].keys():
                    yValue = x["fields"].get(y)
                    if type(yValue) == int or type(yValue) == bool or type(yValue) == float or type(yValue) == list:
                        row_value[y] = str(yValue).encode('utf8')
                    elif type(yValue) != dict:
                        row_value[y] = yValue.encode('utf8')
                writer.writerow(row_value)

readAndWrite("data")
I know it has been a long time since this question was asked, but I thought I might add to everyone else’s answer and share a blog post that I think explains the solution in a very concise way.
csvwriter = csv.writer(employ_data)

count = 0
for emp in emp_data:
    if count == 0:
        header = emp.keys()
        csvwriter.writerow(header)
        count += 1
    csvwriter.writerow(emp.values())
Make sure to close the file in order to save the contents
employ_data.close()
Answer 11
It is not a very smart way to do it, but I have had the same problem and this worked for me:
import csv
import json

f = open('data.json')
data = json.load(f)
f.close()

new_data = []
for i in data:
    flat = {}
    names = i.keys()
    for n in names:
        try:
            if len(i[n].keys()) > 0:
                for ii in i[n].keys():
                    flat[n+"_"+ii] = i[n][ii]
        except:
            flat[n] = i[n]
    new_data.append(flat)

f = open(filename, "w")
writer = csv.DictWriter(f, new_data[0].keys())
writer.writeheader()
for row in new_data:
    writer.writerow(row)
f.close()
Alec’s answer is great, but it doesn’t work in the case where there are multiple levels of nesting. Here’s a modified version that supports multiple levels of nesting. It also makes the header names a bit nicer if the nested object already specifies its own key (e.g. Firebase Analytics / BigTable / BigQuery data):
"""Converts JSON with nested fields into a flattened CSV file.
"""
import sys
import json
import csv
import os
import jsonlines
from orderedset import OrderedSet
# from https://stackoverflow.com/a/28246154/473201
def flattenjson( b, prefix='', delim='/', val=None ):
if val is None:
val = {}
if isinstance( b, dict ):
for j in b.keys():
flattenjson(b[j], prefix + delim + j, delim, val)
elif isinstance( b, list ):
get = b
for j in range(len(get)):
key = str(j)
# If the nested data contains its own key, use that as the header instead.
if isinstance( get[j], dict ):
if 'key' in get[j]:
key = get[j]['key']
flattenjson(get[j], prefix + delim + key, delim, val)
else:
val[prefix] = b
return val
def main(argv):
if len(argv) < 2:
raise Error('Please specify a JSON file to parse')
print "Loading and Flattening..."
filename = argv[1]
allRows = []
fieldnames = OrderedSet()
with jsonlines.open(filename) as reader:
for obj in reader:
# print 'orig:\n'
# print obj
flattened = flattenjson(obj)
#print 'keys: %s' % flattened.keys()
# print 'flattened:\n'
# print flattened
fieldnames.update(flattened.keys())
allRows.append(flattened)
print "Exporting to CSV..."
outfilename = filename + '.csv'
count = 0
with open(outfilename, 'w') as file:
csvwriter = csv.DictWriter(file, fieldnames=fieldnames)
csvwriter.writeheader()
for obj in allRows:
# print 'allRows:\n'
# print obj
csvwriter.writerow(obj)
count += 1
print "Wrote %d rows" % count
if __name__ == '__main__':
main(sys.argv)
Answer 13
This works relatively well.
It flattens the json to write it to a csv file.
Nested elements are managed :)
That’s for python 3
import json

o = json.loads('your json string')  # Be careful, o must be a list, each of its objects will make a line of the csv.

def flatten(o, k='/'):
    global l, c_line
    if isinstance(o, dict):
        for key, value in o.items():
            flatten(value, k + '/' + key)
    elif isinstance(o, list):
        for ov in o:
            flatten(ov, '')
    elif isinstance(o, str):
        o = o.replace('\r', ' ').replace('\n', ' ').replace(';', ',')
        if not k in l:
            l[k] = {}
        l[k][c_line] = o

def render_csv(l):
    ftime = True
    for i in range(100):  #len(l[list(l.keys())[0]])
        for k in l:
            if ftime:
                print('%s;' % k, end='')
                continue
            v = l[k]
            try:
                print('%s;' % v[i], end='')
            except:
                print(';', end='')
        print()
        ftime = False
        i = 0

def json_to_csv(object_list):
    global l, c_line
    l = {}
    c_line = 0
    for ov in object_list:  # Assumes json is a list of objects
        flatten(ov)
        c_line += 1
    render_csv(l)

json_to_csv(o)
enjoy.
Answer 14

My simple way to solve this:

Create a new Python file, e.g. json_to_csv.py

Add this code:
import csv, json, sys

#if you are not using utf-8 files, remove the next line
sys.setdefaultencoding("UTF-8")

#check if you pass the input file and output file
if sys.argv[1] is not None and sys.argv[2] is not None:
    fileInput = sys.argv[1]
    fileOutput = sys.argv[2]
    inputFile = open(fileInput)
    outputFile = open(fileOutput, 'w')
    data = json.load(inputFile)
    inputFile.close()
    output = csv.writer(outputFile)
    output.writerow(data[0].keys())  # header row
    for row in data:
        output.writerow(row.values())
After adding this code, save the file and run it at the terminal, passing the input and output files as arguments, e.g. python json_to_csv.py data.json data.csv
Surprisingly, I found that none of the answers posted here so far correctly deal with all possible scenarios (e.g., nested dicts, nested lists, None values, etc).
This solution should work across all scenarios:
def flatten_json(json):
def process_value(keys, value, flattened):
if isinstance(value, dict):
for key in value.keys():
process_value(keys + [key], value[key], flattened)
elif isinstance(value, list):
for idx, v in enumerate(value):
process_value(keys + [str(idx)], v, flattened)
else:
flattened['__'.join(keys)] = value
flattened = {}
for key in json.keys():
process_value([key], json[key], flattened)
return flattened
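For example, a quick sketch of how it might be used (the sample record is made up):

record = {"pk": 22, "fields": {"name": "Can add log entry", "tags": ["a", "b"]}}
print(flatten_json(record))
# {'pk': 22, 'fields__name': 'Can add log entry', 'fields__tags__0': 'a', 'fields__tags__1': 'b'}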
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 17 20:35:35 2019
author: Ram
"""
import json
import csv

with open("file1.json") as file:
    data = json.load(file)

# create the csv writer object
pt_data1 = open('pt_data1.csv', 'w')
csvwriter = csv.writer(pt_data1)

count = 0
for pt in data:
    if count == 0:
        header = pt.keys()
        csvwriter.writerow(header)
        count += 1
    csvwriter.writerow(pt.values())

pt_data1.close()
Answer 18
Modified Alec McGail’s answer to support JSON with lists inside
def flattenjson(self, mp, delim="|"):
ret = []
if isinstance(mp, dict):
for k in mp.keys():
csvs = self.flattenjson(mp[k], delim)
for csv in csvs:
ret.append(k + delim + csv)
elif isinstance(mp, list):
for k in mp:
csvs = self.flattenjson(k, delim)
for csv in csvs:
ret.append(csv)
else:
ret.append(mp)
return ret
Thanks!
Answer 19
import json, csv

t = ''
t = (type('a'))
json_data = []
data = None
write_header = True
item_keys = []

try:
    with open('kk.json') as json_file:
        json_data = json_file.read()
        data = json.loads(json_data)
except Exception as e:
    print(e)

with open('bar.csv', 'at') as csv_file:
    writer = csv.writer(csv_file)  #, quoting=csv.QUOTE_MINIMAL)
    for item in data:
        item_values = []
        for key in item:
            if write_header:
                item_keys.append(key)
            value = item.get(key, '')
            if (type(value) == t):
                item_values.append(value.encode('utf-8'))
            else:
                item_values.append(value)
        if write_header:
            writer.writerow(item_keys)
            write_header = False
        writer.writerow(item_values)
The code below will convert the json file (data3.json) to a csv file (data3.csv).

import json
import csv

with open("/Users/Desktop/json/data3.json") as file:
    data = json.load(file)
    file.close()

print(data)

fname = "/Users/Desktop/json/data3.csv"

with open(fname, "w", newline='') as file:
    csv_file = csv.writer(file)
    csv_file.writerow(['dept', 'class', 'subclass'])
    for item in data["item_data"]:
        csv_file.writerow([item.get('item_data').get('dept'),
                           item.get('item_data').get('class'),
                           item.get('item_data').get('subclass')])

The above code was executed in a locally installed PyCharm and successfully converted the json file to the csv file. I hope this helps with converting the files.
Since the data appears to be in a dictionary format, it would appear that you should use csv.DictWriter() to output the lines with the appropriate header information. This should make the conversion somewhat easier to handle. The fieldnames parameter sets up the order properly, while writing the first line as the headers allows it to be read and processed later by csv.DictReader().
For example, Mike Repass used
output = csv.writer(sys.stdout)
output.writerow(data[0].keys()) # header row
for row in data:
    output.writerow(row.values())
However just change the initial setup to
output = csv.DictWriter(filesetting, fieldnames=data[0].keys())
Note that since the order of elements in a dictionary is not defined, you might have to create fieldnames entries explicitly. Once you do that, the writerow will work. The writes then work as originally shown.
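Put together, a hedged sketch of the whole flow (the sample data is made up, and sys.stdout stands in for any writable file object):

import csv
import sys

data = [{"pk": 1, "model": "auth.permission"},
        {"pk": 2, "model": "auth.group"}]

output = csv.DictWriter(sys.stdout, fieldnames=data[0].keys())
output.writeheader()
for row in data:
    output.writerow(row)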
Unfortunately I do not have enough reputation to make a small contribution to the amazing @Alec McGail answer.
I was using Python 3 and I needed to convert the map to a list, following the @Alexis R comment.
Additionally, I found the csv writer was adding an extra CR to the file (I had an empty line for each line with data inside the csv file). The solution was very easy, following the @Jason R. Coombs answer to this thread:
CSV in Python adding an extra carriage return
You simply need to add the lineterminator='\n' parameter to the csv.writer. It will be: csv_w = csv.writer( out_file, lineterminator='\n' )
You can use this code to convert a json file to csv file
After reading the file, I am converting the object to pandas dataframe and then saving this to a CSV file
import os
import pandas as pd
import json
import numpy as np

data = []
os.chdir('D:\\Your_directory\\folder')
with open('file_name.json', encoding="utf8") as data_file:
    for line in data_file:
        data.append(json.loads(line))

dataframe = pd.DataFrame(data)

## Saving the dataframe to a csv file
dataframe.to_csv("filename.csv", encoding='utf-8', index=False)
Answer 24
I might be late to the party, but I think I have dealt with a similar problem. I had a json file which looked like this
I only wanted to extract a few keys/values from these json files. So, I wrote the following code to extract the same.
"""json_to_csv.py
This script reads n numbers of json files present in a folder and then extract certain data from each file and write in a csv file.
The folder contains the python script i.e. json_to_csv.py, output.csv and another folder descriptions containing all the json files.
"""
import os
import json
import csv
def get_list_of_json_files():
"""Returns the list of filenames of all the Json files present in the folder
Parameter
---------
directory : str
'descriptions' in this case
Returns
-------
list_of_files: list
List of the filenames of all the json files
"""
list_of_files = os.listdir('descriptions') # creates list of all the files in the folder
return list_of_files
def create_list_from_json(jsonfile):
"""Returns a list of the extracted items from json file in the same order we need it.
Parameter
_________
jsonfile : json
The json file containing the data
Returns
-------
one_sample_list : list
The list of the extracted items needed for the final csv
"""
with open(jsonfile) as f:
data = json.load(f)
data_list = [] # create an empty list
# append the items to the list in the same order.
data_list.append(data['_id'])
data_list.append(data['_modelType'])
data_list.append(data['creator']['_id'])
data_list.append(data['creator']['name'])
data_list.append(data['dataset']['_accessLevel'])
data_list.append(data['dataset']['_id'])
data_list.append(data['dataset']['description'])
data_list.append(data['dataset']['name'])
data_list.append(data['meta']['acquisition']['image_type'])
data_list.append(data['meta']['acquisition']['pixelsX'])
data_list.append(data['meta']['acquisition']['pixelsY'])
data_list.append(data['meta']['clinical']['age_approx'])
data_list.append(data['meta']['clinical']['benign_malignant'])
data_list.append(data['meta']['clinical']['diagnosis'])
data_list.append(data['meta']['clinical']['diagnosis_confirm_type'])
data_list.append(data['meta']['clinical']['melanocytic'])
data_list.append(data['meta']['clinical']['sex'])
data_list.append(data['meta']['unstructured']['diagnosis'])
# In few json files, the race was not there so using KeyError exception to add '' at the place
try:
data_list.append(data['meta']['unstructured']['race'])
except KeyError:
data_list.append("") # will add an empty string in case race is not there.
data_list.append(data['name'])
return data_list
def write_csv():
"""Creates the desired csv file
Parameters
__________
list_of_files : file
The list created by get_list_of_json_files() method
result.csv : csv
The csv file containing the header only
Returns
_______
result.csv : csv
The desired csv file
"""
list_of_files = get_list_of_json_files()
for file in list_of_files:
row = create_list_from_json(f'descriptions/{file}') # create the row to be added to csv for each file (json-file)
with open('output.csv', 'a') as c:
writer = csv.writer(c)
writer.writerow(row)
c.close()
if __name__ == '__main__':
write_csv()
I hope this will help. For details on how this code works, you can check here.
I am trying to create a .csv file with the values from a Python list. When I print the values in the list they are all unicode (?), i.e. they look something like this
[u'value 1', u'value 2', ...]
If I iterate through the values in the list i.e. for v in mylist: print v they appear to be plain text.
And I can put a , between each with print ','.join(mylist)
Use python’s csv module for reading and writing comma or tab-delimited files. The csv module is preferred because it gives you good control over quoting.
For example, here is the worked example for you:
import csv
data = ["value %d" % i for i in range(1,4)]
out = csv.writer(open("myfile.csv","w"), delimiter=',',quoting=csv.QUOTE_ALL)
out.writerow(data)
Produces:
"value 1","value 2","value 3"
Answer 5
You could use the string.join method in this case.
Split over a few lines for clarity – here’s an interactive session:
>>> a = ['a','b','c']
>>> first = '", "'.join(a)
>>> second = '"%s"' % first
>>> print second
"a", "b", "c"
Or as a single line
>>> print ('"%s"') % '", "'.join(a)
"a", "b", "c"
However, you may have a problem if your strings have embedded quotes. If this is the case, you’ll need to decide how to escape them.
The CSV module can take care of all of this for you, allowing you to choose between various quoting options (all fields, only fields with quotes and separators, only non-numeric fields, etc.) and how to escape control characters (doubled quotes, or escaped strings). If your values are simple, string.join will probably be OK, but if you have to manage lots of edge cases, use the module available.
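As an illustration of those quoting options, a small sketch:

import csv
import sys

writer = csv.writer(sys.stdout, quoting=csv.QUOTE_NONNUMERIC)
writer.writerow(['value 1', 'value 2', 3])  # quotes the strings, leaves the number bare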
Answer 6
This solution sounds crazy, but works as smooth as honey:

import csv

with open('filename', 'wb') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL, delimiter='\n')
    wr.writerow(mylist)
The file is written by csvwriter, hence csv properties are maintained, i.e. comma separated. The delimiter helps in the main part by moving each list item to the next line.
The example below demonstrates creating and writing a csv file. To make a dynamic file writer we need to import the csv package, then create an instance of the file with a file reference, e.g.:

with open("D:\\sample.csv", "w", newline="") as file_writer

If the file does not exist in the mentioned directory, Python will create a file there with the same name; "w" represents write. If you want to read a file, replace "w" with "r", or use "a" to append to an existing file. newline="" removes the extra empty row that would otherwise be written after each data row. Create some field names (column names) using a list, like fields=["Names","Age","Class"], then apply them to the writer instance like

writer = csv.DictWriter(file_writer, fieldnames=fields)

Here we use a dictionary writer and assign the column names. To write the column names to the csv we use writer.writeheader(), and to write values we use writer.writerow({"Names":"John","Age":20,"Class":"12A"}). While writing, file values must be passed using the dictionary method, where the key is the column name and the value is the respective key's value.
import csv

with open("D:\\sample.csv", "w", newline="") as file_writer:
    fields = ["Names", "Age", "Class"]
    writer = csv.DictWriter(file_writer, fieldnames=fields)
    writer.writeheader()
    writer.writerow({"Names": "John", "Age": 21, "Class": "12A"})
Answer 8

Jupyter Notebook

Assume your list is A. Then you can use the following code to save it as a csv file (columns only!):
R="\n".join(A)
f = open('Columns.csv','w')
f.write(R)
f.close()
You should use the CSV module for sure, but chances are you need to write Unicode. For those who need to write Unicode, these are the classes from the examples page that you can use as a util module:
import csv, codecs, cStringIO
class UTF8Recoder:
"""
Iterator that reads an encoded stream and reencodes the input to UTF-8
"""
def __init__(self, f, encoding):
self.reader = codecs.getreader(encoding)(f)
def __iter__(self):
return self
def next(self):
return self.reader.next().encode("utf-8")
class UnicodeReader:
"""
A CSV reader which will iterate over lines in the CSV file "f",
which is encoded in the given encoding.
"""
def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
f = UTF8Recoder(f, encoding)
self.reader = csv.reader(f, dialect=dialect, **kwds)
def next(self):
row = self.reader.next()
return [unicode(s, "utf-8") for s in row]
def __iter__(self):
return self
class UnicodeWriter:
"""
A CSV writer which will write rows to CSV file "f",
which is encoded in the given encoding.
"""
def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
# Redirect output to a queue
self.queue = cStringIO.StringIO()
self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
self.stream = f
self.encoder = codecs.getincrementalencoder(encoding)()
def writerow(self, row):
self.writer.writerow([s.encode("utf-8") for s in row])
# Fetch UTF-8 output from the queue ...
data = self.queue.getvalue()
data = data.decode("utf-8")
# ... and reencode it into the target encoding
data = self.encoder.encode(data)
# write to the target stream
self.stream.write(data)
# empty queue
self.queue.truncate(0)
def writerows(self, rows):
for row in rows:
self.writerow(row)
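These classes target Python 2 (cStringIO and the unicode type are gone in Python 3, where the csv module handles Unicode natively). A minimal usage sketch with made-up data:
rows = [[u'caf\xe9', u'na\xefve'], [u'hello', u'world']]
with open('unicode_out.csv', 'wb') as f:
    writer = UnicodeWriter(f, encoding='utf-8')
    writer.writerows(rows)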
Answer 10
Here is another solution that does not require the csv module:
print ', '.join(['"' + i + '"' for i in myList])
Example:
>>> myList = [u'value 1', u'value 2', u'value 3']
>>> print ', '.join(['"' + i + '"' for i in myList])
"value 1", "value 2", "value 3"
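Keep in mind that this quick approach breaks as soon as a value itself contains a double quote or a character that needs escaping; the csv module handles those edge cases for you.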
I have a script reading in a csv file with very large fields:
# example from http://docs.python.org/3.3/library/csv.html?highlight=csv%20dictreader#examples
import csv
with open('some.csv', newline='') as f:
reader = csv.reader(f)
for row in reader:
print(row)
However, this throws the following error on some csv files:
_csv.Error: field larger than field limit (131072)
How can I analyze csv files with huge fields? Skipping the lines with huge fields is not an option as the data needs to be analyzed in subsequent steps.
As Geoff pointed out, simply raising the limit to the maximum with csv.field_size_limit(sys.maxsize) might result in the following error: OverflowError: Python int too large to convert to C long.
To circumvent this, you could use the following quick and dirty code (which should work on every system with Python 2 and Python 3):
import sys
import csv
maxInt = sys.maxsize
while True:
# decrease the maxInt value by factor 10
# as long as the OverflowError occurs.
try:
csv.field_size_limit(maxInt)
break
except OverflowError:
maxInt = int(maxInt/10)
However, when dealing with a .csv file (with the correct quoting and delimiter) that has at least one field longer than this size, the error pops up. To get rid of it, the size limit should be increased (to avoid any worries, the maximum possible value is attempted).
Behind the scenes (check [GitHub]: python/cpython - (master) cpython/Modules/_csv.c for implementation details), the variable that holds this value is a C long ([Wikipedia]: C data types), whose size varies depending on CPU architecture and OS (the ILP data model). The classical difference: on a 64-bit OS (and Python build), the long type size (in bits) is:
Nix: 64
Win: 32
When attempting to set it, the new value is checked to fit within the long boundaries; that's why in some cases another exception pops up (this case is common on Win):
>>> import sys
>>>
>>> sys.platform, sys.maxsize
('win32', 9223372036854775807)
>>>
>>> csv.field_size_limit(sys.maxsize)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OverflowError: Python int too large to convert to C long
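Instead of the trial-and-error loop shown earlier, the platform's actual maximum C long can also be computed directly; a sketch using ctypes (an alternative, not from the original answer):
import ctypes
import csv

# sizeof(long) differs across platforms (e.g. 4 bytes on 64-bit Windows),
# so derive the largest value a signed C long can hold.
max_long = 2 ** (8 * ctypes.sizeof(ctypes.c_long) - 1) - 1
csv.field_size_limit(max_long)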
I just had this happen to me on a 'plain' CSV file. Some people might call it an invalidly formatted file: no escape characters, no double quoting, and the delimiter was a semicolon.
A sample line from this file would look like this:
First cell; Second " Cell with one double quote and leading space;'Partially quoted' cell;Last cell
The quote in the second cell would throw the parser off its rails. What worked was disabling quote recognition in the reader, as sketched below.
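A minimal sketch of that approach (the filename is an assumption; QUOTE_NONE makes the reader treat quote characters as ordinary data, so a stray quote can no longer swallow the rest of the file into one oversized field):
import csv

with open('plain.csv') as f:  # hypothetical filename
    reader = csv.reader(f, delimiter=';', quoting=csv.QUOTE_NONE)
    for row in reader:
        print(row)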
This answer applies when a string is manually entered, not when it’s read from somewhere.
A traditional variable-width CSV is hard to read when stored as a string variable. Especially for use inside a .py file, consider fixed-width pipe-separated data instead. Various IDEs and editors may have a plugin to format pipe-separated text into a neat table.
Store the following in a utility module, e.g. util/pandas.py. An example is included in the function’s docstring.
import io
import re
import pandas as pd
def read_psv(str_input: str, **kwargs) -> pd.DataFrame:
"""Read a Pandas object from a pipe-separated table contained within a string.
Input example:
| int_score | ext_score | eligible |
| | 701 | True |
| 221.3 | 0 | False |
| | 576 | True |
| 300 | 600 | True |
The leading and trailing pipes are optional, but if one is present,
so must be the other.
`kwargs` are passed to `read_csv`. They must not include `sep`.
In PyCharm, the "Pipe Table Formatter" plugin has a "Format" feature that can
be used to neatly format a table.
Ref: https://stackoverflow.com/a/46471952/
"""
substitutions = [
('^ *', ''), # Remove leading spaces
(' *$', ''), # Remove trailing spaces
(r' *\| *', '|'), # Remove spaces between columns
]
if all(line.lstrip().startswith('|') and line.rstrip().endswith('|') for line in str_input.strip().split('\n')):
substitutions.extend([
(r'^\|', ''), # Remove redundant leading delimiter
(r'\|$', ''), # Remove redundant trailing delimiter
])
for pattern, replacement in substitutions:
str_input = re.sub(pattern, replacement, str_input, flags=re.MULTILINE)
return pd.read_csv(io.StringIO(str_input), sep='|', **kwargs)
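A quick usage sketch of read_psv with made-up data:
table = """
| name  | score |
| alice | 10    |
| bob   | 20    |
"""
df = read_psv(table)
print(df)  # a two-row DataFrame with columns 'name' and 'score'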
Non-working alternatives
The code below doesn’t work properly because it adds an empty column on both the left and right sides.
As for read_fwf, it doesn't actually support many of the optional kwargs that read_csv accepts and uses. As such, it shouldn't be used at all for pipe-separated data.