Tag Archives: csv

Read specific columns from a csv file with the csv module?

Question: Read specific columns from a csv file with the csv module?


I’m trying to parse through a csv file and extract the data from only specific columns.

Example csv:

ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | C... | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |

I’m trying to capture only specific columns, say ID, Name, Zip and Phone.

Code I’ve looked at has led me to believe I can call the specific column by its corresponding number, i.e. Name would correspond to 2, and iterating through each row using row[2] would produce all the items in column 2. Only it doesn’t.

Here’s what I’ve done so far:

import sys, argparse, csv
from settings import *

# command arguments
parser = argparse.ArgumentParser(description='csv to postgres',\
 fromfile_prefix_chars="@" )
parser.add_argument('file', help='csv file to import', action='store')
args = parser.parse_args()
csv_file = args.file

# open csv file
with open(csv_file, 'rb') as csvfile:

    # get number of columns
    for line in csvfile.readlines():
        array = line.split(',')
        first_item = array[0]

    num_columns = len(array)
    csvfile.seek(0)

    reader = csv.reader(csvfile, delimiter=' ')
        included_cols = [1, 2, 6, 7]

    for row in reader:
            content = list(row[i] for i in included_cols)
            print content

and I’m expecting that this will print out only the specific columns I want for each row, except it doesn’t; I get the last column only.


Answer 0


The only way you would be getting the last column from this code is if you don’t include your print statement in your for loop.

This is most likely the end of your code:

for row in reader:
    content = list(row[i] for i in included_cols)
print content

You want it to be this:

for row in reader:
        content = list(row[i] for i in included_cols)
        print content

Now that we have covered your mistake, I would like to take this time to introduce you to the pandas module.

Pandas is spectacular for dealing with csv files, and the following code would be all you need to read a csv and save an entire column into a variable:

import pandas as pd
df = pd.read_csv(csv_file)
saved_column = df.column_name #you can also use df['column_name']

so if you wanted to save all of the info in your column Names into a variable, this is all you need to do:

names = df.Names

It’s a great module and I suggest you look into it. If for some reason your print statement was in the for loop and it was still only printing out the last column, that shouldn’t happen; let me know if my assumption was wrong. Your posted code has a lot of indentation errors, so it was hard to know what was supposed to be where. Hope this was helpful!
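
For the OP’s specific columns, a minimal sketch of the pandas approach might look like this (hedged: it reuses the OP’s csv_file variable and assumes a comma-delimited file whose header row contains these column names):

import pandas as pd

df = pd.read_csv(csv_file)
# select several columns at once; the result is a smaller DataFrame
subset = df[['ID', 'Name', 'Zip', 'Phone']]
print(subset)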


Answer 1


import csv
from collections import defaultdict

columns = defaultdict(list) # each value in each column is appended to a list

with open('file.txt') as f:
    reader = csv.DictReader(f) # read rows into a dictionary format
    for row in reader: # read a row as {column1: value1, column2: value2,...}
        for (k,v) in row.items(): # go over each column name and value 
            columns[k].append(v) # append the value into the appropriate list
                                 # based on column name k

print(columns['name'])
print(columns['phone'])
print(columns['street'])

With a file like

name,phone,street
Bob,0893,32 Silly
James,000,400 McHilly
Smithers,4442,23 Looped St.

Will output

>>> 
['Bob', 'James', 'Smithers']
['0893', '000', '4442']
['32 Silly', '400 McHilly', '23 Looped St.']

Or alternatively if you want numerical indexing for the columns:

with open('file.txt') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row; works on both Python 2 and 3
    for row in reader:
        for (i,v) in enumerate(row):
            columns[i].append(v)
print(columns[0])

>>> 
['Bob', 'James', 'Smithers']

To change the delimiter, add delimiter=" " to the appropriate instantiation, i.e. reader = csv.reader(f, delimiter=" ")


Answer 2


Use pandas:

import pandas as pd
my_csv = pd.read_csv(filename)
column = my_csv.column_name
# you can also use my_csv['column_name']

Discard unneeded columns at parse time:

my_filtered_csv = pd.read_csv(filename, usecols=['col1', 'col3', 'col7'])

P.S. I’m just aggregating what others have said in a simple manner. The actual answers are taken from here and here.


Answer 3


With pandas you can use read_csv with usecols parameter:

df = pd.read_csv(filename, usecols=['col1', 'col3', 'col7'])

Example:

import pandas as pd
import io

s = '''
total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.5,Male,No,Sun,Dinner,3
'''

df = pd.read_csv(io.StringIO(s), usecols=['total_bill', 'day', 'size'])
print(df)

   total_bill  day  size
0       16.99  Sun     2
1       10.34  Sun     3
2       21.01  Sun     3

Answer 4


You can use numpy.loadtxt(filename). For example, if this is your database .csv:

ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | Adam | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Carl | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Adolf | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Den | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |

And you want the Name column:

import numpy as np 
b=np.loadtxt(r'filepath\name.csv',dtype=str,delimiter='|',skiprows=1,usecols=(1,))

>>> b
array([' Adam ', ' Carl ', ' Adolf ', ' Den '], 
      dtype='|S7')

More easily, you can use genfromtxt:

b = np.genfromtxt(r'filepath\name.csv', delimiter='|', names=True,dtype=None)
>>> b['Name']
array([' Adam ', ' Carl ', ' Adolf ', ' Den '], 
      dtype='|S7')

Answer 5


Context: For this type of work you should use the amazing python petl library. That will save you a lot of work and potential frustration from doing things ‘manually’ with the standard csv module. AFAIK, the only people who still use the csv module are those who have not yet discovered better tools for working with tabular data (pandas, petl, etc.), which is fine, but if you plan to work with a lot of data in your career from various strange sources, learning something like petl is one of the best investments you can make. To get started should only take 30 minutes after you’ve done pip install petl. The documentation is excellent.

Answer: Let’s say you have the first table in a csv file (you can also load directly from the database using petl). Then you would simply load it and do the following.

from petl import fromcsv, look, cut, tocsv 

#Load the table
table1 = fromcsv('table1.csv')
# Alter the columns
table2 = cut(table1, 'Song_Name','Artist_ID')
#have a quick look to make sure things are ok. Prints a nicely formatted table to your console
print look(table2)
# Save to new file
tocsv(table2, 'new.csv')

Answer 6


I think there is an easier way

import pandas as pd

dataset = pd.read_csv('table1.csv')
ftCol = dataset.iloc[:, 0].values

So here, in iloc[:, 0], the : means all rows and the 0 means the position of the column. In the example below, ID will be selected:

ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | C... | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
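
To grab several columns by position at once, the same idea extends to a list of indices (a sketch; positions 0, 1, 5 and 6 correspond to ID, Name, Zip and Phone in the sample header above):

import pandas as pd

dataset = pd.read_csv('table1.csv')
# a list of positions selects multiple columns in one shot
subset = dataset.iloc[:, [0, 1, 5, 6]].values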

Answer 7

import pandas as pd 
csv_file = pd.read_csv("file.csv") 
column_val_list = csv_file.column_name._ndarray_values

Answer 8


Thanks to the way you can index and subset a pandas dataframe, a very easy way to extract a single column from a csv file into a variable is:

myVar = pd.read_csv('YourPath', sep = ",")['ColumnName']

A few things to consider:

The snippet above will produce a pandas Series and not a dataframe. The suggestion from ayhan with usecols will also be faster if speed is an issue. Testing the two different approaches using %timeit on a 2122 KB sized csv file yields 22.8 ms for the usecols approach and 53 ms for my suggested approach.

And don’t forget import pandas as pd
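
For comparison, the faster usecols variant mentioned above would be (a sketch using the same placeholder names):

myVar = pd.read_csv('YourPath', sep=",", usecols=['ColumnName'])['ColumnName']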


Answer 9


If you need to process the columns separately, I like to destructure the columns with the zip(*iterable) pattern (effectively “unzip”). So for your example:

ids, names, zips, phones = zip(*(
  (row[1], row[2], row[6], row[7])
  for row in reader
))
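
A self-contained sketch of the same pattern, keeping the question’s column indices (the file name is a placeholder, and a comma-delimited file with a header row is assumed):

import csv

with open('data.csv') as f:  # hypothetical file name
    reader = csv.reader(f)
    next(reader)             # skip the header row
    ids, names, zips, phones = zip(*(
        (row[1], row[2], row[6], row[7])
        for row in reader
    ))

print(ids)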

Answer 10


To fetch the column names, instead of using readlines(), better to use readline(); this avoids looping over and reading the complete file and storing it in an array.

with open(csv_file, 'rb') as csvfile:

    # get number of columns

    line = csvfile.readline()

    first_item = line.split(',')
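
Completing that sketch, the column count then falls out of the split header line (keeping the answer’s Python 2 file mode, and assuming a comma-delimited file):

with open(csv_file, 'rb') as csvfile:
    line = csvfile.readline()      # read only the header line
    first_item = line.split(',')   # list of column names
    num_columns = len(first_item)  # number of columns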

Pandas read_csv from url

Question: Pandas read_csv from url


I am using Python 3.4 with IPython and have the following code. I’m unable to read a csv-file from the given URL:

import pandas as pd
import requests

url="https://github.com/cs109/2014_data/blob/master/countries.csv"
s=requests.get(url).content
c=pd.read_csv(s)

I have the following error

“Expected file path name or file-like object, got type”

How can I fix this?


Answer 0


Update

From pandas 0.19.2 you can now just pass the url directly.


Just as the error suggests, pandas.read_csv needs a file-like object as the first argument.

If you want to read the csv from a string, you can use io.StringIO (Python 3.x) or StringIO.StringIO (Python 2.x).

Also, for the URL – https://github.com/cs109/2014_data/blob/master/countries.csv – you are getting back an html response, not raw csv; you should use the url given by the Raw link on the github page to get the raw csv response, which is https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv

Example –

import pandas as pd
import io
import requests
url="https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
s=requests.get(url).content
c=pd.read_csv(io.StringIO(s.decode('utf-8')))

Answer 1


In the latest version of pandas (0.19.2) you can directly pass the url

import pandas as pd

url="https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
c=pd.read_csv(url)

Answer 2


As I commented, you need to use a StringIO object and decode, i.e. c=pd.read_csv(io.StringIO(s.decode("utf-8"))) if using requests; you need to decode because .content returns bytes. If you used .text you would just need to pass s as is: s = requests.get(url).text; c = pd.read_csv(StringIO(s)).

A simpler approach is to pass the correct url of the raw data directly to read_csv; you don’t have to pass a file-like object, you can pass a url, so you don’t need requests at all:

c = pd.read_csv("https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv")

print(c)

Output:

                              Country         Region
0                             Algeria         AFRICA
1                              Angola         AFRICA
2                               Benin         AFRICA
3                            Botswana         AFRICA
4                             Burkina         AFRICA
5                             Burundi         AFRICA
6                            Cameroon         AFRICA
..................................

From the docs:

filepath_or_buffer :

string or file handle / StringIO. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.csv


Answer 3


The problem you’re having is that the output you get into the variable ‘s’ is not a csv, but an html file. In order to get the raw csv, you have to modify the url to:

https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv

Your second problem is that read_csv expects a file name; we can solve this by using StringIO from the io module. The third problem is that requests.get(url).content delivers a byte stream; we can solve this by using requests.get(url).text instead.

End result is this code:

from io import StringIO

import pandas as pd
import requests
url='https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv'
s=requests.get(url).text

c=pd.read_csv(StringIO(s))

output:

>>> c.head()
    Country  Region
0   Algeria  AFRICA
1    Angola  AFRICA
2     Benin  AFRICA
3  Botswana  AFRICA
4   Burkina  AFRICA

Answer 4

url = "https://github.com/cs109/2014_data/blob/master/countries.csv"
c = pd.read_csv(url, sep = "\t")
url = "https://github.com/cs109/2014_data/blob/master/countries.csv"
c = pd.read_csv(url, sep = "\t")

Answer 5


To import data through a URL in pandas, just apply the simple code below; it actually works better.

import pandas as pd
train = pd.read_table("https://urlandfile.com/dataset.csv")
train.head()

If you are having issues with the raw data, just put ‘r’ before the URL:

import pandas as pd
train = pd.read_table(r"https://urlandfile.com/dataset.csv")
train.head()

How to load a tsv file into a Pandas DataFrame?

Question: How to load a tsv file into a Pandas DataFrame?


I’m new to python and pandas. I’m trying to get a tsv file loaded into a pandas DataFrame.

This is what I’m trying and the error I’m getting:

>>> df1 = DataFrame(csv.reader(open('c:/~/trainSetRel3.txt'), delimiter='\t'))

Traceback (most recent call last):
  File "<pyshell#28>", line 1, in <module>
    df1 = DataFrame(csv.reader(open('c:/~/trainSetRel3.txt'), delimiter='\t'))
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 318, in __init__
    raise PandasError('DataFrame constructor not properly called!')
PandasError: DataFrame constructor not properly called!

Answer 0


Note: As of pandas 0.17.0, from_csv is discouraged: use pd.read_csv instead

The documentation lists a .from_csv function that appears to do what you want:

DataFrame.from_csv('c:/~/trainSetRel3.txt', sep='\t')

If you have a header, you can pass header=0.

DataFrame.from_csv('c:/~/trainSetRel3.txt', sep='\t', header=0)

Answer 1


As of pandas 0.17.0, from_csv is discouraged.

Use pd.read_csv(fpath, sep='\t') or pd.read_table(fpath).


Answer 2


Use read_table(filepath). The default separator is tab.


Answer 3


Try this

df = pd.read_csv("rating-data.tsv",sep='\t')
df.head()

You actually need to fix the sep parameter.


Answer 4


Open the file, save it as .csv, and then apply:

df = pd.read_csv('apps.csv', sep='\t')

For any other format, just change the sep argument accordingly.


Answer 5


df = pd.read_csv('filename.csv', sep='\t', header=0)

You can load the tsv file directly into a pandas DataFrame by specifying the delimiter and header.


datetime dtypes in pandas read_csv

Question: datetime dtypes in pandas read_csv


I’m reading in a csv file with multiple datetime columns. I need to set the data types upon reading in the file, but datetimes appear to be a problem. For instance:

headers = ['col1', 'col2', 'col3', 'col4']
dtypes = ['datetime', 'datetime', 'str', 'float']
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

When run gives a error:

TypeError: data type “datetime” not understood

Converting columns after the fact via pandas.to_datetime() isn’t an option: I can’t know which columns will be datetime objects. That information can change and comes from whatever informs my dtypes list.

Alternatively, I’ve tried to load the csv file with numpy.genfromtxt, set the dtypes in that function, and then convert to a pandas.dataframe but it garbles the data. Any help is greatly appreciated!


Answer 0


Why it does not work

There is no datetime dtype to be set for read_csv as csv files can only contain strings, integers and floats.

Setting a dtype to datetime will make pandas interpret the datetime as an object, meaning you will end up with a string.

Pandas way of solving this

The pandas.read_csv() function has a keyword argument called parse_dates

Using this you can on the fly convert strings, floats or integers into datetimes using the default date_parser (dateutil.parser.parser)

headers = ['col1', 'col2', 'col3', 'col4']
dtypes = {'col1': 'str', 'col2': 'str', 'col3': 'str', 'col4': 'float'}
parse_dates = ['col1', 'col2']
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes, parse_dates=parse_dates)

This will cause pandas to read col1 and col2 as strings, which they most likely are (“2016-05-05” etc.) and after having read the string, the date_parser for each column will act upon that string and give back whatever that function returns.

Defining your own date parsing function:

The pandas.read_csv() function also has a keyword argument called date_parser

Setting this to a lambda function will make that particular function be used for the parsing of the dates.

GOTCHA WARNING

You have to give it the function, not the execution of the function, thus this is Correct

date_parser = pd.datetools.to_datetime

This is incorrect:

date_parser = pd.datetools.to_datetime()

Pandas 0.22 Update

pd.datetools.to_datetime has been relocated to pd.to_datetime, i.e. use date_parser = pd.to_datetime

Thanks @stackoverYC
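
For instance, a hedged sketch of supplying your own parser (the file name and the date format are assumptions for illustration):

import pandas as pd
from datetime import datetime

# parse dates that look like '2016-05-05'; the format string is an assumed example
custom_parser = lambda s: datetime.strptime(s, '%Y-%m-%d')

df = pd.read_csv('file.csv', parse_dates=['col1', 'col2'],
                 date_parser=custom_parser)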


Answer 1


There is a parse_dates parameter for read_csv which allows you to define the names of the columns you want treated as dates or datetimes:

date_cols = ['col1', 'col2']
pd.read_csv(file, sep='\t', header=None, names=headers, parse_dates=date_cols)

Answer 2


You might try passing actual types instead of strings.

import pandas as pd
from datetime import datetime
headers = ['col1', 'col2', 'col3', 'col4'] 
dtypes = [datetime, datetime, str, float] 
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

But it’s going to be really hard to diagnose this without any of your data to tinker with.

And really, you probably want pandas to parse the dates into Timestamps, so that might be:

pd.read_csv(file, sep='\t', header=None, names=headers, parse_dates=True)

Answer 3


I tried using the dtypes=[datetime, …] option, but

import pandas as pd
from datetime import datetime
headers = ['col1', 'col2', 'col3', 'col4'] 
dtypes = [datetime, datetime, str, float] 
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

I encountered the following error:

TypeError: data type not understood

The only change I had to make is to replace datetime with datetime.datetime

import pandas as pd
import datetime  # import the module so that datetime.datetime is available
headers = ['col1', 'col2', 'col3', 'col4'] 
dtypes = [datetime.datetime, datetime.datetime, str, float] 
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

Save Dataframe to csv directly to s3 Python

Question: Save Dataframe to csv directly to s3 Python


I have a pandas DataFrame that I want to upload to a new CSV file. The problem is that I don’t want to save the file locally before transferring it to s3. Is there any method like to_csv for writing the dataframe to s3 directly? I am using boto3.
Here is what I have so far:

import boto3
s3 = boto3.client('s3', aws_access_key_id='key', aws_secret_access_key='secret_key')
read_file = s3.get_object(Bucket, Key)
df = pd.read_csv(read_file['Body'])

# Make alterations to DataFrame

# Then export DataFrame to CSV through direct transfer to s3

Answer 0


You can use:

from io import StringIO # python3; python2: BytesIO 
import boto3

bucket = 'my_bucket_name' # already created on S3
csv_buffer = StringIO()
df.to_csv(csv_buffer)
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket, 'df.csv').put(Body=csv_buffer.getvalue())

Answer 1


You can directly use the S3 path. I am using Pandas 0.24.1

In [1]: import pandas as pd

In [2]: df = pd.DataFrame( [ [1, 1, 1], [2, 2, 2] ], columns=['a', 'b', 'c'])

In [3]: df
Out[3]:
   a  b  c
0  1  1  1
1  2  2  2

In [4]: df.to_csv('s3://experimental/playground/temp_csv/dummy.csv', index=False)

In [5]: pd.__version__
Out[5]: '0.24.1'

In [6]: new_df = pd.read_csv('s3://experimental/playground/temp_csv/dummy.csv')

In [7]: new_df
Out[7]:
   a  b  c
0  1  1  1
1  2  2  2

Release Note:

S3 File Handling

pandas now uses s3fs for handling S3 connections. This shouldn’t break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas. GH11915.


Answer 2


I like s3fs which lets you use s3 (almost) like a local filesystem.

You can do this:

import s3fs

bytes_to_write = df.to_csv(None).encode()
fs = s3fs.S3FileSystem(key=key, secret=secret)
with fs.open('s3://bucket/path/to/file.csv', 'wb') as f:
    f.write(bytes_to_write)

s3fs supports only rb and wb modes of opening the file, that’s why I did this bytes_to_write stuff.
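
Reading the object back works the same way (a sketch; the bucket path and credentials are placeholders, as above):

import s3fs
import pandas as pd

fs = s3fs.S3FileSystem(key=key, secret=secret)
with fs.open('s3://bucket/path/to/file.csv', 'rb') as f:
    df = pd.read_csv(f)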


Answer 3


This is a more up to date answer:

import s3fs

s3 = s3fs.S3FileSystem(anon=False)

# Use 'w' for py3, 'wb' for py2
with s3.open('<bucket-name>/<filename>.csv','w') as f:
    df.to_csv(f)

The problem with StringIO is that it will eat away at your memory. With this method, you are streaming the file to s3, rather than converting it to string, then writing it into s3. Holding the pandas dataframe and its string copy in memory seems very inefficient.

If you are working on an EC2 instance, you can give it an IAM role to enable writing to s3, so you don’t need to pass in credentials directly. However, you can also connect to a bucket by passing credentials to the S3FileSystem() function. See the documentation: https://s3fs.readthedocs.io/en/latest/


Answer 4


If you pass None as the first argument to to_csv() the data will be returned as a string. From there it’s an easy step to upload that to S3 in one go.

It should also be possible to pass a StringIO object to to_csv(), but using a string will be easier.
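
A hedged sketch of that two-step approach (the bucket and key names are hypothetical):

import boto3

# to_csv(None) returns the CSV content as a string instead of writing a file
csv_string = df.to_csv(None)

s3 = boto3.client('s3')
s3.put_object(Bucket='my-bucket', Key='df.csv', Body=csv_string)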


Answer 5


You can also use the AWS Data Wrangler:

import awswrangler as wr
    
wr.s3.to_csv(
    df=df,
    path="s3://...",
)

Note that it will handle multipart upload for you to make the upload faster.


Answer 6


I found this can also be done using client, and not just resource.

from io import StringIO
import boto3
s3 = boto3.client("s3",\
                  region_name=region_name,\
                  aws_access_key_id=aws_access_key_id,\
                  aws_secret_access_key=aws_secret_access_key)
csv_buf = StringIO()
df.to_csv(csv_buf, header=True, index=False)
csv_buf.seek(0)
s3.put_object(Bucket=bucket, Body=csv_buf.getvalue(), Key='path/test.csv')

Answer 7


Since you are using boto3.client(), try:

import boto3
from io import StringIO #python3 
s3 = boto3.client('s3', aws_access_key_id='key', aws_secret_access_key='secret_key')
def copy_to_s3(client, df, bucket, filepath):
    csv_buf = StringIO()
    df.to_csv(csv_buf, header=True, index=False)
    csv_buf.seek(0)
    client.put_object(Bucket=bucket, Body=csv_buf.getvalue(), Key=filepath)
    print(f'Copy {df.shape[0]} rows to S3 Bucket {bucket} at {filepath}, Done!')

copy_to_s3(client=s3, df=df_to_upload, bucket='abc', filepath='def/test.csv')

Answer 8


I found a very simple solution that seems to work:

s3 = boto3.client("s3")

s3.put_object(
    Body=open("filename.csv").read(),
    Bucket="your-bucket",
    Key="your-key"
)

Hope that helps!


Answer 9


I read a csv with two columns from an s3 bucket, and put the content of the csv file into a pandas dataframe.

Example:

config.json

{
  "credential": {
    "access_key":"xxxxxx",
    "secret_key":"xxxxxx"
}
,
"s3":{
       "bucket":"mybucket",
       "key":"csv/user.csv"
   }
}

cls_config.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import json

class cls_config(object):

    def __init__(self,filename):

        self.filename = filename


    def getConfig(self):

        fileName = os.path.join(os.path.dirname(__file__), self.filename)
        with open(fileName) as f:
            config = json.load(f)
        return config

cls_pandas.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd
import io

class cls_pandas(object):

    def __init__(self):
        pass

    def read(self,stream):

        df = pd.read_csv(io.StringIO(stream), sep = ",")
        return df

cls_s3.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import boto3
import json

class cls_s3(object):

    def  __init__(self,access_key,secret_key):

        self.s3 = boto3.client('s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key)

    def getObject(self,bucket,key):

        read_file = self.s3.get_object(Bucket=bucket, Key=key)
        body = read_file['Body'].read().decode('utf-8')
        return body

test.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from cls_config import *
from cls_s3 import *
from cls_pandas import *

class test(object):

    def __init__(self):
        self.conf = cls_config('config.json')

    def process(self):

        conf = self.conf.getConfig()

        bucket = conf['s3']['bucket']
        key = conf['s3']['key']

        access_key = conf['credential']['access_key']
        secret_key = conf['credential']['secret_key']

        s3 = cls_s3(access_key,secret_key)
        ob = s3.getObject(bucket,key)

        pa = cls_pandas()
        df = pa.read(ob)

        print df

if __name__ == '__main__':
    test = test()
    test.process()

CSV new-line character seen in unquoted field error

Question: CSV new-line character seen in unquoted field error


The following code worked until today, when I imported from a Windows machine and got this error:

new-line character seen in unquoted field – do you need to open the file in universal-newline mode?

import csv

class CSV:


    def __init__(self, file=None):
        self.file = file

    def read_file(self):
        data = []
        file_read = csv.reader(self.file)
        for row in file_read:
            data.append(row)
        return data

    def get_row_count(self):
        return len(self.read_file())

    def get_column_count(self):
        new_data = self.read_file()
        return len(new_data[0])

    def get_data(self, rows=1):
        data = self.read_file()

        return data[:rows]

How can I fix this issue?

def upload_configurator(request, id=None):
    """
    A view that allows the user to configurator the uploaded CSV.
    """
    upload = Upload.objects.get(id=id)
    csvobject = CSV(upload.filepath)

    upload.num_records = csvobject.get_row_count()
    upload.num_columns = csvobject.get_column_count()
    upload.save()

    form = ConfiguratorForm()

    row_count = csvobject.get_row_count()
    colum_count = csvobject.get_column_count()
    first_row = csvobject.get_data(rows=1)
    first_two_rows = csvobject.get_data(rows=5)

Answer 0


It would be good to see the csv file itself, but this might work for you; give it a try. Replace:

file_read = csv.reader(self.file)

with:

file_read = csv.reader(self.file, dialect=csv.excel_tab)

Or, open a file with universal newline mode and pass it to csv.reader, like:

reader = csv.reader(open(self.file, 'rU'), dialect=csv.excel_tab)

Or, use splitlines(), like this:

def read_file(self):
    with open(self.file, 'r') as f:
        data = [row for row in csv.reader(f.read().splitlines())]
    return data

Answer 1


I realize this is an old post, but I ran into the same problem and don’t see the correct answer, so I will give it a try.

Python Error:

_csv.Error: new-line character seen in unquoted field

Caused by trying to read Macintosh (pre OS X formatted) CSV files. These are text files that use CR for end of line. If using MS Office make sure you select either plain CSV format or CSV (MS-DOS). Do not use CSV (Macintosh) as save-as type.

My preferred EOL version would be LF (Unix/Linux/Apple), but I don’t think MS Office provides the option to save in this format.
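
If re-saving from another tool isn’t an option, a quick normalization pass from Python itself would be (a sketch; the file names are placeholders):

# convert old-Mac CR line endings to LF before parsing
with open('mac_file.csv', 'rb') as f:
    data = f.read().replace(b'\r', b'\n')

with open('fixed_file.csv', 'wb') as f:
    f.write(data)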


Answer 2


For Mac OS X, save your CSV file in “Windows Comma Separated (.csv)” format.


Answer 3


If this happens to you on mac (as it did to me):

  1. Save the file as CSV (MS-DOS Comma-Separated)
  2. Run the following script

    with open(csv_filename, 'rU') as csvfile:
        csvreader = csv.reader(csvfile)
        for row in csvreader:
            print ', '.join(row)
    

Answer 4


Try running dos2unix on your Windows-imported files first.


Answer 5


This is an error that I faced. I had saved the .csv file in Mac OS X.

While saving, save it as “Windows Comma Separated Values (.csv)” which resolved the issue.


Answer 6


This worked for me on OSX.

import csv

# allow a string variable to be opened as a file
from io import StringIO

# library to map other strange (accented) characters back into UTF-8
from unidecode import unidecode

# cleanse the input file with Windows formatting into a plain UTF-8 string
with open(filename, 'rb') as fID:
    uncleansedBytes = fID.read()
    # decode the file using the correct encoding scheme
    # (probably this old windows one) 
    uncleansedText = uncleansedBytes.decode('Windows-1252')

    # replace carriage-returns with new-lines
    cleansedText = uncleansedText.replace('\r', '\n')

    # map any other non UTF-8 characters into UTF-8
    asciiText = unidecode(cleansedText)

# read each line of the csv file and store the rows as dicts,
# using the first line as field names for each dict
reader = csv.DictReader(StringIO(asciiText))
for line_entry in reader:
    pass  # do something with your read data

Answer 7


I know this has been answered for quite some time, but the existing answers did not solve my problem. I am using DictReader and StringIO for my csv reading due to some other complications. I was able to solve the problem more simply by replacing delimiters explicitly:

with urllib.request.urlopen(q) as response:
    raw_data = response.read()
    encoding = response.info().get_content_charset('utf8') 
    data = raw_data.decode(encoding)
    if '\r\n' not in data:
        # proably a windows delimited thing...try to update it
        data = data.replace('\r', '\r\n')

Might not be reasonable for enormous CSV files, but worked well for my use case.


Answer 8


Alternative and fast solution: I faced the same error. I reopened the “weird” csv file in GNUMERIC on my lubuntu machine and exported the file as a csv file. This corrected the issue.


How do I write data into csv format as string (not file)?

Question: How do I write data into csv format as string (not file)?


I want to cast data like [1,2,'a','He said "what do you mean?"'] to a CSV-formatted string.

Normally one would use csv.writer() for this, because it handles all the crazy edge cases (comma escaping, quote mark escaping, CSV dialects, etc.) The catch is that csv.writer() expects to output to a file object, not to a string.

My current solution is this somewhat hacky function:

def CSV_String_Writeline(data):
    class Dummy_Writer:
        def write(self,instring):
            self.outstring = instring.strip("\r\n")
    dw = Dummy_Writer()
    csv_w = csv.writer( dw )
    csv_w.writerow(data)
    return dw.outstring

Can anyone give a more elegant solution that still handles the edge cases well?

Edit: Here’s how I ended up doing it:

def csv2string(data):
    si = StringIO.StringIO()
    cw = csv.writer(si)
    cw.writerow(data)
    return si.getvalue().strip('\r\n')

Answer 0


You could use StringIO instead of your own Dummy_Writer:

This module implements a file-like class, StringIO, that reads and writes a string buffer (also known as memory files).

There is also cStringIO, which is a faster version of the StringIO class.
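
A minimal sketch of that approach for Python 3, where the equivalent class lives in the io module:

import csv
import io

def csv2string(data):
    si = io.StringIO()  # in-memory text buffer stands in for a file
    cw = csv.writer(si)
    cw.writerow(data)
    return si.getvalue().strip('\r\n')

print(csv2string([1, 2, 'a', 'He said "what do you mean?"']))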


Answer 1


In Python 3:

>>> import io
>>> import csv
>>> output = io.StringIO()
>>> csvdata = [1,2,'a','He said "what do you mean?"',"Whoa!\nNewlines!"]
>>> writer = csv.writer(output, quoting=csv.QUOTE_NONNUMERIC)
>>> writer.writerow(csvdata)
59
>>> output.getvalue()
'1,2,"a","He said ""what do you mean?""","Whoa!\nNewlines!"\r\n'

Some details need to be changed a bit for Python 2:

>>> output = io.BytesIO()
>>> writer = csv.writer(output)
>>> writer.writerow(csvdata)
57L
>>> output.getvalue()
'1,2,a,"He said ""what do you mean?""","Whoa!\nNewlines!"\r\n'

Answer 2


I found the answers, all in all, a bit confusing. For Python 2, this usage worked for me:

import csv, io

def csv2string(data):
    si = io.BytesIO()
    cw = csv.writer(si)
    cw.writerow(data)
    return si.getvalue().strip('\r\n')

data=[1,2,'a','He said "what do you mean?"']
print csv2string(data)

Answer 3


Since I use this quite a lot to stream results asynchronously from sanic back to the user as csv data, I wrote the following snippet for Python 3.

The snippet lets you reuse the same StringIO buffer over and over again.


import csv
from io import StringIO


class ArgsToCsv:
    def __init__(self, seperator=","):
        self.seperator = seperator
        self.buffer = StringIO()
        self.writer = csv.writer(self.buffer)

    def stringify(self, *args):
        self.writer.writerow(args)
        value = self.buffer.getvalue().strip("\r\n")
        self.buffer.seek(0)
        self.buffer.truncate(0)
        return value + "\n"

example:

csv_formatter = ArgsToCsv()

output += csv_formatter.stringify(
    10,
    """
    lol i have some pretty
    "freaky"
    strings right here \' yo!
    """,
    [10, 20, 30],
)

Check out further usage at the github gist: source and test


Answer 4

import csv
from StringIO import StringIO
with open('file.csv') as file:
    file = file.read()

stream = StringIO(file)

csv_file = csv.DictReader(stream)

Answer 5


Here’s the version that works for utf-8: csvline2string for just one line, without linebreaks at the end; csv2string for many lines, with linebreaks:

import csv, io

def csvline2string(one_line_of_data):
    si = io.BytesIO()  # byte buffer so utf-8 encoded strings can be written (Python 2)
    cw = csv.writer(si)
    cw.writerow(one_line_of_data)
    return si.getvalue().strip('\r\n')

def csv2string(data):
    si = io.BytesIO()  # byte buffer so utf-8 encoded strings can be written (Python 2)
    cw = csv.writer(si)
    for one_line_of_data in data:
        cw.writerow(one_line_of_data)
    return si.getvalue()

How do I write a header row with csv.DictWriter?

Question: How do I write a header row with csv.DictWriter?


Assume I have a csv.DictReader object and I want to write it out as a CSV file. How can I do this?

I know that I can write the rows of data like this:

dr = csv.DictReader(open(f), delimiter='\t')
# process my dr object
# ...
# write out object
output = csv.DictWriter(open(f2, 'w'), delimiter='\t')
for item in dr:
    output.writerow(item)

But how can I include the fieldnames?


Answer 0

Edit:
In 2.7 / 3.2 there is a new writeheader() method. Also, John Machin’s answer provides a simpler method of writing the header row.
Simple example of using the writeheader() method now available in 2.7 / 3.2:

from collections import OrderedDict
ordered_fieldnames = OrderedDict([('field1',None),('field2',None)])
with open(outfile,'wb') as fou:
    dw = csv.DictWriter(fou, delimiter='\t', fieldnames=ordered_fieldnames)
    dw.writeheader()
    # continue on to write data

Instantiating DictWriter requires a fieldnames argument.
From the documentation:

The fieldnames parameter identifies the order in which values in the dictionary passed to the writerow() method are written to the csvfile.

Put another way: the fieldnames argument is required because Python dicts are inherently unordered.
Below is an example of how you’d write the header and data to a file.
Note: the with statement was added in 2.6. If using 2.5: from __future__ import with_statement

with open(infile,'rb') as fin:
    dr = csv.DictReader(fin, delimiter='\t')

# dr.fieldnames contains values from first row of `f`.
with open(outfile,'wb') as fou:
    dw = csv.DictWriter(fou, delimiter='\t', fieldnames=dr.fieldnames)
    headers = {} 
    for n in dw.fieldnames:
        headers[n] = n
    dw.writerow(headers)
    for row in dr:
        dw.writerow(row)

As @FM mentions in a comment, you can condense header-writing to a one-liner, e.g.:

with open(outfile,'wb') as fou:
    dw = csv.DictWriter(fou, delimiter='\t', fieldnames=dr.fieldnames)
    dw.writerow(dict((fn,fn) for fn in dr.fieldnames))
    for row in dr:
        dw.writerow(row)

Answer 1

A few options:

(1) Laboriously make an identity-mapping (i.e. do-nothing) dict out of your fieldnames so that csv.DictWriter can convert it back to a list and pass it to a csv.writer instance.

(2) The documentation mentions “the underlying writer instance” … so just use it (example at the end).

dw.writer.writerow(dw.fieldnames)

(3) Avoid the csv.DictWriter overhead and do it yourself with csv.writer

Writing data:

w.writerow([d[k] for k in fieldnames])

or

w.writerow([d.get(k, restval) for k in fieldnames])
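
Putting option (3) together: a minimal self-contained sketch (the field names and rows here are hypothetical; 'Huh?' plays the role of restval):

import csv

fieldnames = ['foo', 'bar', 'zot']                      # hypothetical field names
rows = [{'foo': 'oof'}, {'bar': 'rab', 'zot': 'toz'}]   # hypothetical data

with open('csvtest.csv', 'wb') as f:  # Python 2; on 3.x use open(..., 'w', newline='')
    w = csv.writer(f)
    w.writerow(fieldnames)            # the header row is just the list itself
    for d in rows:
        w.writerow([d.get(k, 'Huh?') for k in fieldnames])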

Instead of the extrasaction "functionality", I'd prefer to code it myself; that way you can report ALL "extras" with the keys and values, not just the first extra key. A real nuisance with DictWriter is that if you've verified the keys yourself as each dict was being built, you need to remember to use extrasaction='ignore'; otherwise it's going to SLOWLY (fieldnames is a list) repeat the check:

wrong_fields = [k for k in rowdict if k not in self.fieldnames]

============

>>> f = open('csvtest.csv', 'wb')
>>> import csv
>>> fns = 'foo bar zot'.split()
>>> dw = csv.DictWriter(f, fns, restval='Huh?')
# dw.writefieldnames(fns) -- no such animal
>>> dw.writerow(fns) # no such luck, it can't imagine what to do with a list
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python26\lib\csv.py", line 144, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
  File "C:\python26\lib\csv.py", line 141, in _dict_to_list
    return [rowdict.get(key, self.restval) for key in self.fieldnames]
AttributeError: 'list' object has no attribute 'get'
>>> dir(dw)
['__doc__', '__init__', '__module__', '_dict_to_list', 'extrasaction', 'fieldnam
es', 'restval', 'writer', 'writerow', 'writerows']
# eureka
>>> dw.writer.writerow(dw.fieldnames)
>>> dw.writerow({'foo':'oof'})
>>> f.close()
>>> open('csvtest.csv', 'rb').read()
'foo,bar,zot\r\noof,Huh?,Huh?\r\n'
>>>

Answer 2

Another way to do this would be to add the following line before writing the data rows to your output:

output.writerow(dict(zip(dr.fieldnames, dr.fieldnames)))

zip returns a sequence of pairs, each containing the same value twice; this can be used to initialize a dictionary mapping each field name to itself.
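
For instance, with hypothetical field names:

>>> fieldnames = ['ID', 'Name']
>>> dict(zip(fieldnames, fieldnames))
{'ID': 'ID', 'Name': 'Name'}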


How can I ignore the first line of data when processing CSV data?

Question: How can I ignore the first line of data when processing CSV data?

I am asking Python to print the minimum number from a column of CSV data, but the top row is the column number, and I don’t want Python to take the top row into account. How can I make sure Python ignores the first line?

This is the code so far:

import csv

with open('all16.csv', 'rb') as inf:
    incsv = csv.reader(inf)
    column = 1                
    datatype = float          
    data = (datatype(column) for row in incsv)   
    least_value = min(data)

print least_value

Could you also explain what you are doing, not just give the code? I am very very new to Python and would like to make sure I understand everything.


Answer 0

You could use an instance of the csv module’s Sniffer class to deduce the format of a CSV file and detect whether a header row is present, together with the built-in next() function to skip over the first row only when necessary:

import csv

with open('all16.csv', 'r', newline='') as file:
    has_header = csv.Sniffer().has_header(file.read(1024))
    file.seek(0)  # Rewind.
    reader = csv.reader(file)
    if has_header:
        next(reader)  # Skip header row.
    column = 1
    datatype = float
    data = (datatype(row[column]) for row in reader)
    least_value = min(data)

print(least_value)

Since datatype and column are hardcoded in your example, it would be slightly faster to process the row like this:

    data = (float(row[1]) for row in reader)

Note: the code above is for Python 3.x. For Python 2.x use the following line to open the file instead of what is shown:

with open('all16.csv', 'rb') as file:

Answer 1

To skip the first line just call:

next(inf)

Files in Python are iterators over lines.
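
Applied to the question's file, a minimal sketch (note it also indexes the row before converting, which the original generator expression left out):

import csv

with open('all16.csv') as inf:
    next(inf)  # consume the header line from the underlying file iterator
    incsv = csv.reader(inf)
    least_value = min(float(row[1]) for row in incsv)

print(least_value)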


Answer 2

In a similar use case I had to skip annoying lines before the line with my actual column names. This solution worked nicely: skip the unwanted line first, then hand the file object to csv.DictReader.

import csv

with open('all16.csv') as tmp:
    # Skip first line (if any)
    next(tmp, None)

    # {line_num: row}
    data = dict(enumerate(csv.DictReader(tmp)))

Answer 3

python cookbook借来的,
更简洁的模板代码可能如下所示:

import csv
with open('stocks.csv') as f:
    f_csv = csv.reader(f) 
    headers = next(f_csv) 
    for row in f_csv:
        # Process row ...

Borrowed from the Python Cookbook, a more concise template might look like this:

import csv
with open('stocks.csv') as f:
    f_csv = csv.reader(f) 
    headers = next(f_csv) 
    for row in f_csv:
        # Process row ...

Answer 4

You would normally use next(incsv), which advances the iterator one row, to skip the header. Another option (say you wanted to skip 30 rows) would be:

from itertools import islice
for row in islice(incsv, 30, None):
    # process

Answer 5

Use csv.DictReader instead of csv.reader. If the fieldnames parameter is omitted, the values in the first row of the csvfile will be used as the field names. You would then be able to access field values using row["1"], etc.
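
A minimal sketch applied to the question (assuming the top row literally contains column numbers, so the header cell for the wanted column reads "1"):

import csv

with open('all16.csv') as f:
    reader = csv.DictReader(f)  # consumes the first row as field names
    least_value = min(float(row['1']) for row in reader)

print(least_value)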


Answer 6

The new ‘pandas’ package might be more relevant than ‘csv’. The code below reads a CSV file, by default interpreting the first line as the column header, and finds the minimum of each column.

import pandas as pd

data = pd.read_csv('all16.csv')
data.min()
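
To get the minimum of a single column rather than all of them, one might select it first (a sketch, assuming the second column is the one wanted):

import pandas as pd

data = pd.read_csv('all16.csv')
print(data.iloc[:, 1].min())  # minimum of the second column, selected by position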

Answer 7

Well, my mini wrapper library would do the job as well.

>>> import pyexcel as pe
>>> data = pe.load('all16.csv', name_columns_by_row=0)
>>> min(data.column[1])

Meanwhile, if you know the header name of column index one, for example “Column 1”, you can do this instead:

>>> min(data.column["Column 1"])

Answer 8

For me the easiest way to go is to use range.

import csv

with open('files/filename.csv') as f:
    reader = csv.reader(f)
    fulllist = list(reader)

# Start with the data, skipping the header row at index 0
for item in range(1, len(fulllist)):
    # Print each row using "item" as the index value
    print(fulllist[item])

Answer 9

Because this is related to something I was doing, I’ll share here.

What if we’re not sure whether there’s a header, and we also don’t feel like importing Sniffer and other things?

If your task is basic, such as printing or appending to a list or array, you could just use an if statement:

import csv

array = []  # collect all rows here

# Let's say there are 4 columns
with open('file.csv') as csvfile:
    csvreader = csv.reader(csvfile)
    # Read the first line
    first_line = next(csvreader)
    # My headers were just text. You can use any suitable conditional here
    if len(first_line) == 4:
        array.append(first_line)
    # Now we'll just iterate over everything else as usual:
    for row in csvreader:
        array.append(row)

Answer 10

The documentation for the Python 3 CSV module provides this example:

import csv

with open('example.csv', newline='') as csvfile:
    sample = csvfile.read(1024)
    dialect = csv.Sniffer().sniff(sample)
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    # ... process CSV file contents here ...

The Sniffer will try to auto-detect many things about the CSV file. You need to explicitly call its has_header() method, passing it the same sample text, to determine whether the file has a header line. If it does, then skip the first row when iterating the CSV rows. You can do it like this:

if csv.Sniffer().has_header(sample):
    for header_row in reader:
        break  # consume just the header row
for data_row in reader:
    pass  # do something with the row

Answer 11

I would use tail to get rid of the unwanted first line:

tail -n +2 $INFIL | whatever_script.py 
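
The receiving script then just reads CSV rows from standard input; a minimal sketch of such a script (whatever_script.py above is a placeholder name):

import csv
import sys

# The header line has already been removed by tail, so every row is data.
for row in csv.reader(sys.stdin):
    print(row)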

Answer 12

Just add [1:].

Example below:

data = pd.read_csv("/Users/xyz/Desktop/xyxData/xyz.csv", sep=',', header=None)[1:]

That works for me in IPython.


Answer 13

Python 3.x

Handles UTF-8 BOM + header

It was quite frustrating that the csv module could not easily get the header; there is also a bug with the UTF-8 BOM (the first character in the file). This works for me using only the csv module:

import csv

def read_csv(csv_path, delimiter):
    with open(csv_path, newline='', encoding='utf-8') as f:
        # https://bugs.python.org/issue7185
        # Remove the UTF8 BOM (the first character in the file).
        txt = f.read()[1:]

    # Separate the header line from the data lines.
    header = txt.splitlines()[:1]
    lines = txt.splitlines()[1:]

    # Convert to a list of rows.
    csv_rows = list(csv.reader(lines, delimiter=delimiter))

    for row in csv_rows:
        value = row[INDEX_HERE]
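
An alternative sketch is to open the file with the 'utf-8-sig' codec, which strips a leading BOM automatically instead of slicing it off by hand:

import csv

with open('data.csv', newline='', encoding='utf-8-sig') as f:  # hypothetical file name
    reader = csv.reader(f)
    header = next(reader)  # consume the header row
    for row in reader:
        print(row)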

Answer 14

I would convert csvreader to list, then pop the first element

import csv

with open(fileName, 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    data = list(csvreader)  # Convert to list
    data.pop(0)             # Removes the first row

    for row in data:
        print(row)

Answer 15

Python 2.x

csvreader.next()

Return the next row of the reader’s iterable object as a list, parsed according to the current dialect.

csv_data = csv.reader(open('sample.csv'))
csv_data.next() # skip first row
for row in csv_data:
    print(row) # should print second row

Python 3.x

csvreader.__next__()

Return the next row of the reader’s iterable object as a list (if the object was returned from reader()) or a dict (if it is a DictReader instance), parsed according to the current dialect. Usually you should call this as next(reader).

csv_data = csv.reader(open('sample.csv'))
csv_data.__next__() # skip first row
for row in csv_data:
    print(row) # should print second row

Why is csvwriter.writerow() putting a comma after each character?

Question: Why is csvwriter.writerow() putting a comma after each character?

This code takes the URL, appends each of the /names to it, opens the resulting page, and writes the matched string to test1.csv:

import urllib2
import re
import csv

url = ("http://www.example.com")
bios = [u'/name1', u'/name2', u'/name3']
csvwriter = csv.writer(open("/test1.csv", "a"))

for l in bios:
    OpenThisLink = url + l
    response = urllib2.urlopen(OpenThisLink)
    html = response.read()
    item = re.search('(JD)(.*?)(\d+)', html)
    if item:
        JD = item.group()
        csvwriter.writerow(JD)
    else:
        NoJD = "NoJD"
        csvwriter.writerow(NoJD)

But I get this result:

J,D,",", ,C,o,l,u,m,b,i,a, ,L,a,w, ,S,c,h,o,o,l,....

If I change the string to (“JD”, “Columbia Law School” ….) then I get

JD, Columbia Law School...)

I couldn’t find in the documentation how to specify the delimeter.

If I try to use delimeter I get this error:

TypeError: 'delimeter' is an invalid keyword argument for this function

Thanks for the help.


Answer 0

It expects a sequence (e.g. a list or tuple) of strings. You’re giving it a single string. A string happens to be a sequence of strings too, but it’s a sequence of 1-character strings, which isn’t what you want.

If you just want one string per row you could do something like this:

csvwriter.writerow([JD])

This wraps JD (a string) with a list.
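
Applied to the question's loop, a minimal sketch of the fix:

if item:
    JD = item.group()
    csvwriter.writerow([JD])      # wrap the string in a list: one cell per row
else:
    csvwriter.writerow(["NoJD"])  # same fix for the fallback value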


Answer 1

The csv.writer class takes an iterable as its argument to writerow; since strings in Python are iterable by character, they are an acceptable argument to writerow, but you get the output above.

To correct this, you could split the value based on whitespace (I’m assuming that’s what you want)

csvwriter.writerow(JD.split())
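
For instance, with a hypothetical matched string:

>>> JD = 'JD Columbia Law School 2003'
>>> JD.split()
['JD', 'Columbia', 'Law', 'School', '2003']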

Answer 2

This happens because, when the group() method of a MatchObject instance returns only a single value, it returns it as a string. When there are multiple values, they are returned as a tuple of strings.

If you are writing a row, I guess csv.writer iterates over the object you pass to it. If you pass a single string (which is an iterable), it iterates over its characters, producing the result you are observing. If you pass a tuple of strings, it gets an actual string, not a single character, on every iteration.
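
A quick demonstration with a hypothetical page snippet:

>>> import re
>>> m = re.search('(JD)(.*?)(\d+)', 'JD, Columbia Law School 2003')
>>> m.group()    # the whole match, as a single string
'JD, Columbia Law School 2003'
>>> m.groups()   # the captured groups, as a tuple of strings
('JD', ', Columbia Law School ', '2003')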