Tag archive: csv

pandas read_csv and filtering columns with usecols

Question: pandas read_csv and filtering columns with usecols

I have a csv file which isn’t coming in correctly with pandas.read_csv when I filter the columns with usecols and use multiple indexes.

import pandas as pd
csv = r"""dummy,date,loc,x
   bar,20090101,a,1
   bar,20090102,a,3
   bar,20090103,a,5
   bar,20090101,b,1
   bar,20090102,b,3
   bar,20090103,b,5"""

f = open('foo.csv', 'w')
f.write(csv)
f.close()

df1 = pd.read_csv('foo.csv',
        header=0,
        names=["dummy", "date", "loc", "x"], 
        index_col=["date", "loc"], 
        usecols=["dummy", "date", "loc", "x"],
        parse_dates=["date"])
print df1

# Ignore the dummy columns
df2 = pd.read_csv('foo.csv', 
        index_col=["date", "loc"], 
        usecols=["date", "loc", "x"], # <----------- Changed
        parse_dates=["date"],
        header=0,
        names=["dummy", "date", "loc", "x"])
print df2

I expect that df1 and df2 should be the same except for the missing dummy column, but the columns come in mislabeled. Also the date is getting parsed as a date.

In [118]: %run test.py
               dummy  x
date       loc
2009-01-01 a     bar  1
2009-01-02 a     bar  3
2009-01-03 a     bar  5
2009-01-01 b     bar  1
2009-01-02 b     bar  3
2009-01-03 b     bar  5
              date
date loc
a    1    20090101
     3    20090102
     5    20090103
b    1    20090101
     3    20090102
     5    20090103

Using column numbers instead of names give me the same problem. I can workaround the issue by dropping the dummy column after the read_csv step, but I’m trying to understand what is going wrong. I’m using pandas 0.10.1.

edit: fixed bad header usage.


Answer 0

The answer by @chip completely misses the point of two keyword arguments.

  • names is only necessary when there is no header and you want to specify other arguments using column names rather than integer indices.
  • usecols is supposed to provide a filter before reading the whole DataFrame into memory; if used properly, there should never be a need to delete columns after reading.

This solution corrects those oddities:

import pandas as pd
from StringIO import StringIO

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df = pd.read_csv(StringIO(csv),
        header=0,
        index_col=["date", "loc"], 
        usecols=["date", "loc", "x"],
        parse_dates=["date"])

Which gives us:

                x
date       loc
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5

Answer 1

This code achieves what you want, though it's weird and certainly buggy:

I observed that it works when:

a) you specify index_col relative to the number of columns you really use, so it's three columns in this example, not four (you drop dummy and start counting from there onwards)

b) same for parse_dates

c) not so for usecols ;) for obvious reasons

d) here I adapted the names to mirror this behaviour

import pandas as pd
from StringIO import StringIO

csv = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5
"""

df = pd.read_csv(StringIO(csv),
        index_col=[0,1],
        usecols=[1,2,3], 
        parse_dates=[0],
        header=0,
        names=["date", "loc", "", "x"])

print df

which prints

                x
date       loc   
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5

Answer 2

If your csv file contains extra data, columns can be deleted from the DataFrame after import.

import pandas as pd
from StringIO import StringIO

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df = pd.read_csv(StringIO(csv),
        index_col=["date", "loc"], 
        usecols=["dummy", "date", "loc", "x"],
        parse_dates=["date"],
        header=0,
        names=["dummy", "date", "loc", "x"])
del df['dummy']

Which gives us:

                x
date       loc
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5

Answer 3

You just have to add the index_col=False parameter (and drop the second index_col, since the same keyword cannot be passed twice):

df1 = pd.read_csv('foo.csv',
     header=0,
     index_col=False,
     names=["dummy", "date", "loc", "x"], 
     usecols=["dummy", "date", "loc", "x"],
     parse_dates=["date"])
print df1

Answer 4

Import csv first and use csv.DictReader; it's easy to process…
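
For instance, here is a minimal sketch of that idea (my own illustration, reusing the foo.csv layout from the question): csv.DictReader maps each row to a dict keyed by the header names, so unwanted columns like dummy can simply be ignored.

import csv

# Each row becomes a dict keyed by the header names,
# so we can pick out only the columns we care about.
with open('foo.csv') as f:
    for row in csv.DictReader(f):
        print(row['date'], row['loc'], row['x'])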


Skipping rows when importing a CSV into pandas

Question: Skipping rows when importing a CSV into pandas

I’m trying to import a .csv file using pandas.read_csv(), however I don’t want to import the 2nd row of the data file (the row with index = 1 for 0-indexing).

I can’t see how not to import it because the arguments used with the command seem ambiguous:

From the pandas website:

skiprows : list-like or integer

Row numbers to skip (0-indexed) or number of rows to skip (int) at the start of the file.

If I put skiprows=1 in the arguments, how does it know whether to skip the first row or skip the row with index 1?


Answer 0

You can try it yourself:

>>> import pandas as pd
>>> from StringIO import StringIO
>>> s = """1, 2
... 3, 4
... 5, 6"""
>>> pd.read_csv(StringIO(s), skiprows=[1], header=None)
   0  1
0  1  2
1  5  6
>>> pd.read_csv(StringIO(s), skiprows=1, header=None)
   0  1
0  3  4
1  5  6

Answer 1

I don't have the reputation to comment yet, but I want to add to alko's answer for further reference.

From the docs:

skiprows: A collection of numbers for rows in the file to skip. Can also be an integer to skip the first n rows


Answer 2

I got the same issue with skiprows while reading a csv file. I was doing skip_rows=1, and that will not work; the parameter is named skiprows.

A simple example gives an idea of how to use skiprows while reading a csv file.

import pandas as pd

# skiprows=1 will skip the first line and start reading from the second line
df = pd.read_csv('my_csv_file.csv', skiprows=1)

# print the data frame
df

Answer 3

All of these answers miss one important point: the n'th line is the n'th line in the file, not the n'th row in the dataset. I have a situation where I download some antiquated stream gauge data from the USGS. The head of the dataset is commented with '#', the first line after that holds the labels, next comes a line that describes the date types, and last the data itself. I never know how many comment lines there are, but I know what the first couple of rows are. Example:

# ----------------------------- WARNING ----------------------------------
# Some of the data that you have obtained from this U.S. Geological Survey database
# may not have received Director's approval. …
agency_cd  site_no   datetime          tz_cd  139719_00065  139719_00065_cd
5s         15s       20d               6s     14n           10s
USGS       08041780  2018-05-06 00:00  CDT    1.98          A

It would be nice if there was a way to automatically skip the n’th row as well as the n’th line.

As a note, I was able to fix my issue with:

import pandas as pd
ds = pd.read_csv(fname, comment='#', sep='\t', header=0, parse_dates=True)
ds.drop(0, inplace=True)

Answer 4

skiprows=[1] will skip the second line, not the first one.


Answer 5

Also be sure that your file is actually a CSV file. For example, if you had an .xls file, and simply changed the file extension to .csv, the file won’t import and will give the error above. To check to see if this is your problem open the file in excel and it will likely say:

“The file format and extension of ‘Filename.csv’ don’t match. The file could be corrupted or unsafe. Unless you trust its source, don’t open it. Do you want to open it anyway?”

To fix the file: open the file in Excel, click "Save As", choose CSV as the file format to save as (use .csv), then replace the existing file.

This was my problem, and fixed the error for me.
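
As a quick programmatic check, here is a minimal sketch (my own addition; the filename is hypothetical) that inspects the file's first bytes, since real Excel files start with known magic numbers while a CSV is plain text:

def looks_like_excel(path):
    # .xls files start with the OLE2 magic bytes, .xlsx files with the ZIP magic 'PK'
    with open(path, 'rb') as f:
        head = f.read(8)
    return head.startswith(b'\xd0\xcf\x11\xe0') or head.startswith(b'PK')

print(looks_like_excel('Filename.csv'))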


Python pandas: How do I read only the first n rows of a CSV file?

Question: Python pandas: How do I read only the first n rows of a CSV file?

I have a very large data set and I can’t afford to read the entire data set in. So, I’m thinking of reading only one chunk of it to train but I have no idea how to do it. Any thought will be appreciated.


Answer 0

If you only want to read the first 999,999 (non-header) rows:

read_csv(..., nrows=999999)

If you only want to read rows 1,000,000 … 1,999,999

read_csv(..., skiprows=1000000, nrows=999999)

nrows : int, default None Number of rows of file to read. Useful for reading pieces of large files

skiprows : list-like or integer Row numbers to skip (0-indexed) or number of rows to skip (int) at the start of the file

and for large files, you’ll probably also want to use chunksize:

chunksize : int, default None Return TextFileReader object for iteration

pandas.io.parsers.read_csv documentation
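
For illustration, a minimal sketch of chunked reading (my own addition; file name and chunk size are hypothetical):

import pandas as pd

# Read 100,000 rows at a time instead of loading the whole file at once
for chunk in pd.read_csv('very_large.csv', chunksize=100000):
    print(len(chunk))  # process each DataFrame chunk here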


Loading a CSV file with Spark

Question: Loading a CSV file with Spark

I’m new to Spark and I’m trying to read CSV data from a file with Spark. Here’s what I am doing :

sc.textFile('file.csv')
    .map(lambda line: (line.split(',')[0], line.split(',')[1]))
    .collect()

I would expect this call to give me a list of the two first columns of my file but I’m getting this error :

File "<ipython-input-60-73ea98550983>", line 1, in <lambda>
IndexError: list index out of range

although my CSV file has more than one column.


Answer 0

Are you sure that all the lines have at least 2 columns? Can you try something like this, just to check?

sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda line: len(line)>1) \
    .map(lambda line: (line[0],line[1])) \
    .collect()

Alternatively, you could print the culprit (if any):

sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda line: len(line)<=1) \
    .collect()

Answer 1

Spark 2.0.0+

You can use built-in csv data source directly:

spark.read.csv(
    "some_input_file.csv", header=True, mode="DROPMALFORMED", schema=schema
)

or

(spark.read
    .schema(schema)
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .csv("some_input_file.csv"))

without including any external dependencies.

Spark < 2.0.0:

Instead of manual parsing, which is far from trivial in a general case, I would recommend spark-csv:

Make sure that Spark CSV is included in the path (--packages, --jars, --driver-class-path)

And load your data as follows:

df = (sqlContext
    .read.format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferschema", "true")
    .option("mode", "DROPMALFORMED")
    .load("some_input_file.csv"))

It can handle loading, schema inference, dropping malformed lines and doesn’t require passing data from Python to the JVM.

Note:

If you know the schema, it is better to avoid schema inference and pass it to DataFrameReader. Assuming you have three columns – integer, double and string:

from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType

schema = StructType([
    StructField("A", IntegerType()),
    StructField("B", DoubleType()),
    StructField("C", StringType())
])

(sqlContext
    .read
    .format("com.databricks.spark.csv")
    .schema(schema)
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .load("some_input_file.csv"))

Answer 2

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.csv("/home/stp/test1.csv",header=True,sep="|")

print(df.collect())

Answer 3

And yet another option consists in reading the CSV file using Pandas and then importing the Pandas DataFrame into Spark.

For example:

from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd

sc = SparkContext('local','example')  # if using locally
sql_sc = SQLContext(sc)

pandas_df = pd.read_csv('file.csv')  # assuming the file contains a header
# pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2']) # if no header
s_df = sql_sc.createDataFrame(pandas_df)

Answer 4

Simply splitting by comma will also split commas that are within fields (e.g. a,b,"1,2,3",c), so it’s not recommended. zero323’s answer is good if you want to use the DataFrames API, but if you want to stick to base Spark, you can parse csvs in base Python with the csv module:

# works for both python 2 and 3
import csv
rdd = sc.textFile("file.csv")
rdd = rdd.mapPartitions(lambda x: csv.reader(x))

EDIT: As @muon mentioned in the comments, this will treat the header like any other row so you’ll need to extract it manually. For example, header = rdd.first(); rdd = rdd.filter(lambda x: x != header) (make sure not to modify header before the filter evaluates). But at this point, you’re probably better off using a built-in csv parser.
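
A minimal sketch of that manual header extraction (my own illustration, assuming a SparkContext named sc as in the question):

import csv

rdd = sc.textFile("file.csv").mapPartitions(lambda x: csv.reader(x))
header = rdd.first()                         # the header row, parsed like any other
rdd = rdd.filter(lambda row: row != header)  # keep only the data rows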


Answer 5

This is in PYSPARK

path="Your file path with file name"

df=spark.read.format("csv").option("header","true").option("inferSchema","true").load(path)

Then you can check

df.show(5)
df.count()

Answer 6

If you want to load csv as a dataframe then you can do the following:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('sampleFile.csv') # this is your csv file

It worked fine for me.


Answer 7

This is in line with what JP Mercier initially suggested about using Pandas, but with a major modification: if you read data into Pandas in chunks, it should be more malleable. Meaning that you can parse a much larger file than Pandas can actually handle as a single piece, and pass it to Spark in smaller pieces. (This also answers the comment about why one would want to use Spark if they can load everything into Pandas anyway.)

from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd

sc = SparkContext('local','example')  # if using locally
sql_sc = SQLContext(sc)

Spark_Full = sc.emptyRDD()
chunk_100k = pd.read_csv("Your_Data_File.csv", chunksize=100000)
# if you have headers in your csv file:
headers = list(pd.read_csv("Your_Data_File.csv", nrows=0).columns)

for chunky in chunk_100k:
    Spark_Full +=  sc.parallelize(chunky.values.tolist())

YourSparkDataFrame = Spark_Full.toDF(headers)
# if you do not have headers, leave empty instead:
# YourSparkDataFrame = Spark_Full.toDF()
YourSparkDataFrame.show()

Answer 8

Now, there’s also another option for any general csv file: https://github.com/seahboonsiew/pyspark-csv as follows:

Assume we have the following context

sc = SparkContext
sqlCtx = SQLContext or HiveContext

First, distribute pyspark-csv.py to executors using SparkContext

import pyspark_csv as pycsv
sc.addPyFile('pyspark_csv.py')

Read csv data via SparkContext and convert it to DataFrame

plaintext_rdd = sc.textFile('hdfs://x.x.x.x/blah.csv')
dataframe = pycsv.csvToDataFrame(sqlCtx, plaintext_rdd)

Answer 9

If your csv data happens to not contain newlines in any of the fields, you can load your data with textFile() and parse it

import csv
import StringIO

def loadRecord(line):
    input = StringIO.StringIO(line)
    reader = csv.DictReader(input, fieldnames=["name1", "name2"])
    return reader.next()

input = sc.textFile(inputFile).map(loadRecord)

Answer 10

This error may arise if one or more rows in the dataset have fewer or more columns than 2.

I am also new to Pyspark and trying to read a CSV file. The following code worked for me:

In this code I am using a dataset from Kaggle; the link is: https://www.kaggle.com/carrie1/ecommerce-data

1. Without mentioning the schema:

from pyspark.sql import SparkSession  
scSpark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example: Reading CSV file without mentioning schema") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

sdfData = scSpark.read.csv("data.csv", header=True, sep=",")
sdfData.show()

Now check the columns: sdfData.columns

Output will be:

['InvoiceNo', 'StockCode','Description','Quantity', 'InvoiceDate', 'CustomerID', 'Country']

Check the datatype for each column:

sdfData.schema
StructType(List(StructField(InvoiceNo,StringType,true),StructField(StockCode,StringType,true),StructField(Description,StringType,true),StructField(Quantity,StringType,true),StructField(InvoiceDate,StringType,true),StructField(UnitPrice,StringType,true),StructField(CustomerID,StringType,true),StructField(Country,StringType,true)))

This will give the data frame with all of the columns having datatype StringType.

2. With schema: If you know the schema, or want to change the datatype of any column in the above table, then use this (let's say I have the following columns and want each of them in a particular data type):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType

schema = StructType([
    StructField("InvoiceNo", IntegerType()),
    StructField("StockCode", StringType()),
    StructField("Description", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("InvoiceDate", StringType()),
    StructField("CustomerID", DoubleType()),
    StructField("Country", StringType())
])

scSpark = SparkSession \
    .builder \
    .appName("Python Spark SQL example: Reading CSV file with schema") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

sdfData = scSpark.read.csv("data.csv", header=True, sep=",", schema=schema)

Now check the schema for the datatype of each column:

sdfData.schema

StructType(List(StructField(InvoiceNo,IntegerType,true),StructField(StockCode,StringType,true),StructField(Description,StringType,true),StructField(Quantity,IntegerType,true),StructField(InvoiceDate,StringType,true),StructField(CustomerID,DoubleType,true),StructField(Country,StringType,true)))

Edit: we can also use the following line of code, without mentioning the schema explicitly:

sdfData = scSpark.read.csv("data.csv", header=True, inferSchema = True)
sdfData.schema

The output is:

StructType(List(StructField(InvoiceNo,StringType,true),StructField(StockCode,StringType,true),StructField(Description,StringType,true),StructField(Quantity,IntegerType,true),StructField(InvoiceDate,StringType,true),StructField(UnitPrice,DoubleType,true),StructField(CustomerID,IntegerType,true),StructField(Country,StringType,true)))

The output will look like this:

sdfData.show()

+---------+---------+--------------------+--------+--------------+----------+-------+
|InvoiceNo|StockCode|         Description|Quantity|   InvoiceDate|CustomerID|Country|
+---------+---------+--------------------+--------+--------------+----------+-------+
|   536365|   85123A|WHITE HANGING HEA...|       6|12/1/2010 8:26|      2.55|  17850|
|   536365|    71053| WHITE METAL LANTERN|       6|12/1/2010 8:26|      3.39|  17850|
|   536365|   84406B|CREAM CUPID HEART...|       8|12/1/2010 8:26|      2.75|  17850|
|   536365|   84029G|KNITTED UNION FLA...|       6|12/1/2010 8:26|      3.39|  17850|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|12/1/2010 8:26|      3.39|  17850|
|   536365|    22752|SET 7 BABUSHKA NE...|       2|12/1/2010 8:26|      7.65|  17850|
|   536365|    21730|GLASS STAR FROSTE...|       6|12/1/2010 8:26|      4.25|  17850|
|   536366|    22633|HAND WARMER UNION...|       6|12/1/2010 8:28|      1.85|  17850|
|   536366|    22632|HAND WARMER RED P...|       6|12/1/2010 8:28|      1.85|  17850|
|   536367|    84879|ASSORTED COLOUR B...|      32|12/1/2010 8:34|      1.69|  13047|
|   536367|    22745|POPPY'S PLAYHOUSE...|       6|12/1/2010 8:34|       2.1|  13047|
|   536367|    22748|POPPY'S PLAYHOUSE...|       6|12/1/2010 8:34|       2.1|  13047|
|   536367|    22749|FELTCRAFT PRINCES...|       8|12/1/2010 8:34|      3.75|  13047|
|   536367|    22310|IVORY KNITTED MUG...|       6|12/1/2010 8:34|      1.65|  13047|
|   536367|    84969|BOX OF 6 ASSORTED...|       6|12/1/2010 8:34|      4.25|  13047|
|   536367|    22623|BOX OF VINTAGE JI...|       3|12/1/2010 8:34|      4.95|  13047|
|   536367|    22622|BOX OF VINTAGE AL...|       2|12/1/2010 8:34|      9.95|  13047|
|   536367|    21754|HOME BUILDING BLO...|       3|12/1/2010 8:34|      5.95|  13047|
|   536367|    21755|LOVE BUILDING BLO...|       3|12/1/2010 8:34|      5.95|  13047|
|   536367|    21777|RECIPE BOX WITH M...|       4|12/1/2010 8:34|      7.95|  13047|
+---------+---------+--------------------+--------+--------------+----------+-------+
only showing top 20 rows

Answer 11

When using spark.read.csv, I find that using the options escape='"' and multiLine=True provides the most consistent handling of the CSV standard, and in my experience it works best with CSV files exported from Google Sheets.

That is,

#set inferSchema=False to read everything as string
df = spark.read.csv("myData.csv", escape='"', multiLine=True,
     inferSchema=False, header=True)

ValueError: I/O operation on closed file

Question: ValueError: I/O operation on closed file

import csv    

with open('v.csv', 'w') as csvfile:
    cwriter = csv.writer(csvfile, delimiter=' ', quotechar='|', quoting=csv.QUOTE_MINIMAL)

for w, c in p.items():
    cwriter.writerow(w + c)

Here, p is a dictionary, and w and c are both strings.

When I try to write to the file it reports the error:

ValueError: I/O operation on closed file.

Answer 0

Indent correctly; your for statement should be inside the with block:

import csv    

with open('v.csv', 'w') as csvfile:
    cwriter = csv.writer(csvfile, delimiter=' ', quotechar='|', quoting=csv.QUOTE_MINIMAL)

    for w, c in p.items():
        cwriter.writerow(w + c)

Outside the with block, the file is closed.

>>> with open('/tmp/1', 'w') as f:
...     print(f.closed)
... 
False
>>> print(f.closed)
True

Answer 1

The same error can be raised by mixing tabs and spaces:

with open('/foo', 'w') as f:
 (spaces OR  tab) print f       <-- success
 (spaces AND tab) print f       <-- fail

How do I count how many rows are in a CSV in Python?

Question: How do I count how many rows are in a CSV in Python?

I'm using Python (Django framework) to read a CSV file. As you can see, I pull just 2 lines out of this CSV. What I have been trying to do is also store the total number of rows of the CSV in a variable.

How can I get the total number of rows?

file = object.myfilePath
fileObject = csv.reader(file)
for i in range(2):
    data.append(fileObject.next()) 

I have tried:

len(fileObject)
fileObject.length

Answer 0

You need to count the number of rows:

row_count = sum(1 for row in fileObject)  # fileObject is your csv.reader

Using sum() with a generator expression makes for an efficient counter, avoiding storing the whole file in memory.

If you already read 2 rows to start with, then you need to add those 2 rows to your total; rows that have already been read are not being counted.
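
A minimal sketch of that adjustment (my own illustration, with a hypothetical filename), assuming two rows were read up front as in the question:

import csv

with open('myfile.csv') as f:
    fileObject = csv.reader(f)
    data = [next(fileObject) for _ in range(2)]   # the 2 rows pulled out first
    row_count = 2 + sum(1 for row in fileObject)  # add them back to the total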


Answer 1

2018-10-29 EDIT

Thank you for the comments.

I tested several kinds of code to get the number of lines in a csv file in terms of speed. The best method is below.

with open(filename) as f:
    sum(1 for line in f)

Here is the code tested.

import timeit
import csv
import pandas as pd

filename = './sample_submission.csv'

def talktime(filename, funcname, func):
    print(f"# {funcname}")
    t = timeit.timeit(f'{funcname}("{filename}")', setup=f'from __main__ import {funcname}', number = 100) / 100
    print('Elapsed time : ', t)
    print('n = ', func(filename))
    print('\n')

def sum1forline(filename):
    with open(filename) as f:
        return sum(1 for line in f)
talktime(filename, 'sum1forline', sum1forline)

def lenopenreadlines(filename):
    with open(filename) as f:
        return len(f.readlines())
talktime(filename, 'lenopenreadlines', lenopenreadlines)

def lenpd(filename):
    return len(pd.read_csv(filename)) + 1
talktime(filename, 'lenpd', lenpd)

def csvreaderfor(filename):
    cnt = 0
    with open(filename) as f:
        cr = csv.reader(f)
        for row in cr:
            cnt += 1
    return cnt
talktime(filename, 'csvreaderfor', csvreaderfor)

def openenum(filename):
    cnt = 0
    with open(filename) as f:
        for i, line in enumerate(f,1):
            cnt += 1
    return cnt
talktime(filename, 'openenum', openenum)

The result was below.

# sum1forline
Elapsed time :  0.6327946722068599
n =  2528244


# lenopenreadlines
Elapsed time :  0.655304473598555
n =  2528244


# lenpd
Elapsed time :  0.7561274056295324
n =  2528244


# csvreaderfor
Elapsed time :  1.5571560935772661
n =  2528244


# openenum
Elapsed time :  0.773000013928679
n =  2528244

In conclusion, sum(1 for line in f) is the fastest. But there might not be a significant difference from len(f.readlines()).

sample_submission.csv is 30.2MB and has 31 million characters.


Answer 2

To do it you need to have a bit of code like my example here:

file = open("Task1.csv")
numline = len(file.readlines())
print (numline)

I hope this helps everyone.


Answer 3

Several of the above suggestions count the number of LINES in the csv file. But some CSV files will contain quoted strings which themselves contain newline characters. MS CSV files usually delimit records with \r\n, but use \n alone within quoted strings.

For a file like this, counting lines of text (as delimited by newline) in the file will give too large a result. So for an accurate count you need to use csv.reader to read the records.
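
A minimal sketch of such a record-accurate count (my own illustration; the filename is hypothetical):

import csv

# csv.reader understands newlines inside quoted fields, so this counts
# records rather than raw text lines
with open('data.csv', newline='') as f:
    record_count = sum(1 for record in csv.reader(f))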


Answer 4

First you have to open the file with open

input_file = open("nameOfFile.csv","r+")

Then use csv.reader to open the csv

reader_file = csv.reader(input_file)

Finally, you can get the number of rows with 'len'

value = len(list(reader_file))

The total code is this:

input_file = open("nameOfFile.csv","r+")
reader_file = csv.reader(input_file)
value = len(list(reader_file))

Remember that if you want to reuse the csv file, you have to call input_file.seek(0), because building a list from reader_file reads the whole file and moves the file pointer to the end.
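
A minimal sketch of that rewind (my own illustration):

import csv

input_file = open("nameOfFile.csv", "r")
value = len(list(csv.reader(input_file)))  # consumes the file; the pointer is now at the end

input_file.seek(0)                         # rewind so the file can be read again
first_row = next(csv.reader(input_file))
input_file.close()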


Answer 5

row_count = sum(1 for line in open(filename)) worked for me.

Note: sum(1 for line in csv.reader(filename)) seems to calculate the length of the first line
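
The reason (my own note): csv.reader expects an iterable of lines, such as an open file object. Given a bare filename string, it iterates over the string's characters instead, so the result has nothing to do with the file's rows. A corrected sketch:

import csv

# Pass an open file object, not the filename string, to csv.reader
with open(filename) as f:
    row_count = sum(1 for line in csv.reader(f))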


Answer 6

numline = len(file_read.readlines())

Answer 7

When you instantiate a csv.reader object and iterate over the whole file, you can access an instance variable called line_num, which provides the row count:

import csv
with open('csv_path_file') as f:
    csv_reader = csv.reader(f)
    for row in csv_reader:
        pass
    print(csv_reader.line_num)

Answer 8

import csv
count = 0
with open('filename.csv', 'rb') as count_file:
    csv_reader = csv.reader(count_file)
    for row in csv_reader:
        count += 1

print count

Answer 9

Use "list" to get a more workable object.

You can then count, skip, and mutate to your heart's desire:

list(fileObject) #list values

len(list(fileObject)) # get length of file lines

list(fileObject)[10:] # skip first 10 lines

Answer 10

You can also use a classic for loop:

import pandas as pd
df = pd.read_csv('your_file.csv')

count = 0
for i in df['a_column']:
    count = count + 1

print(count)

Answer 11

You might want to try something as simple as the following on the command line:

sed -n '$=' filename

or

wc -l filename

Answer 12

This works for csv and all files containing strings in Unix-based OSes:

import os

numOfLines = int(os.popen('wc -l < file.csv').read()[:-1])

In case the csv file contains a header row, you can deduct one from numOfLines above:

numOfLines = numOfLines - 1

Answer 13

I think we can improve the best answer a little bit. I'm using:

row_count = sum(1 for _ in reader)  # avoid naming this len, which would shadow the built-in

Moreover, we shouldn't forget that pythonic code does not always have the best performance in a project. For example: if we can do more operations on the same data set at the same time, it's better to do them all in the same loop instead of making two or more separate loops.
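
A minimal sketch of that single-pass idea (my own illustration; the filename is hypothetical): count the rows and find the widest row in one loop instead of two:

import csv

row_count = 0
max_fields = 0
with open('data.csv') as f:
    for row in csv.reader(f):
        row_count += 1
        max_fields = max(max_fields, len(row))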


Answer 14

Try:

import pandas as pd

data = pd.read_csv("data.csv")
data.shape

In the output you will see something like (aa, bb), where aa is the number of rows and bb is the number of columns.


Answer 15

import pandas as pd

data = pd.read_csv('data.csv')
totalInstances = len(data)

Reading a huge .csv file

Question: Reading a huge .csv file

I’m currently trying to read data from .csv files in Python 2.7 with up to 1 million rows, and 200 columns (files range from 100mb to 1.6gb). I can do this (very slowly) for the files with under 300,000 rows, but once I go above that I get memory errors. My code looks like this:

def getdata(filename, criteria):
    data=[]
    for criterion in criteria:
        data.append(getstuff(filename, criterion))
    return data

def getstuff(filename, criterion):
    import csv
    data=[]
    with open(filename, "rb") as csvfile:
        datareader=csv.reader(csvfile)
        for row in datareader: 
            if row[3]=="column header":
                data.append(row)
            elif len(data)<2 and row[3]!=criterion:
                pass
            elif row[3]==criterion:
                data.append(row)
            else:
                return data

The reason for the else clause in the getstuff function is that all the elements which fit the criterion will be listed together in the csv file, so I leave the loop when I get past them to save time.

My questions are:

  1. How can I manage to get this to work with the bigger files?

  2. Is there any way I can make it faster?

My computer has 8gb RAM, running 64bit Windows 7, and the processor is 3.40 GHz (not certain what information you need).


Answer 0

You are reading all rows into a list, then processing that list. Don’t do that.

Process your rows as you produce them. If you need to filter the data first, use a generator function:

import csv

def getstuff(filename, criterion):
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        count = 0
        for row in datareader:
            if row[3] == criterion:
                yield row
                count += 1
            elif count:
                # done when having read a consecutive series of rows 
                return

I also simplified your filter test; the logic is the same but more concise.

Because you are only matching a single sequence of rows matching the criterion, you could also use:

import csv
from itertools import dropwhile, takewhile

def getstuff(filename, criterion):
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        # first row, plus any subsequent rows that match, then stop
        # reading altogether
        # Python 2: use `for row in takewhile(...): yield row` instead
        # instead of `yield from takewhile(...)`.
        yield from takewhile(
            lambda r: r[3] == criterion,
            dropwhile(lambda r: r[3] != criterion, datareader))
        return

You can now loop over getstuff() directly. Do the same in getdata():

def getdata(filename, criteria):
    for criterion in criteria:
        for row in getstuff(filename, criterion):
            yield row

Now loop directly over getdata() in your code:

for row in getdata(somefilename, sequence_of_criteria):
    # process row

You now only hold one row in memory, instead of your thousands of lines per criterion.

yield makes a function a generator function, which means it won’t do any work until you start looping over it.


Answer 1

Although Martijn's answer is probably best, here is a more intuitive way for beginners to process large csv files. It allows you to process groups of rows, or chunks, at a time.

import pandas as pd
chunksize = 10 ** 8
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)

Answer 2

I do a fair amount of vibration analysis and look at large data sets (tens and hundreds of millions of points). My testing showed the pandas.read_csv() function to be 20 times faster than numpy.genfromtxt(). And the genfromtxt() function is 3 times faster than the numpy.loadtxt(). It seems that you need pandas for large data sets.

I posted the code and data sets I used in this testing on a blog discussing MATLAB vs Python for vibration analysis.
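
For reference, a minimal timing sketch along those lines (my own illustration; the filename is hypothetical, it assumes a purely numeric CSV with one header row, and the results will vary by file and library versions):

import timeit

setup = "import pandas as pd; import numpy as np"
print(timeit.timeit("pd.read_csv('big.csv')", setup=setup, number=3))
print(timeit.timeit("np.genfromtxt('big.csv', delimiter=',', skip_header=1)", setup=setup, number=3))
print(timeit.timeit("np.loadtxt('big.csv', delimiter=',', skiprows=1)", setup=setup, number=3))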


Answer 3

What worked for me, and is super fast, is:

import pandas as pd
import dask.dataframe as dd
import time
t=time.clock()
df_train = dd.read_csv('../data/train.csv', usecols=[col1, col2])
df_train=df_train.compute()
print("load train: " , time.clock()-t)

Another working solution is:

import pandas as pd 
from tqdm import tqdm

PATH = '../data/train.csv'
chunksize = 500000 
traintypes = {
'col1':'category',
'col2':'str'}

cols = list(traintypes.keys())

df_list = [] # list to hold the batch dataframe

for df_chunk in tqdm(pd.read_csv(PATH, usecols=cols, dtype=traintypes, chunksize=chunksize)):
    # Can process each chunk of dataframe here
    # clean_data(), feature_engineer(),fit()

    # Alternatively, append the chunk to list and merge all
    df_list.append(df_chunk) 

# Merge all dataframes into one dataframe
X = pd.concat(df_list)

# Delete the dataframe list to release memory
del df_list
del df_chunk

Answer 4

For anyone who lands on this question: using pandas with 'chunksize' and 'usecols' helped me read a huge zipped file faster than the other proposed options.

import pandas as pd

sample_cols_to_keep =['col_1', 'col_2', 'col_3', 'col_4','col_5']

# First set up the dataframe iterator; 'usecols' filters the columns and 'chunksize' sets the number of rows per chunk (change these as you wish)
df_iter = pd.read_csv('../data/huge_csv_file.csv.gz', compression='gzip', chunksize=20000, usecols=sample_cols_to_keep) 

# this list will store the filtered dataframes for later concatenation 
df_lst = [] 

# Iterate over the file chunk by chunk, filter on the criteria, and append to the list
for df_ in df_iter:
    # keep e.g. only rows where 'col_1' is greater than zero
    tmp_df = (df_.rename(columns={col: col.lower() for col in df_.columns})
                 .pipe(lambda x: x[x.col_1 > 0]))
    df_lst += [tmp_df.copy()]

# And finally combine the filtered df_lst into the final, larger 'df_final' dataframe
df_final = pd.concat(df_lst)

Answer 5

Here's another solution for Python 3:

import csv
with open(filename, "r") as csvfile:
    datareader = csv.reader(csvfile)
    count = 0
    for row in datareader:
        if row[3] in ("column header", criterion):
            doSomething(row)
            count += 1
        elif count > 2:
            break

Here datareader reads lazily, like a generator: each row is pulled from the file only as the loop asks for it, so the whole file never sits in memory.


Answer 6

If you are using pandas and have lots of RAM (enough to read the whole file into memory) try using pd.read_csv with low_memory=False, e.g.:

import pandas as pd
data = pd.read_csv('file.csv', low_memory=False)
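
low_memory=False makes pandas read the whole file before inferring column types, instead of guessing chunk by chunk, which avoids mixed-dtype surprises. If you already know the column types, passing dtype explicitly is usually the more memory-friendly fix; a small sketch with assumed column names:

import pandas as pd

data = pd.read_csv('file.csv', dtype={'id': 'int64', 'name': 'category'})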

Importing a CSV file into an sqlite3 database table using Python

Question: Importing a CSV file into an sqlite3 database table using Python

I have a CSV file and I want to bulk-import this file into my sqlite3 database using Python. The command is ".import …..", but it seems that it cannot work like this. Can anyone give me an example of how to do it in sqlite3? I am using Windows, just in case. Thanks.


Answer 0

import csv, sqlite3

con = sqlite3.connect(":memory:") # use a filename such as 'your_filename.db' to persist to disk
cur = con.cursor()
cur.execute("CREATE TABLE t (col1, col2);") # use your column names here

with open('data.csv','r') as fin: # `with` statement available in 2.5+
    # csv.DictReader uses first line in file for column headings by default
    dr = csv.DictReader(fin) # comma is default delimiter
    to_db = [(i['col1'], i['col2']) for i in dr]

cur.executemany("INSERT INTO t (col1, col2) VALUES (?, ?);", to_db)
con.commit()
con.close()
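
A quick sanity check, run before the con.close() above (an in-memory database disappears as soon as its connection is closed):

cur.execute("SELECT COUNT(*) FROM t;")
print(cur.fetchone()[0])  # number of imported rows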

Answer 1

Creating an sqlite connection to a file on disk is left as an exercise for the reader … but there is now a two-liner made possible by the pandas library

import pandas
df = pandas.read_csv(csvfile)
df.to_sql(table_name, conn, if_exists='append', index=False)
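
Filling in that exercise, a minimal end-to-end sketch (the database, csv, and table names are assumptions):

import sqlite3
import pandas as pd

conn = sqlite3.connect('my.db')  # the on-disk connection left as an exercise above
df = pd.read_csv('data.csv')
df.to_sql('my_table', conn, if_exists='append', index=False)
conn.close()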

Answer 2

My 2 cents (more generic):

import csv, sqlite3
import logging

def _get_col_datatypes(fin):
    dr = csv.DictReader(fin) # comma is default delimiter
    fieldTypes = {}
    for entry in dr:
        fieldsLeft = [f for f in dr.fieldnames if f not in fieldTypes.keys()]
        if not fieldsLeft: break # We're done
        for field in fieldsLeft:
            data = entry[field]

            # Need data to decide
            if len(data) == 0:
                continue

            if data.isdigit():
                fieldTypes[field] = "INTEGER"
            else:
                fieldTypes[field] = "TEXT"
        # TODO: Currently there's no support for DATE in SQLite

    if len(fieldsLeft) > 0:
        raise Exception("Failed to find all the columns data types - Maybe some are empty?")

    return fieldTypes


def escapingGenerator(f):
    for line in f:
        yield line.encode("ascii", "xmlcharrefreplace").decode("ascii")


def csvToDb(csvFile, outputToFile = False):
    # TODO: implement output to file

    with open(csvFile,mode='r', encoding="ISO-8859-1") as fin:
        dt = _get_col_datatypes(fin)

        fin.seek(0)

        reader = csv.DictReader(fin)

        # Keep the order of the columns name just as in the CSV
        fields = reader.fieldnames
        cols = []

        # Set field and type
        for f in fields:
            cols.append("%s %s" % (f, dt[f]))

        # Generate create table statement:
        stmt = "CREATE TABLE ads (%s)" % ",".join(cols)

        con = sqlite3.connect(":memory:")
        cur = con.cursor()
        cur.execute(stmt)

        fin.seek(0)


        reader = csv.reader(escapingGenerator(fin))

        # Generate insert statement:
        stmt = "INSERT INTO ads VALUES(%s);" % ','.join('?' * len(cols))

        cur.executemany(stmt, reader)
        con.commit()

    return con

Answer 3

The .import command is a feature of the sqlite3 command-line tool. To do it in Python, you should simply load the data using whatever facilities Python has, such as the csv module, and inserting the data as per usual.

This way, you also have control over what types are inserted, rather than relying on sqlite3’s seemingly undocumented behaviour.


Answer 4

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys, csv, sqlite3

def main():
    con = sqlite3.connect(sys.argv[1]) # database file input
    cur = con.cursor()
    cur.executescript("""
        DROP TABLE IF EXISTS t;
        CREATE TABLE t (COL1 TEXT, COL2 TEXT);
        """) # drops the table if it already exists and creates a fresh one.

    with open(sys.argv[2], "rb") as f: # CSV file input
        reader = csv.reader(f, delimiter=',') # the file has no header row
        for row in reader:
            to_db = [unicode(row[0], "utf8"), unicode(row[1], "utf8")] # decode the CSV text as UTF-8
            cur.execute("INSERT INTO t (COL1, COL2) VALUES(?, ?);", to_db) # the table name must match the CREATE TABLE above
            con.commit()
    con.close() # closes connection to database

if __name__=='__main__':
    main()

Answer 5

Many thanks for bernie’s answer! Had to tweak it a bit – here’s what worked for me:

import csv, sqlite3
conn = sqlite3.connect("pcfc.sl3")
curs = conn.cursor()
curs.execute("CREATE TABLE PCFC (id INTEGER PRIMARY KEY, type INTEGER, term TEXT, definition TEXT);")
reader = csv.reader(open('PC.txt', 'r'), delimiter='|')
for row in reader:
    to_db = [unicode(row[0], "utf8"), unicode(row[1], "utf8"), unicode(row[2], "utf8")]
    curs.execute("INSERT INTO PCFC (type, term, definition) VALUES (?, ?, ?);", to_db)
conn.commit()

My text file (PC.txt) looks like this:

1 | Term 1 | Definition 1
2 | Term 2 | Definition 2
3 | Term 3 | Definition 3

Answer 6

You’re right that .import is the way to go, but that’s a command from the SQLite3.exe shell. A lot of the top answers to this question involve native python loops, but if your files are large (mine are 10^6 to 10^7 records), you want to avoid reading everything into pandas or using a native python list comprehension/loop (though I did not time them for comparison).

For large files, I believe the best option is to create the empty table in advance using sqlite3.execute("CREATE TABLE..."), strip the headers from your CSV files, and then use subprocess.run() to execute sqlite's import statement. Since I believe the last part is the most pertinent, I will start with that.

subprocess.run()

import subprocess
from pathlib import Path
db_name = Path('my.db').resolve()
csv_file = Path('file.csv').resolve()
result = subprocess.run(['sqlite3',
                         str(db_name),
                         '-cmd',
                         '.mode csv',
                         '.import '+str(csv_file).replace('\\','\\\\')
                                 +' <table_name>'],
                        capture_output=True)

Explanation
From the command line, the command you're looking for is sqlite3 my.db -cmd ".mode csv" ".import file.csv table". subprocess.run() runs a command-line process. The argument to subprocess.run() is a sequence of strings which are interpreted as a command followed by all of its arguments.

  • sqlite3 my.db opens the database
  • the -cmd flag after the database allows you to pass multiple follow-on commands to the sqlite program. In the shell, each command has to be in quotes, but here they just need to be their own element of the sequence
  • '.mode csv' does what you’d expect
  • '.import '+str(csv_file).replace('\\','\\\\')+' <table_name>' is the import command.
    Unfortunately, since subprocess passes all follow-ons to -cmd as quoted strings, you need to double up your backslashes if you have a Windows directory path.

Stripping Headers

Not really the main point of the question, but here's what I used. Again, I didn't want to read the whole file into memory at any point:

import shutil

# 'csv' here is the path to the source file
with open(csv, "r") as source:
    source.readline()                       # consume the header line
    with open(str(csv)+"_nohead", "w") as target:
        shutil.copyfileobj(source, target)  # stream the rest without loading it all
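
Since capture_output=True swallows sqlite3's console output, it is worth inspecting the returned CompletedProcess; a small follow-up sketch:

if result.returncode != 0:
    print(result.stderr.decode())  # sqlite3's error text, if the import failed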


Answer 7

Based on Guy L's solution (love it), but this version can handle escaped fields.

import csv, sqlite3

def _get_col_datatypes(fin):
    dr = csv.DictReader(fin) # comma is default delimiter
    fieldTypes = {}
    for entry in dr:
        fieldsLeft = [f for f in dr.fieldnames if f not in fieldTypes.keys()]
        if not fieldsLeft: break # We're done
        for field in fieldsLeft:
            data = entry[field]

            # Need data to decide
            if len(data) == 0:
                continue

            if data.isdigit():
                fieldTypes[field] = "INTEGER"
            else:
                fieldTypes[field] = "TEXT"
        # TODO: Currently there's no support for DATE in SQLite

    if len(fieldsLeft) > 0:
        raise Exception("Failed to find all the columns data types - Maybe some are empty?")

    return fieldTypes


def escapingGenerator(f):
    for line in f:
        yield line.encode("ascii", "xmlcharrefreplace").decode("ascii")


def csvToDb(csvFile,dbFile,tablename, outputToFile = False):

    # TODO: implement output to file

    with open(csvFile,mode='r', encoding="ISO-8859-1") as fin:
        dt = _get_col_datatypes(fin)

        fin.seek(0)

        reader = csv.DictReader(fin)

        # Keep the order of the columns name just as in the CSV
        fields = reader.fieldnames
        cols = []

        # Set field and type
        for f in fields:
            cols.append("\"%s\" %s" % (f, dt[f]))

        # Generate create table statement:
        stmt = "create table if not exists \"" + tablename + "\" (%s)" % ",".join(cols)
        print(stmt)
        con = sqlite3.connect(dbFile)
        cur = con.cursor()
        cur.execute(stmt)

        fin.seek(0)


        reader = csv.reader(escapingGenerator(fin))

        # Generate insert statement:
        stmt = "INSERT INTO \"" + tablename + "\" VALUES(%s);" % ','.join('?' * len(cols))

        cur.executemany(stmt, reader)
        con.commit()
        con.close()

Answer 8

You can do this using blaze & odo efficiently

import blaze as bz
csv_path = 'data.csv'
bz.odo(csv_path, 'sqlite:///data.db::data')

Odo will store the csv file into data.db (an sqlite database) under the schema data

Or you can use odo directly, without blaze. Either way is fine. Read the odo documentation


Answer 9

If the CSV file must be imported as part of a python program, then for simplicity and efficiency, you could use os.system along the lines suggested by the following:

import os

# note: <<< (a here-string) is a bash feature; os.system uses the default
# shell (sh on many systems), which may not support it
cmd = """sqlite3 database.db <<< ".import input.csv mytable" """

rc = os.system(cmd)

print(rc)

The point is that by specifying the filename of the database, the data will automatically be saved, assuming there are no errors reading it.



Answer 11

In the interest of simplicity, you could use the sqlite3 command-line tool from the Makefile of your project.

%.sql3: %.csv
    rm -f $@
    sqlite3 $@ -echo -cmd ".mode csv" ".import $< $*"
%.dump: %.sql3
    sqlite3 $< "select * from $*"

make test.sql3 then creates the sqlite database from an existing test.csv file, with a single table "test". You can then run make test.dump to verify the contents.


Answer 12

I've found that it can be necessary to break up the transfer of data from the csv to the database in chunks so as not to run out of memory. This can be done like this:

import csv
import sqlite3
from operator import itemgetter

# Establish connection
conn = sqlite3.connect("mydb.db")

# Create the table 
conn.execute(
    """
    CREATE TABLE persons(
        person_id INTEGER,
        last_name TEXT, 
        first_name TEXT, 
        address TEXT
    )
    """
)

# These are the columns from the csv that we want
cols = ["person_id", "last_name", "first_name", "address"]

# If the csv file is huge, we instead add the data in chunks
chunksize = 10000

# Parse csv file and populate db in chunks
with conn, open("persons.csv") as f:
    reader = csv.DictReader(f)

    chunk = []
    for i, row in enumerate(reader):

        if i % chunksize == 0 and i > 0:
            conn.executemany(
                """
                INSERT INTO persons
                    VALUES(?, ?, ?, ?)
                """, chunk
            )
            chunk = []

        items = itemgetter(*cols)(row)
        chunk.append(items)

    # Insert whatever is left over in the final, partial chunk
    if chunk:
        conn.executemany(
            """
            INSERT INTO persons
                VALUES(?, ?, ?, ?)
            """, chunk
        )


Python CSV error: line contains NULL byte

Question: Python CSV error: line contains NULL byte

I’m working with some CSV files, with the following code:

reader = csv.reader(open(filepath, "rU"))
try:
    for row in reader:
        print 'Row read successfully!', row
except csv.Error, e:
    sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))

And one file is throwing this error:

file my.csv, line 1: line contains NULL byte

What can I do? Google seems to suggest that it may be an Excel file that’s been saved as a .csv improperly. Is there any way I can get round this problem in Python?

== UPDATE ==

Following @JohnMachin’s comment below, I tried adding these lines to my script:

print repr(open(filepath, 'rb').read(200)) # dump 1st 200 bytes of file
data = open(filepath, 'rb').read()
print data.find('\x00')
print data.count('\x00')

And this is the output I got:

'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00\x00\x00\x00\x00\x00\x00\ .... <snip>
8
13834

So the file does indeed contain NUL bytes.


Answer 0

As @S.Lott says, you should be opening your files in ‘rb’ mode, not ‘rU’ mode. However that may NOT be causing your current problem. As far as I know, using ‘rU’ mode would mess you up if there are embedded \r in the data, but not cause any other dramas. I also note that you have several files (all opened with ‘rU’ ??) but only one causing a problem.

If the csv module says that you have a “NULL” (silly message, should be “NUL”) byte in your file, then you need to check out what is in your file. I would suggest that you do this even if using ‘rb’ makes the problem go away.

repr() is (or wants to be) your debugging friend. It will show unambiguously what you've got, in a platform-independent fashion (which is helpful to helpers who are unaware what od is or does). Do this:

print repr(open('my.csv', 'rb').read(200)) # dump 1st 200 bytes of file

and carefully copy/paste (don’t retype) the result into an edit of your question (not into a comment).

Also note that if the file is really dodgy, e.g. no \r or \n within a reasonable distance from the start of the file, the line number reported by reader.line_num will be (unhelpfully) 1. Find where the first \x00 is (if any) by doing

data = open('my.csv', 'rb').read()
print data.find('\x00')

and make sure that you dump at least that many bytes with repr or od.

What does data.count('\x00') tell you? If there are many, you may want to do something like

for i, c in enumerate(data):
    if c == '\x00':
        print i, repr(data[i-30:i]) + ' *NUL* ' + repr(data[i+1:i+31])

so that you can see the NUL bytes in context.

If you can see \x00 in the output (or \0 in your od -c output), then you definitely have NUL byte(s) in the file, and you will need to do something like this:

fi = open('my.csv', 'rb')
data = fi.read()
fi.close()
fo = open('mynew.csv', 'wb')
fo.write(data.replace('\x00', ''))
fo.close()

By the way, have you looked at the file (including the last few lines) with a text editor? Does it actually look like a reasonable CSV file like the other (no “NULL byte” exception) files?


Answer 1

data_initial = open("staff.csv", "rb")
data = csv.reader((line.replace('\0','') for line in data_initial), delimiter=",")

This works for me.


Answer 2

Reading it as UTF-16 was also my problem.

Here’s my code that ended up working:

f=codecs.open(location,"rb","utf-16")
csvread=csv.reader(f,delimiter='\t')
csvread.next()
for row in csvread:
    print row

Where location is the path to your csv file.


Answer 3

I bumped into this problem as well. Using the Python csv module, I was trying to read an XLS file created in MS Excel and running into the NULL byte error you were getting. I looked around and found the xlrd Python module for reading and formatting data from MS Excel spreadsheet files. With the xlrd module, I am not only able to read the file properly, but I can also access many different parts of the file in a way I couldn’t before.

I thought it might help you.
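
A minimal sketch of that idea with xlrd (the file name and the choice of the first sheet are assumptions):

import xlrd

book = xlrd.open_workbook('my.xls')
sheet = book.sheet_by_index(0)
for rownum in range(sheet.nrows):
    print(sheet.row_values(rownum))  # one list of cell values per row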


Answer 4

Converting the encoding of the source file from UTF-16 to UTF-8 solved my problem.

How to convert a file to utf-8 in Python?

import codecs
BLOCKSIZE = 1048576 # or some other, desired size in bytes
with codecs.open(sourceFileName, "r", "utf-16") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)

Answer 5

You could just inline a generator to filter out the null values if you want to pretend they don’t exist. Of course this is assuming the null bytes are not really part of the encoding and really are some kind of erroneous artifact or bug.

with open(filepath, "rb") as f:
    reader = csv.reader( (line.replace('\0','') for line in f) )

    try:
        for row in reader:
            print 'Row read successfully!', row
    except csv.Error, e:
        sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))

Answer 6

Why are you doing this?

 reader = csv.reader(open(filepath, "rU"))

The docs are pretty clear that you must do this:

with open(filepath, "rb") as src:
    reader= csv.reader( src )

The mode must be “rb” to read.

http://docs.python.org/library/csv.html#csv.reader

If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.


Answer 7

Apparently it's an XLS file, not a CSV file, as http://www.garykessler.net/library/file_sigs.html confirms.


Answer 8

Instead of the csv reader, I read the file directly and use the string split function:

lines = open(input_file,'rb') 

for line_all in lines:

    line=line_all.replace('\x00', '').split(";")

Answer 9

I got the same error. Saved the file in UTF-8 and it worked.


Answer 10

This happened to me when I created a CSV file with OpenOffice Calc. It didn’t happen when I created the CSV file in my text editor, even if I later edited it with Calc.

I solved my problem by copy-pasting in my text editor the data from my Calc-created file to a new editor-created file.


Answer 11

I had the same problem opening a CSV produced from a webservice which inserted NULL bytes in empty headers. I did the following to clean the file:

with codecs.open ('my.csv', 'rb', 'utf-8') as myfile:
    data = myfile.read()
    # clean file first if dirty
    if data.count( '\x00' ):
        print 'Cleaning...'
        with codecs.open('my.csv.tmp', 'w', 'utf-8') as of:
            of.write(data.replace('\x00', ''))

        shutil.move( 'my.csv.tmp', 'my.csv' )

with codecs.open ('my.csv', 'rb', 'utf-8') as myfile:
    myreader = csv.reader(myfile, delimiter=',')
    # Continue with your business logic here...

Disclaimer: Be aware that this overwrites your original data. Make sure you have a backup copy of it. You have been warned!


Answer 12

For all those ‘rU’ filemode haters: I just tried opening a CSV file from a Windows machine on a Mac with the ‘rb’ filemode and I got this error from the csv module:

Error: new-line character seen in unquoted field - do you need to 
open the file in universal-newline mode?

Opening the file in ‘rU’ mode works fine. I love universal-newline mode — it saves me so much hassle.


Answer 13

I encountered this when using scrapy and fetching a zipped csv file without a correct middleware to unzip the response body before handing it to the csv reader. Hence the file was not really a csv file and threw the "line contains NULL byte" error accordingly.


Answer 14

Have you tried using gzip.open?

with gzip.open('my.csv', 'rb') as data_file:

I was trying to open a file that had been compressed but had the extension '.csv' instead of '.csv.gz'. This error kept showing up until I used gzip.open.


Answer 15

One more case: if the CSV file contains empty rows, this error may show up. A check that the row is non-empty is necessary before we proceed to write or read:

for row in csvreader:
    if row:
        do_something(row)  # placeholder for your own row handling

I solved my issue by adding this check in the code.


How to convert a CSV file to multi-line JSON?

Question: How to convert a CSV file to multi-line JSON?

Here’s my code, really simple stuff…

import csv
import json

csvfile = open('file.csv', 'r')
jsonfile = open('file.json', 'w')

fieldnames = ("FirstName","LastName","IDNumber","Message")
reader = csv.DictReader( csvfile, fieldnames)
out = json.dumps( [ row for row in reader ] )
jsonfile.write(out)

Declare some field names; the reader uses CSV to read the file, and the field names to dump the file to a JSON format. Here's the problem…

Each record in the CSV file is on a different row. I want the JSON output to be the same way. The problem is it dumps it all on one giant, long line.

I’ve tried using something like for line in csvfile: and then running my code below that with reader = csv.DictReader( line, fieldnames) which loops through each line, but it does the entire file on one line, then loops through the entire file on another line… continues until it runs out of lines.

Any suggestions for correcting this?

Edit: To clarify, currently I have: (every record on line 1)

[{"FirstName":"John","LastName":"Doe","IDNumber":"123","Message":"None"},{"FirstName":"George","LastName":"Washington","IDNumber":"001","Message":"Something"}]

What I’m looking for: (2 records on 2 lines)

{"FirstName":"John","LastName":"Doe","IDNumber":"123","Message":"None"}
{"FirstName":"George","LastName":"Washington","IDNumber":"001","Message":"Something"}

Not each individual field indented/on a separate line, but each record on its own line.

Some sample input.

"John","Doe","001","Message1"
"George","Washington","002","Message2"

Answer 0

The problem with your desired output is that it is not a valid JSON document; it's a stream of JSON documents!

That's okay if it's what you need, but it means that for each document you want in your output, you'll have to call json.dumps.

Since the newline you want separating your documents is not contained in those documents, you’re on the hook for supplying it yourself. So we just need to pull the loop out of the call to json.dump and interpose newlines for each document written.

import csv
import json

csvfile = open('file.csv', 'r')
jsonfile = open('file.json', 'w')

fieldnames = ("FirstName","LastName","IDNumber","Message")
reader = csv.DictReader( csvfile, fieldnames)
for row in reader:
    json.dump(row, jsonfile)
    jsonfile.write('\n')

Answer 1

You can use a Pandas DataFrame to achieve this, as in the following example:

import pandas as pd
csv_file = pd.read_csv("path/to/file.csv", sep = ",", header = 0, index_col = False)  # read_csv already returns a DataFrame
csv_file.to_json("/path/to/new/file.json", orient = "records", date_format = "epoch", double_precision = 10, force_ascii = True, date_unit = "ms", default_handler = None)
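
Note that orient="records" on its own still produces one big JSON array on a single line; if you want one record per line, as the question asks, recent pandas versions also accept lines=True (valid only together with orient="records"):

csv_file.to_json("/path/to/new/file.json", orient="records", lines=True)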

Answer 2

I took @SingleNegationElimination’s response and simplified it into a three-liner that can be used in a pipeline:

import csv
import json
import sys

for row in csv.DictReader(sys.stdin):
    json.dump(row, sys.stdout)
    sys.stdout.write('\n')
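
Saved as, say, csv2json.py (the name is just an example), this drops straight into a shell pipeline: python csv2json.py < file.csv > file.json.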

Answer 3

import csv
import json

file = 'csv_file_name.csv'
json_file = 'output_file_name.json'

#Read CSV File
def read_CSV(file, json_file):
    csv_rows = []
    with open(file) as csvfile:
        reader = csv.DictReader(csvfile)
        field = reader.fieldnames
        for row in reader:
            csv_rows.extend([{field[i]:row[field[i]] for i in range(len(field))}])
        convert_write_json(csv_rows, json_file)

#Convert csv data into json
def convert_write_json(data, json_file):
    with open(json_file, "w") as f:
        # write once, pretty-printed; writing a second compact copy would corrupt the file
        f.write(json.dumps(data, sort_keys=False, indent=4, separators=(',', ': ')))


read_CSV(file,json_file)

Documentation of json.dumps()


Answer 4

You can try this

import csvmapper

# how does the object look
mapper = csvmapper.DictMapper([ 
  [ 
     { 'name' : 'FirstName'},
     { 'name' : 'LastName' },
     { 'name' : 'IDNumber', 'type':'int' },
     { 'name' : 'Messages' }
  ]
 ])

# parser instance
parser = csvmapper.CSVParser('sample.csv', mapper)
# conversion service
converter = csvmapper.JSONConverter(parser)

print converter.doConvert(pretty=True)

Edit:

Simpler approach

import csvmapper

fields = ('FirstName', 'LastName', 'IDNumber', 'Messages')
parser = csvmapper.CSVParser('sample.csv', csvmapper.FieldMapper(fields))

converter = csvmapper.JSONConverter(parser)

print converter.doConvert(pretty=True)

Answer 5

indent参数添加到json.dumps

 data = {'this': ['has', 'some', 'things'],
         'in': {'it': 'with', 'some': 'more'}}
 print(json.dumps(data, indent=4))

另请注意,您可以简单地使用json.dumpopen jsonfile

json.dump(data, jsonfile)

Add the indent parameter to json.dumps

 data = {'this': ['has', 'some', 'things'],
         'in': {'it': 'with', 'some': 'more'}}
 print(json.dumps(data, indent=4))

Also note that, you can simply use json.dump with the open jsonfile:

json.dump(data, jsonfile)

Answer 6

I see this is old, but I needed the code from SingleNegationElimination; however, I had issues with data containing non-UTF-8 characters. These appeared in fields I was not overly concerned with, so I chose to ignore them, though that took some effort. I am new to Python, so with some trial and error I got it to work. The code is a copy of SingleNegationElimination's, with extra handling of UTF-8. I tried to do it with https://docs.python.org/2.7/library/csv.html but in the end gave up. The code below worked.

import csv, json

csvfile = open('file.csv', 'r')
jsonfile = open('file.json', 'w')

fieldnames = ("Scope","Comment","OOS Code","In RMF","Code","Status","Name","Sub Code","CAT","LOB","Description","Owner","Manager","Platform Owner")
reader = csv.DictReader(csvfile , fieldnames)

code = ''
for row in reader:
    try:
        print('+' + row['Code'])
        for key in row:
            row[key] = row[key].decode('utf-8', 'ignore').encode('utf-8')      
        json.dump(row, jsonfile)
        jsonfile.write('\n')
    except:
        print('-' + row['Code'])
        raise

Answer 7

How about using Pandas to read the csv file into a DataFrame (pd.read_csv), then manipulating the columns if you want (dropping them or updating values) and finally converting the DataFrame back to JSON (pd.DataFrame.to_json).

Note: I haven’t checked how efficient this will be but this is definitely one of the easiest ways to manipulate and convert a large csv to json.
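
A minimal sketch of that flow, under assumed file and column names (the dropped column is purely illustrative):

import pandas as pd

df = pd.read_csv('file.csv')
df = df.drop(columns=['unwanted_column'])              # manipulate columns as needed
df.to_json('file.json', orient='records', lines=True)  # one JSON record per line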


Answer 8

As a slight improvement to @MONTYHS's answer, iterating through a tuple of fieldnames:

import csv
import json

csvfilename = 'filename.csv'
jsonfilename = csvfilename.split('.')[0] + '.json'
csvfile = open(csvfilename, 'r')
jsonfile = open(jsonfilename, 'w')
reader = csv.DictReader(csvfile)

fieldnames = ('FirstName', 'LastName', 'IDNumber', 'Message')

output = []

for each in reader:
  row = {}
  for field in fieldnames:
    row[field] = each[field]
  output.append(row)  # append inside the loop, once per record

json.dump(output, jsonfile, indent=2, sort_keys=True)

Answer 9

import csv
import json
csvfile = csv.DictReader(open('filename.csv', 'r'))
output =[]
for each in csvfile:
    row ={}
    row['FirstName'] = each['FirstName']
    row['LastName']  = each['LastName']
    row['IDNumber']  = each ['IDNumber']
    row['Message']   = each['Message']
    output.append(row)
json.dump(output,open('filename.json','w'),indent=4,sort_keys=False)