Tag archive: xlrd

Pandas: Looking up the list of sheets in an Excel file

Question: Pandas: Looking up the list of sheets in an Excel file

The new version of Pandas uses the following interface to load Excel files:

read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])

but what if I don’t know the sheets that are available?

For example, I am working with Excel files that have the following sheets

Data 1, Data 2 …, Data N, foo, bar

but I don’t know N a priori.

Is there any way to get the list of sheets from an excel document in Pandas?


Answer 0

You can still use the ExcelFile class (and the sheet_names attribute):

xl = pd.ExcelFile('foo.xls')

xl.sheet_names  # see all sheet names

xl.parse(sheet_name)  # read a specific sheet to DataFrame

See the docs for parse for more options.
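
Once sheet_names is available, picking out the "Data N" sheets from the question is plain string filtering. The following is a minimal sketch; the helper name data_sheets and the exact "Data <number>" naming pattern are assumptions based on the question, not part of pandas.

```python
import re

def data_sheets(sheet_names):
    """Return the names matching 'Data <number>', ordered by that number."""
    pattern = re.compile(r"^Data (\d+)$")
    matches = []
    for name in sheet_names:
        m = pattern.match(name)
        if m:
            matches.append((int(m.group(1)), name))
    return [name for _, name in sorted(matches)]

# Hypothetical usage with pandas (needs an Excel engine installed):
#   xl = pd.ExcelFile('foo.xls')
#   frames = {name: xl.parse(name) for name in data_sheets(xl.sheet_names)}
print(data_sheets(["Data 2", "foo", "Data 10", "Data 1", "bar"]))
# ['Data 1', 'Data 2', 'Data 10']
```

Sorting on the parsed number rather than the raw string keeps "Data 10" after "Data 2".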


Answer 1

You should explicitly pass None as the second parameter (the sheet name, called sheet_name in current pandas), like this:

df = pandas.read_excel("/yourPath/FileName.xlsx", None)

"df" then holds all sheets as a dictionary of DataFrames. You can verify this by running:

df.keys()

The result looks like this:

[u'201610', u'201601', u'201701', u'201702', u'201703', u'201704', u'201705', u'201706', u'201612', u'fund', u'201603', u'201602', u'201605', u'201607', u'201606', u'201608', u'201512', u'201611', u'201604']

Please refer to the pandas docs for more details: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html


Answer 2

This is the fastest way I have found, inspired by @divingTobi’s answer. All the answers based on xlrd, openpyxl or pandas are slow for me, as they all load the whole file first.

from zipfile import ZipFile
from bs4 import BeautifulSoup  # you also need to install "lxml" for the XML parser

with ZipFile(file) as zipped_file:
    summary = zipped_file.open(r'xl/workbook.xml').read()
soup = BeautifulSoup(summary, "xml")
sheets = [sheet.get("name") for sheet in soup.find_all("sheet")]


Answer 3

Building on @dhwanil_shah’s answer, you do not need to extract the whole file. With zf.open, it is possible to read from a zipped file directly.

import xml.etree.ElementTree as ET
import zipfile

def xlsxSheets(f):
    zf = zipfile.ZipFile(f)

    f = zf.open(r'xl/workbook.xml')

    l = f.readline()
    l = f.readline()
    root = ET.fromstring(l)
    sheets=[]
    for c in root.findall('{http://schemas.openxmlformats.org/spreadsheetml/2006/main}sheets/*'):
        sheets.append(c.attrib['name'])
    return sheets

The two consecutive readlines are ugly, but the content is only in the second line of the text. No need to parse the whole file.

This solution seems to be much faster than the read_excel version, and most likely also faster than the full extract version.
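
If relying on the XML sitting on the second line feels too fragile, a variant that parses the (small) workbook.xml in full can be sketched as follows. This is only a sketch: the function name xlsx_sheet_names is made up here, and it assumes the same SpreadsheetML namespace as the code above. The demo builds a tiny stand-in archive in memory instead of using a real workbook.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Same namespace as in the answer above.
MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

def xlsx_sheet_names(path_or_file):
    """List sheet names by parsing only xl/workbook.xml inside the archive."""
    with zipfile.ZipFile(path_or_file) as zf:
        with zf.open("xl/workbook.xml") as workbook_xml:
            root = ET.parse(workbook_xml).getroot()
    return [sheet.attrib["name"] for sheet in root.iter(MAIN_NS + "sheet")]

# Demo on a fake, in-memory archive that contains only workbook.xml.
fake = io.BytesIO()
with zipfile.ZipFile(fake, "w") as zf:
    zf.writestr(
        "xl/workbook.xml",
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
        '<sheets><sheet name="Data 1" sheetId="1"/><sheet name="foo" sheetId="2"/></sheets>'
        '</workbook>',
    )
fake.seek(0)
print(xlsx_sheet_names(fake))  # ['Data 1', 'foo']
```

Only workbook.xml is decompressed, so the worksheet data is still never read.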


Answer 4

I have tried xlrd, pandas, openpyxl and other such libraries, and all of them seem to take exponential time as the file size increases, because they read the entire file. The other solutions mentioned above that use ‘on_demand’ did not work for me. If you just want to get the sheet names initially, the following function works for xlsx files.

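import os
import shutil
import zipfile

import xmltodict  # third-party package

# `settings` below is presumably Django's settings object
# (from django.conf import settings); any writable directory would do.

def get_sheet_details(file_path):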
    sheets = []
    file_name = os.path.splitext(os.path.split(file_path)[-1])[0]
    # Make a temporary directory with the file name
    directory_to_extract_to = os.path.join(settings.MEDIA_ROOT, file_name)
    os.mkdir(directory_to_extract_to)

    # Extract the xlsx file as it is just a zip file
    zip_ref = zipfile.ZipFile(file_path, 'r')
    zip_ref.extractall(directory_to_extract_to)
    zip_ref.close()

    # Open the workbook.xml which is very light and only has meta data, get sheets from it
    path_to_workbook = os.path.join(directory_to_extract_to, 'xl', 'workbook.xml')
    with open(path_to_workbook, 'r') as f:
        xml = f.read()
        dictionary = xmltodict.parse(xml)
        for sheet in dictionary['workbook']['sheets']['sheet']:
            sheet_details = {
                'id': sheet['@sheetId'],
                'name': sheet['@name']
            }
            sheets.append(sheet_details)

    # Delete the extracted files directory
    shutil.rmtree(directory_to_extract_to)
    return sheets

Since every xlsx file is basically a zip archive, we extract the underlying XML data and read the sheet names directly from the workbook, which takes a fraction of a second compared to the library functions.

Benchmarking (on a 6 MB xlsx file with 4 sheets):
Pandas, xlrd: 12 seconds
openpyxl: 24 seconds
Proposed method: 0.4 seconds

Since my requirement was just reading the sheet names, the unnecessary overhead of reading the entire file was bugging me, so I took this route instead.


Answer 5

from openpyxl import load_workbook

sheets = load_workbook(excel_file, read_only=True).sheetnames

For a 5MB Excel file I’m working with, load_workbook without the read_only flag took 8.24s. With the read_only flag it only took 39.6 ms. If you still want to use an Excel library and not drop to an xml solution, that’s much faster than the methods that parse the whole file.


Create a .csv file with values from a Python list

Question: Create a .csv file with values from a Python list

I am trying to create a .csv file with the values from a Python list. When I print the values in the list they are all unicode (?), i.e. they look something like this

[u'value 1', u'value 2', ...]

If I iterate through the values in the list i.e. for v in mylist: print v they appear to be plain text.

And I can put a , between each with print ','.join(mylist)

And I can output to a file, i.e.

myfile = open(...)
print >>myfile, ','.join(mylist)

But I want to output to a CSV and have delimiters around the values in the list e.g.

"value 1", "value 2", ... 

I can’t find an easy way to include the delimiters in the formatting, e.g. I have tried using the join statement. How can I do this?


Answer 0

import csv

with open(..., 'wb') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    wr.writerow(mylist)

Edit: this only works with Python 2.x.

To make it work with Python 3.x, replace wb with w and pass newline='' (see this SO answer):

with open(..., 'w', newline='') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    wr.writerow(mylist)
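
To see exactly what QUOTE_ALL produces without touching the disk, the same writer can be pointed at an in-memory buffer. This is a small Python 3 sketch; the list contents are taken from the question and the buffer is only there for illustration.

```python
import csv
import io

mylist = ["value 1", "value 2", "value 3"]

# An in-memory text buffer stands in for the open file.
buffer = io.StringIO()
wr = csv.writer(buffer, quoting=csv.QUOTE_ALL)
wr.writerow(mylist)

print(repr(buffer.getvalue()))
# '"value 1","value 2","value 3"\r\n'
```

The trailing \r\n is the csv module's default line terminator, which is why the file should be opened with newline='' on Python 3.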

Answer 1

Here is a secure version of Alex Martelli’s:

import csv

with open('filename', 'wb') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    wr.writerow(mylist)

Answer 2

As another approach, you can use a pandas DataFrame, which can easily dump the data to CSV, as in the code below:

import pandas
df = pandas.DataFrame(data={"col1": list_1, "col2": list_2})
df.to_csv("./file.csv", sep=',',index=False)

Answer 3

The best option I’ve found was using the savetxt from the numpy module:

import numpy as np
np.savetxt("file_name.csv", data1, delimiter=",", fmt='%s', header=header)

In case you have multiple lists that need to be stacked

np.savetxt("file_name.csv", np.column_stack((data1, data2)), delimiter=",", fmt='%s', header=header)

Answer 4

Use Python’s csv module for reading and writing comma or tab-delimited files. The csv module is preferred because it gives you good control over quoting.

For example, here is a worked example:

import csv
data = ["value %d" % i for i in range(1,4)]

# newline="" is the Python 3 form; the with block closes the file
with open("myfile.csv", "w", newline="") as f:
    out = csv.writer(f, delimiter=',', quoting=csv.QUOTE_ALL)
    out.writerow(data)

Produces:

"value 1","value 2","value 3"

Answer 5

You could use the string.join method in this case.

Split over a few lines for clarity, here is an interactive session:

>>> a = ['a','b','c']
>>> first = '", "'.join(a)
>>> second = '"%s"' % first
>>> print second
"a", "b", "c"

Or as a single line

>>> print ('"%s"') % '", "'.join(a)
"a", "b", "c"

However, you may have a problem if your strings contain embedded quotes. If this is the case, you’ll need to decide how to escape them.

The csv module can take care of all of this for you, allowing you to choose between various quoting options (all fields, only fields with quotes and separators, only non-numeric fields, etc.) and how to escape control characters (double quotes, or escaped strings). If your values are simple, string.join will probably be OK, but if you have to manage lots of edge cases, use the module available.
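
The embedded-quote problem is easy to see side by side. The sketch below (Python 3, with made-up sample values) builds the line once with join and once with the csv module, then reads the csv module's output back to show that it round-trips.

```python
import csv
import io

values = ['say "hi"', "a,b", "plain"]  # made-up values with a quote and a comma

# join-based approach: the embedded quotes are left as they are
naive = '"%s"' % '", "'.join(values)

# csv module: embedded quotes are doubled automatically
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow(values)
quoted = buffer.getvalue()

print(naive)         # "say "hi"", "a,b", "plain"
print(repr(quoted))  # '"say ""hi""","a,b","plain"\r\n'

# Reading the csv module's output back gives the original values again.
print(next(csv.reader(io.StringIO(quoted))) == values)  # True
```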


Answer 6

This solution sounds crazy, but works smooth as honey:

import csv

with open('filename', 'wb') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL,delimiter='\n')
    wr.writerow(mylist)

The file is written by csv.writer, so the csv quoting rules still apply. Setting the delimiter to a newline does the main work: it moves each list item onto its own line.
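
A more conventional way to get one value per line, as far as I can tell, is to leave the delimiter alone and write each item as its own single-field row. This is a Python 3 sketch using an in-memory buffer instead of 'filename'.

```python
import csv
import io

mylist = ["value 1", "value 2", "value 3"]

buffer = io.StringIO()
wr = csv.writer(buffer, quoting=csv.QUOTE_ALL)
# Each item becomes a one-field row, so every value lands on its own line.
wr.writerows([value] for value in mylist)

print(repr(buffer.getvalue()))
# '"value 1"\r\n"value 2"\r\n"value 3"\r\n'
```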


Answer 7

To create and write into a csv file

The example below demonstrates creating and writing a csv file. First import the csv package with import csv, then open the file and keep a reference to it, e.g. with open("D:\\sample.csv","w",newline="") as file_writer.

If the file does not exist at the given path, Python creates it. "w" stands for write; use "r" to read a file or "a" to append to an existing one. newline="" prevents an extra empty row from being written after every row. Next, create the field names (column names) as a list, such as fields=["Names","Age","Class"], and pass them to the writer instance: writer=csv.DictWriter(file_writer,fieldnames=fields). writer.writeheader() writes the column names to the csv, and writer.writerow({"Names":"John","Age":21,"Class":"12A"}) writes a row of values. Values must be passed as a dictionary in which each key is a column name and each value is the entry for that column.

import csv

with open("D:\\sample.csv","w",newline="") as file_writer:
    fields=["Names","Age","Class"]
    writer=csv.DictWriter(file_writer,fieldnames=fields)
    writer.writeheader()
    writer.writerow({"Names":"John","Age":21,"Class":"12A"})
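
For more than one row, DictWriter also has writerows. Here is a short sketch writing to an in-memory buffer instead of D:\sample.csv; the second row is a made-up example value.

```python
import csv
import io

fields = ["Names", "Age", "Class"]
rows = [
    {"Names": "John", "Age": 21, "Class": "12A"},
    {"Names": "Mary", "Age": 22, "Class": "12B"},  # hypothetical extra row
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writeheader()    # column names first
writer.writerows(rows)  # then one line per dictionary

print(repr(buffer.getvalue()))
# 'Names,Age,Class\r\nJohn,21,12A\r\nMary,22,12B\r\n'
```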

Answer 8

Jupyter notebook

Let’s say that your list is A

Then you can run the following code and you will have it as a csv file (columns only!)

R="\n".join(A)
f = open('Columns.csv','w')
f.write(R)
f.close()

Answer 9

You should definitely use the csv module, but chances are you need to write unicode. For those who need to write unicode, these are the classes from the examples page (Python 2 code), which you can use as a util module:

import csv, codecs, cStringIO

class UTF8Recoder:
    """
    Iterator that reads an encoded stream and reencodes the input to UTF-8
    """
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
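
The classes above target Python 2 (cStringIO, unicode). On Python 3 the csv module works with str directly, so the recoding layer should not be needed: opening the file with an explicit encoding is enough. A minimal sketch follows, with a temporary file and made-up sample rows standing in for real data.

```python
import csv
import os
import tempfile

rows = [["名前", "café"], ["naïve", "Ünïcödé"]]  # made-up non-ASCII sample data

# A temporary file stands in for the real output path.
path = os.path.join(tempfile.mkdtemp(), "unicode.csv")

# The encoding is chosen when the file is opened; csv just writes str values.
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f, quoting=csv.QUOTE_ALL).writerows(rows)

# Read the file back with the same encoding to check the round trip.
with open(path, newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f))

print(read_back == rows)  # True
```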

Answer 10

Here is another solution that does not require the csv module.

print ', '.join(['"'+i+'"' for i in myList])

Example :

>>> myList = [u'value 1', u'value 2', u'value 3']
>>> print ', '.join(['"'+i+'"' for i in myList])
"value 1", "value 2", "value 3"

However, if the initial list contains any " characters, they will not be escaped. If that is required, you can call a function to escape them, like this:

print ', '.join(['"'+myFunction(i)+'"' for i in myList])
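
myFunction is left undefined above. One possible version doubles each embedded quote, which is the escaping convention the csv module itself uses by default. The name escape_quotes and the sample values are made up for this Python 3 sketch.

```python
def escape_quotes(value):
    """Double every embedded double quote (the usual CSV convention)."""
    return value.replace('"', '""')

my_list = ['value 1', 'say "hi"', 'value 3']  # made-up sample values

line = ', '.join('"' + escape_quotes(v) + '"' for v in my_list)
print(line)
# "value 1", "say ""hi""", "value 3"
```

Note that the space after each comma is kept from the original answer; some CSV readers only treat a quote as a field delimiter when it directly follows the comma.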