Tag Archives: pandas

Add x and y labels to a pandas plot

Question: Add x and y labels to a pandas plot

Suppose I have the following code that plots something very simple using pandas:

import pandas as pd
values = [[1, 2], [2, 5]]
df2 = pd.DataFrame(values, columns=['Type A', 'Type B'], 
                   index=['Index 1', 'Index 2'])
df2.plot(lw=2, colormap='jet', marker='.', markersize=10, 
         title='Video streaming dropout by category')

How do I easily set x- and y-labels while preserving my ability to use specific colormaps? I noticed that the plot() wrapper for pandas DataFrames doesn’t take any parameters specific to that.


Answer 0

The df.plot() function returns a matplotlib.axes.AxesSubplot object. You can set the labels on that object.

ax = df2.plot(lw=2, colormap='jet', marker='.', markersize=10, title='Video streaming dropout by category')
ax.set_xlabel("x label")
ax.set_ylabel("y label")

Or, more succinctly: ax.set(xlabel="x label", ylabel="y label").

Alternatively, the x-axis label is automatically set to the index name, if it has one, so df2.index.name = 'x label' would work too.
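
For example, a minimal sketch of the index-name approach, reusing the df2 from the question:

df2.index.name = 'x label'   # pandas will pick this up as the x-axis label
ax = df2.plot(lw=2, colormap='jet', marker='.', markersize=10,
              title='Video streaming dropout by category')
ax.set_ylabel("y label")     # the y label still has to be set explicitly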


Answer 1

You can do it like this:

import matplotlib.pyplot as plt 
import pandas as pd

plt.figure()
values = [[1, 2], [2, 5]]
df2 = pd.DataFrame(values, columns=['Type A', 'Type B'], 
                   index=['Index 1', 'Index 2'])
df2.plot(lw=2, colormap='jet', marker='.', markersize=10,
         title='Video streaming dropout by category')
plt.xlabel('xlabel')
plt.ylabel('ylabel')
plt.show()

Obviously you have to replace the strings ‘xlabel’ and ‘ylabel’ with what you want them to be.


Answer 2

If you label the columns and index of your DataFrame, pandas will automatically supply appropriate labels:

import pandas as pd
values = [[1, 2], [2, 5]]
df = pd.DataFrame(values, columns=['Type A', 'Type B'], 
                  index=['Index 1', 'Index 2'])
df.columns.name = 'Type'
df.index.name = 'Index'
df.plot(lw=2, colormap='jet', marker='.', markersize=10, 
        title='Video streaming dropout by category')

In this case, you’ll still need to supply y-labels manually (e.g., via plt.ylabel as shown in the other answers).


Answer 3

It is possible to set both labels together with the ax.set function. Here is an example:

import pandas as pd
import matplotlib.pyplot as plt
values = [[1,2], [2,5]]
df2 = pd.DataFrame(values, columns=['Type A', 'Type B'], index=['Index 1','Index 2'])
ax = df2.plot(lw=2,colormap='jet',marker='.',markersize=10,title='Video streaming dropout by category')
# set labels for both axes
ax.set(xlabel='x axis', ylabel='y axis')
plt.show()


Answer 4

For cases where you use pandas.DataFrame.hist:

axes = df.hist(column='Column_A', bins=10)

Note that you get an ARRAY of Axes, rather than a single plot. Thus, to set the x label you will need to do something like this:

axes[0][0].set_xlabel("column A")

Answer 5

What about …

import pandas as pd
import matplotlib.pyplot as plt

values = [[1,2], [2,5]]

df2 = pd.DataFrame(values, columns=['Type A', 'Type B'], index=['Index 1','Index 2'])

(df2.plot(lw=2,
          colormap='jet',
          marker='.',
          markersize=10,
          title='Video streaming dropout by category')
    .set(xlabel='x axis',
         ylabel='y axis'))

plt.show()

Answer 6

pandas uses matplotlib for basic DataFrame plots, so if you are using pandas for basic plotting you can use matplotlib for plot customization. However, I propose an alternative method here using seaborn, which allows more customization of the plot without dropping down to the basic matplotlib level.

Working Code:

import pandas as pd
import seaborn as sns
values = [[1, 2], [2, 5]]
df2 = pd.DataFrame(values, columns=['Type A', 'Type B'], 
                   index=['Index 1', 'Index 2'])
ax = sns.lineplot(data=df2, markers=True)
ax.set(xlabel='xlabel', ylabel='ylabel', title='Video streaming dropout by category') 


How do I read a large csv file with pandas?

Question: How do I read a large csv file with pandas?

I am trying to read a large csv file (approx. 6 GB) in pandas and I am getting a memory error:

MemoryError                               Traceback (most recent call last)
<ipython-input-58-67a72687871b> in <module>()
----> 1 data=pd.read_csv('aphro.csv',sep=';')

...

MemoryError: 

Any help on this?


Answer 0

The error shows that the machine does not have enough memory to read the entire CSV into a DataFrame at one time. Assuming you do not need the entire dataset in memory all at one time, one way to avoid the problem would be to process the CSV in chunks (by specifying the chunksize parameter):

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)

The chunksize parameter specifies the number of rows per chunk. (The last chunk may contain fewer than chunksize rows, of course.)
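
For instance, a hedged sketch of what process might look like, aggregating each chunk and combining the partial results at the end (the column name value_col is an assumption, not from the question):

import pandas as pd

chunksize = 10 ** 6
partial_sums = []
for chunk in pd.read_csv('aphro.csv', sep=';', chunksize=chunksize):
    # keep only the per-chunk aggregate, not the raw rows
    partial_sums.append(chunk['value_col'].sum())

total = sum(partial_sums)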


Answer 1

Chunking shouldn’t always be the first port of call for this problem.

  1. Is the file large due to repeated non-numeric data or unwanted columns?

    If so, you can sometimes see massive memory savings by reading columns in as categories and selecting only the required columns via the pd.read_csv usecols parameter (see the sketch after this list).

  2. Does your workflow require slicing, manipulating, exporting?

    If so, you can use dask.dataframe to slice, perform your calculations and export iteratively. Chunking is performed silently by dask, which also supports a subset of the pandas API.

  3. If all else fails, read line by line via chunks.

    Chunk via pandas or via the csv library as a last resort.
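
As a sketch of point 1 above (the column names city and rainfall are hypothetical):

import pandas as pd

# 'city' is low-cardinality text, so reading it as a category
# can shrink memory use dramatically
df = pd.read_csv('aphro.csv', sep=';',
                 usecols=['city', 'rainfall'],   # load only the columns you need
                 dtype={'city': 'category'})
print(df.memory_usage(deep=True))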


Answer 2

I proceeded like this:

chunks=pd.read_table('aphro.csv',chunksize=1000000,sep=';',\
       names=['lat','long','rf','date','slno'],index_col='slno',\
       header=None,parse_dates=['date'])

df=pd.DataFrame()
%time df=pd.concat(chunk.groupby(['lat','long',chunk['date'].map(lambda x: x.year)])['rf'].agg(['sum']) for chunk in chunks)

Answer 3

For large data I recommend you use the library “dask”,
e.g.:

# Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv('s3://.../2018-*-*.csv')

You can read more from the documentation here.

Another great alternative would be to use modin, because all the functionality is identical to pandas yet it leverages distributed dataframe libraries such as dask.
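
A minimal sketch of the dask workflow, assuming a hypothetical numeric column x: every step is lazy until .compute() is called, and dask reads the file in chunks behind the scenes.

import dask.dataframe as dd

df = dd.read_csv('large.csv')          # lazy: builds a task graph only
result = df[df['x'] > 0]['x'].mean()   # still lazy
print(result.compute())                # chunked reading happens here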


Answer 4

The above answers already cover the topic. Anyway, if you need all the data in memory, have a look at bcolz. It compresses the data in memory. I have had a really good experience with it, but it is missing a lot of pandas features.

Edit: I got compression rates of around 1/10 of the original size, I think, though of course this depends on the kind of data. Important missing features were aggregates.


Answer 5

You can read in the data in chunks and save each chunk as a pickle.

import pandas as pd 
import pickle

in_path = "" #Path where the large file is
out_path = "" #Path to save the pickle files to
chunk_size = 400000 #size of chunks relies on your available memory
separator = "~"

reader = pd.read_csv(in_path,sep=separator,chunksize=chunk_size, 
                    low_memory=False)    


for i, chunk in enumerate(reader):
    out_file = out_path + "/data_{}.pkl".format(i+1)
    with open(out_file, "wb") as f:
        pickle.dump(chunk,f,pickle.HIGHEST_PROTOCOL)

In the next step you read the pickles back in and concatenate them into your desired dataframe.

import glob
import pandas as pd

pickle_path = "" #Same Path as out_path i.e. where the pickle files are

data_p_files = []
for name in glob.glob(pickle_path + "/data_*.pkl"):
    data_p_files.append(name)


# DataFrame.append was removed in pandas 2.0; concatenate the pickles instead
df = pd.concat([pd.read_pickle(f) for f in data_p_files], ignore_index=True)

Answer 6

The functions read_csv and read_table are almost the same, but you must specify the delimiter “,” when you use read_table in your program, because read_table defaults to a tab separator.

def get_from_action_data(fname, chunk_size=100000):
    reader = pd.read_csv(fname, header=0, iterator=True)
    chunks = []
    loop = True
    while loop:
        try:
            chunk = reader.get_chunk(chunk_size)[["user_id", "type"]]
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped")

    df_ac = pd.concat(chunks, ignore_index=True)
    return df_ac
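
To illustrate the delimiter point, a small sketch (the filename is a placeholder): read_table defaults to sep='\t', while read_csv defaults to sep=','.

import pandas as pd

# for a comma-separated file, read_table needs the separator spelled out
reader = pd.read_table('file.csv', sep=',', chunksize=100000)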

Answer 7

Solution 1:

Using pandas with large data

Solution 2:

TextFileReader = pd.read_csv(path, chunksize=1000)  # the number of rows per chunk

dfList = []
for df in TextFileReader:
    dfList.append(df)

df = pd.concat(dfList,sort=False)

Answer 8

Here follows an example:

import pandas as pd

chunkTemp = []
queryTemp = []
query = pd.DataFrame()

for chunk in pd.read_csv(file, header=0, chunksize=<your_chunksize>, iterator=True, low_memory=False):

    #REPLACING BLANK SPACES AT COLUMNS' NAMES FOR SQL OPTIMIZATION
    chunk = chunk.rename(columns = {c: c.replace(' ', '') for c in chunk.columns})

    #YOU CAN EITHER: 
    #1)BUFFER THE CHUNKS IN ORDER TO LOAD YOUR WHOLE DATASET 
    chunkTemp.append(chunk)

    #2)DO YOUR PROCESSING OVER A CHUNK AND STORE THE RESULT OF IT
    query = chunk[chunk[<column_name>].str.startswith(<some_pattern>)]   
    #BUFFERING PROCESSED DATA
    queryTemp.append(query)

#!  NEVER DO pd.concat OR pd.DataFrame() INSIDE A LOOP
print("Database: CONCATENATING CHUNKS INTO A SINGLE DATAFRAME")
chunk = pd.concat(chunkTemp)
print("Database: LOADED")

#CONCATENATING PROCESSED DATA
query = pd.concat(queryTemp)
print(query)

Answer 9

You can try sframe, which has the same syntax as pandas but allows you to manipulate files that are bigger than your RAM.


Answer 10

If you use pandas to read a large file in chunks and then yield it row by row, here is what I have done:

import pandas as pd

def chunck_generator(filename, header=False, chunk_size=10 ** 5):
    for chunk in pd.read_csv(filename, delimiter=',', iterator=True, chunksize=chunk_size, parse_dates=[1]):
        yield chunk

def _generator(filename, header=False, chunk_size=10 ** 5):
    # walk the chunks and yield the individual rows
    for chunk in chunck_generator(filename, header=header, chunk_size=chunk_size):
        for row in chunk.itertuples():
            yield row

if __name__ == "__main__":
    filename = r'file.csv'
    generator = _generator(filename=filename)
    for row in generator:
        print(row)

Answer 11

I want to make a more comprehensive answer based on most of the potential solutions that are already provided. I also want to point out one more potential aid that may help the reading process.

Option 1: dtypes

“dtypes” is a pretty powerful parameter that you can use to reduce the memory pressure of read methods. See this and this answer. Pandas, by default, tries to infer the dtypes of the data.

Referring to data structures, for every datum stored, a memory allocation takes place. At a basic level, refer to the values below (the table illustrates values for the C programming language):

The maximum value of UNSIGNED CHAR = 255                                    
The minimum value of SHORT INT = -32768                                     
The maximum value of SHORT INT = 32767                                      
The minimum value of INT = -2147483648                                      
The maximum value of INT = 2147483647                                       
The minimum value of CHAR = -128                                            
The maximum value of CHAR = 127                                             
The minimum value of LONG = -9223372036854775808                            
The maximum value of LONG = 9223372036854775807

Refer to this page to see the matching between NumPy and C types.

Let’s say you have an array of integers made up of single digits. You could, both theoretically and practically, assign it, say, a 16-bit integer type, but you would then allocate more memory than you actually need to store that array. To prevent this, you can set the dtype option on read_csv. You do not want to store the array items as long integers when you can actually fit them into an 8-bit integer (np.int8 or np.uint8).

Observe the dtype map at the source below (the image itself did not survive):

Source: https://pbpython.com/pandas_dtypes.html

You can pass the dtype parameter to pandas read methods as a dict of the form {column: type}:

import numpy as np
import pandas as pd

df_dtype = {
        "column_1": int,
        "column_2": str,
        "column_3": np.int16,
        "column_4": np.uint8,
        ...
        "column_n": np.float32
}

df = pd.read_csv('path/to/file', dtype=df_dtype)

Option 2: Read by Chunks

Reading the data in chunks lets you hold only part of the data in memory at a time, so you can apply preprocessing to each chunk and keep the processed data rather than the raw data. It’d be much better if you combine this option with the first one, dtypes; a sketch follows below.

I want to point out the pandas cookbook sections for that process; you can find them here. Note the two relevant sections there.
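
A sketch combining both options, chunked reading plus explicit dtypes (the file path and column names reuse the placeholders from the dtype example above):

import numpy as np
import pandas as pd

processed = []
for chunk in pd.read_csv('path/to/file',
                         dtype={'column_3': np.int16, 'column_4': np.uint8},
                         chunksize=10 ** 6):
    # keep only the preprocessed result, not the raw chunk
    processed.append(chunk[chunk['column_3'] > 0])

df = pd.concat(processed, ignore_index=True)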

Option 3: Dask

Dask is a framework that is defined on Dask’s website as:

Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love

It was born to cover the necessary parts that pandas cannot reach. Dask is a powerful framework that gives you access to much more data by processing it in a distributed way.

You can use dask to preprocess your data as a whole; Dask takes care of the chunking part, so unlike with pandas you can just define your processing steps and let Dask do the work. Dask does not apply the computations until they are explicitly pushed by compute and/or persist (see the answer here for the difference); a sketch follows below.
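
A small sketch of that lazy behaviour (the file pattern and column name are hypothetical):

import dask.dataframe as dd

ddf = dd.read_csv('data-*.csv')        # lazy
ddf = ddf[ddf['col'] > 0]              # still lazy: just extends the task graph
ddf = ddf.persist()                    # materialize the intermediate result in memory
print(ddf['col'].mean().compute())     # pull a concrete value back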

Other Aids (Ideas)

  • An ETL flow designed for the data, keeping only what is needed from the raw data.
    • First, apply ETL to the whole data with frameworks like Dask or PySpark, and export the processed data.
    • Then see if the processed data can fit in memory as a whole.
  • Consider increasing your RAM.
  • Consider working with that data on a cloud platform.

Answer 12

In addition to the answers above, for those who want to process CSV and then export to csv, parquet or SQL, d6tstack is another good option. You can load multiple files and it deals with data schema changes (added/removed columns). Chunked out-of-core support is already built in.

import glob
import d6tstack

def apply(dfg):
    # do stuff
    return dfg

c = d6tstack.combine_csv.CombinerCSV(['bigfile.csv'], apply_after_read=apply, sep=',', chunksize=1e6)

# or
c = d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), apply_after_read=apply, chunksize=1e6)

# output to various formats, automatically chunked to reduce memory consumption
c.to_csv_combine(filename='out.csv')
c.to_parquet_combine(filename='out.pq')
c.to_psql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # fast for postgres
c.to_mysql_combine('mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename') # fast for mysql
c.to_sql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # slow but flexible

Answer 13

In case someone is still looking for something like this, I found that this new library called modin can help. It uses distributed computing that can help with the read. Here’s a nice article comparing its functionality with pandas. It essentially uses the same functions as pandas.

import modin.pandas as pd
pd.read_csv(CSV_FILE_NAME)

Answer 14

Before using the chunksize option, if you want to be sure about the process function that you intend to write inside the chunking for-loop (as mentioned by @unutbu), you can simply use the nrows option.

small_df = pd.read_csv(filename, nrows=100)

Once you are sure that the process block is ready, you can put that in the chunking for loop for the entire dataframe.


Pandas three-way joining multiple dataframes on columns

Question: Pandas three-way joining multiple dataframes on columns

I have 3 CSV files. Each has the first column as the (string) names of people, while all the other columns in each dataframe are attributes of that person.

How can I “join” together all three CSV documents to create a single CSV with each row having all the attributes for each unique value of the person’s string name?

The join() function in pandas specifies that I need a multiindex, but I’m confused about what a hierarchical indexing scheme has to do with making a join based on a single index.


Answer 0

Assumed imports:

import pandas as pd

John Galt’s answer is basically a reduce operation. If I have more than a handful of dataframes, I’d put them in a list like this (generated via list comprehensions or loops or whatnot):

dfs = [df0, df1, df2, dfN]

Assuming they have some common column, like name in your example, I’d do the following:

df_final = reduce(lambda left,right: pd.merge(left,right,on='name'), dfs)

That way, your code should work with whatever number of dataframes you want to merge.

Edit August 1, 2016: For those using Python 3: reduce has been moved into functools. So to use this function, you’ll first need to import that module:

from functools import reduce
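
Putting the pieces together, a self-contained sketch with three toy frames sharing a name column:

from functools import reduce
import pandas as pd

df0 = pd.DataFrame({'name': ['a', 'b'], 'x': [1, 2]})
df1 = pd.DataFrame({'name': ['a', 'b'], 'y': [3, 4]})
df2 = pd.DataFrame({'name': ['a', 'b'], 'z': [5, 6]})

dfs = [df0, df1, df2]
df_final = reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)
print(df_final)   # one row per name, with columns x, y and z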

Answer 1

You could try this if you have 3 dataframes

import numpy as np
import pandas as pd

# Merge multiple dataframes
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

pd.merge(pd.merge(df1,df2,on='name'),df3,on='name')

alternatively, as mentioned by cwharland

df1.merge(df2,on='name').merge(df3,on='name')

Answer 2

This is an ideal situation for the join method

The join method is built exactly for these types of situations. You can join any number of DataFrames together with it. The calling DataFrame joins with the index of the collection of passed DataFrames. To work with multiple DataFrames, you must put the joining columns in the index.

The code would look something like this:

filenames = ['fn1', 'fn2', 'fn3', 'fn4',....]
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames]
dfs[0].join(dfs[1:])

With @zero’s data, you could do this:

df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:])

     attr11 attr12 attr21 attr22 attr31 attr32
name                                          
a         5      9      5     19     15     49
b         4     61     14     16      4     36
c        24      9      4      9     14      9

Answer 3

This can also be done as follows for a list of dataframes df_list:

df = df_list[0]
for df_ in df_list[1:]:
    df = df.merge(df_, on='join_col_name')

or if the dataframes are in a generator object (e.g. to reduce memory consumption):

df = next(df_list)
for df_ in df_list:
    df = df.merge(df_, on='join_col_name')

Answer 4

In Python 3.6.3 with pandas 0.22.0 you can also use concat, as long as you set as the index the columns you want to use for the joining:

pd.concat(
    (iDF.set_index('name') for iDF in [df1, df2, df3]),
    axis=1, join='inner'
).reset_index()

where df1, df2, and df3 are defined as in John Galt’s answer

import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32']
)

Answer 5

One does not need a multiindex to perform join operations. One just needs to set correctly the index column on which to perform the join operations (with the command df.set_index('Name'), for example).

The join operation is by default performed on the index. In your case, you just have to specify that the Name column corresponds to your index. Below is an example.

A tutorial may be useful.

# Simple example where dataframes index are the name on which to perform
# the join operations
import pandas as pd
import numpy as np
name = ['Sophia' ,'Emma' ,'Isabella' ,'Olivia' ,'Ava' ,'Emily' ,'Abigail' ,'Mia']
df1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=name)
df2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'],         index=name)
df3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'],     index=name)
df = df1.join(df2)
df = df.join(df3)

# If you have a 'Name' column that is not the index of your dataframe,
# one can set this column to be the index
# 1) Create a column 'Name' based on the previous index
df1['Name'] = df1.index
# 2) Set the index to the column 'Name'
df1 = df1.set_index('Name')

# If indexes are different, one may have to play with parameter how
gf1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=range(8))
gf2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=range(2,10))
gf3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=range(4,12))

gf = gf1.join(gf2, how='outer')
gf = gf.join(gf3, how='outer')

Answer 6

Here is a method to merge a dictionary of data frames while keeping the column names in sync with the dictionary. Also it fills in missing values if needed:

This is the function to merge a dict of data frames

import numpy as np
import pandas as pd

def MergeDfDict(dfDict, onCols, how='outer', naFill=None):
  keys = list(dfDict.keys())  # list() is needed in Python 3, where dict.keys() is not indexable
  for i in range(len(keys)):
    key = keys[i]
    df0 = dfDict[key]
    cols = list(df0.columns)
    valueCols = list(filter(lambda x: x not in (onCols), cols))
    df0 = df0[onCols + valueCols]
    df0.columns = onCols + [(s + '_' + key) for s in valueCols]

    if (i == 0):
      outDf = df0
    else:
      outDf = pd.merge(outDf, df0, how=how, on=onCols)

  if (naFill is not None):
    outDf = outDf.fillna(naFill)

  return(outDf)

OK, let’s generate data and test this:

def GenDf(size):
  df = pd.DataFrame({'categ1':np.random.choice(a=['a', 'b', 'c', 'd', 'e'], size=size, replace=True),
                      'categ2':np.random.choice(a=['A', 'B'], size=size, replace=True), 
                      'col1':np.random.uniform(low=0.0, high=100.0, size=size), 
                      'col2':np.random.uniform(low=0.0, high=100.0, size=size)
                      })
  df = df.sort_values(['categ2', 'categ1', 'col1', 'col2'])
  return(df)


size = 5
dfDict = {'US':GenDf(size), 'IN':GenDf(size), 'GER':GenDf(size)}   
MergeDfDict(dfDict=dfDict, onCols=['categ1', 'categ2'], how='outer', naFill=0)

Answer 7

Simple Solution:

If the column names are similar:

 df1.merge(df2,on='col_name').merge(df3,on='col_name')

If the column names are different:

df1.merge(df2, left_on='col_name1', right_on='col_name2') \
   .merge(df3, left_on='col_name1', right_on='col_name3') \
   .drop(columns=['col_name2', 'col_name3']) \
   .rename(columns={'col_name1': 'col_name'})

Answer 8

There is another solution from the pandas documentation (that I don’t see here),

using the .append

>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
   A  B
0  5  6
1  7  8
>>> df.append(df2, ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8

The ignore_index=True is used to ignore the index of the appended dataframe, replacing it with the next index available in the source one.

If there are different column names, NaN will be introduced.
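
Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; the equivalent with pd.concat:

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
# same result as df.append(df2, ignore_index=True)
pd.concat([df, df2], ignore_index=True)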


Answer 9

The three dataframes are shown as screenshots in the original post, which did not survive here.

Let’s merge these frames using nested pd.merge; a sketch follows below.

Here we go, we have our merged dataframe.

Happy Analysis!!!
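
Since the screenshots are gone, here is a minimal self-contained sketch of a nested pd.merge on a shared name column (the toy frames are assumptions):

import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b'], 'attr1': [1, 2]})
df2 = pd.DataFrame({'name': ['a', 'b'], 'attr2': [3, 4]})
df3 = pd.DataFrame({'name': ['a', 'b'], 'attr3': [5, 6]})

# merge df1 with df2, then merge the result with df3
merged = pd.merge(pd.merge(df1, df2, on='name'), df3, on='name')
print(merged)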


Python Pandas: filter out NaN from a data selection of a column of strings

Question: Python Pandas: filter out NaN from a data selection of a column of strings

Without using groupby, how would I filter out data without NaN?

Let’s say I have a matrix where customers will fill in ‘N/A’, ‘n/a’ or any of its variations and others leave it blank:

import pandas as pd
import numpy as np


df = pd.DataFrame({'movie': ['thg', 'thg', 'mol', 'mol', 'lob', 'lob'],
                  'rating': [3., 4., 5., np.nan, np.nan, np.nan],
                  'name': ['John', np.nan, 'N/A', 'Graham', np.nan, np.nan]})

nbs = df['name'].str.extract('^(N/A|NA|na|n/a)')
nms=df[(df['name'] != nbs) ]

output:

>>> nms
  movie    name  rating
0   thg    John       3
1   thg     NaN       4
3   mol  Graham     NaN
4   lob     NaN     NaN
5   lob     NaN     NaN

How would I filter out NaN values so I can get results to work with like this:

  movie    name  rating
0   thg    John       3
3   mol  Graham     NaN

I am guessing I need something like ~np.isnan, but the tilde does not work with strings.


Answer 0

Just drop them:

nms.dropna(thresh=2)

this will drop all rows where there are at least two non-NaN.

Then you could then drop where name is NaN:

In [87]:

nms
Out[87]:
  movie    name  rating
0   thg    John       3
1   thg     NaN       4
3   mol  Graham     NaN
4   lob     NaN     NaN
5   lob     NaN     NaN

[5 rows x 3 columns]
In [89]:

nms = nms.dropna(thresh=2)
In [90]:

nms[nms.name.notnull()]
Out[90]:
  movie    name  rating
0   thg    John       3
3   mol  Graham     NaN

[2 rows x 3 columns]

EDIT

Actually looking at what you originally want you can do just this without the dropna call:

nms[nms.name.notnull()]

UPDATE

Looking at this question 3 years later, there is a mistake: the thresh arg looks for at least n non-NaN values, so in fact the output should be:

In [4]:
nms.dropna(thresh=2)

Out[4]:
  movie    name  rating
0   thg    John     3.0
1   thg     NaN     4.0
3   mol  Graham     NaN

It’s possible that I was either mistaken 3 years ago or that the version of pandas I was running had a bug, both scenarios are entirely possible.


Answer 1

Simplest of all solutions:

filtered_df = df[df['name'].notnull()]

Thus, it keeps only the rows that don’t have NaN values in the ‘name’ column.

For multiple columns:

filtered_df = df[df[['name', 'country', 'region']].notnull().all(1)]

Answer 2

df = pd.DataFrame({'movie': ['thg', 'thg', 'mol', 'mol', 'lob', 'lob'],'rating': [3., 4., 5., np.nan, np.nan, np.nan],'name': ['John','James', np.nan, np.nan, np.nan,np.nan]})

for col in df.columns:
    df = df[~pd.isnull(df[col])]

Answer 3

df.dropna(subset=['columnName1', 'columnName2'])

How to iterate over columns of a pandas dataframe to run a regression

Question: How to iterate over columns of a pandas dataframe to run a regression

I’m sure this is simple, but as a complete newbie to python, I’m having trouble figuring out how to iterate over variables in a pandas dataframe and run a regression with each.

Here’s what I’m doing:

all_data = {}
for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')

prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.iteritems()})  
returns = prices.pct_change()

I know I can run a regression like this:

regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit()

but suppose I want to do this for each column in the dataframe. In particular, I want to regress FIUIX on FSTMX, and then FSAIX on FSTMX, and then FSAVX on FSTMX. After each regression I want to store the residuals.

I’ve tried various versions of the following, but I must be getting the syntax wrong:

resids = {}
for k in returns.keys():
    reg = sm.OLS(returns[k],returns.FSTMX).fit()
    resids[k] = reg.resid

I think the problem is I don’t know how to refer to the returns column by key, so returns[k] is probably wrong.

Any guidance on the best way to do this would be much appreciated. Perhaps there’s a common pandas approach I’m missing.


Answer 0

for column in df:
    print(df[column])

Answer 1

You can use iteritems():

for name, values in df.iteritems():
    print('{name}: {value}'.format(name=name, value=values[0]))
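
Note that iteritems was removed in pandas 2.0; the equivalent with items():

for name, values in df.items():
    print('{name}: {value}'.format(name=name, value=values[0]))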

Answer 2

This answer is to iterate over selected columns as well as all columns in a DF.

df.columns gives an Index containing all the column names in the DF. That isn’t very helpful if you want to iterate over all the columns, but it comes in handy when you want to iterate over columns of your choosing only.

We can easily use Python’s list slicing to slice df.columns according to our needs. For example, to iterate over all columns but the first one, we can do:

for column in df.columns[1:]:
    print(df[column])

Similarly to iterate over all the columns in reversed order, we can do:

for column in df.columns[::-1]:
    print(df[column])

We can iterate over all the columns in a lot of cool ways using this technique. Also remember that you can get the indices of all columns easily using:

for ind, column in enumerate(df.columns):
    print(ind, column)

Answer 3

You can index dataframe columns by position using iloc (the older ix indexer was removed in pandas 1.0).

df1.iloc[:, 0]

This returns the first column, for example (0 is the position index).

df1.iloc[0, :]

This returns the first row.

df1.iloc[0, 1]

This would be the value at the intersection of row 0 and column 1.

And so on. So you can enumerate() returns.keys() and use the number to index the dataframe.


Answer 4

A workaround is to transpose the DataFrame and iterate over the rows.

for column_name, column in df.transpose().iterrows():
    print(column_name)

Answer 5

Using a list comprehension, you can get all the column names (headers):

[column for column in df]


Answer 6

Based on the accepted answer, if an index corresponding to each column is also desired:

for i, column in enumerate(df):
    print(i, df[column])

The above df[column] type is Series, which can simply be converted into numpy ndarrays:

for i, column in enumerate(df):
    print(i, np.asarray(df[column]))

Answer 7

I’m a bit late but here’s how I did this. The steps:

  1. Create a list of all columns
  2. Use itertools to take x combinations
  3. Append each result R squared value to a result dataframe along with excluded column list
  4. Sort the result DF in descending order of R squared to see which is the best fit.

This is the code I used on a DataFrame called aft_tmt. Feel free to extrapolate it to your use case.

import pandas as pd
# setting options to print without truncating output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

import statsmodels.formula.api as smf
import itertools

# This section gets the column names of the DF and removes some columns which I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
itercols.remove("sc97")
itercols.remove("sc")
itercols.remove("grc")
itercols.remove("grc97")
print(itercols)
len(itercols)

# results DF
regression_res = pd.DataFrame(columns = ["Rsq", "predictors", "excluded"])

# excluded cols
exc = []

# change 9 to the number of columns you want to combine from N columns.
#Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
    lmstr = "+".join(x)
    m = smf.ols(formula = "sc ~ " + lmstr, data = aft_tmt)
    f = m.fit()
    exc = "+".join([y for y in itercols if y not in x])  # predictors excluded from this combination
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    regression_res = pd.concat([regression_res, pd.DataFrame([[f.rsquared, lmstr, exc]], columns = ["Rsq", "predictors", "excluded"])])

regression_res.sort_values(by="Rsq", ascending = False)

How to insert a column at a specific column index in Pandas?

Question: How to insert a column at a specific column index in Pandas?

Can I insert a column at a specific column index in pandas?

import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0

This will put column n as the last column of df, but isn’t there a way to tell df to put n at the beginning?


Answer 0

see docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.insert.html

using loc = 0 will insert at the beginning

df.insert(loc, column, value)

df = pd.DataFrame({'B': [1, 2, 3], 'C': [4, 5, 6]})

df
Out: 
   B  C
0  1  4
1  2  5
2  3  6

idx = 0
new_col = [7, 8, 9]  # can be a list, a Series, an array or a scalar   
df.insert(loc=idx, column='A', value=new_col)

df
Out: 
   A  B  C
0  7  1  4
1  8  2  5
2  9  3  6

Answer 1

You could try extracting the columns as a list, massaging this as you want, and reindexing your dataframe:

>>> cols = df.columns.tolist()
>>> cols = [cols[-1]]+cols[:-1] # or whatever change you need
>>> df.reindex(columns=cols)

   n  l  v
0  0  a  1
1  0  b  2
2  0  c  1
3  0  d  2

EDIT: this can be done in one line; however, this looks a bit ugly. Maybe a cleaner proposal will come…

>>> df.reindex(columns=['n']+df.columns[:-1].tolist())

   n  l  v
0  0  a  1
1  0  b  2
2  0  c  1
3  0  d  2

Answer 2

If you want a single value for all rows:

df.insert(0,'name_of_column','')
df['name_of_column'] = value

Edit:

You can also:

df.insert(0,'name_of_column',value)

Answer 3

Here is a very simple answer to this (only one line).

You can do that after you have added the ‘n’ column into your df, as follows.

import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0

df
    l   v   n
0   a   1   0
1   b   2   0
2   c   1   0
3   d   2   0

# here you can add the below code and it should work.
df = df[list('nlv')]
df

    n   l   v
0   0   a   1
1   0   b   2
2   0   c   1
3   0   d   2



However, if your column names are words instead of single letters, you need to pass a list of the names (note the double brackets):

import pandas as pd
df = pd.DataFrame({'Upper':['a','b','c','d'], 'Lower':[1,2,1,2]})
df['Net'] = 0
df['Mid'] = 2
df['Zsore'] = 2

df

    Upper   Lower   Net Mid Zsore
0   a       1       0   2   2
1   b       2       0   2   2
2   c       1       0   2   2
3   d       2       0   2   2

# here you can add the below line and it should work
df = df[['Mid', 'Upper', 'Lower', 'Net', 'Zsore']]
df

   Mid  Upper   Lower   Net Zsore
0   2   a       1       0   2
1   2   b       2       0   2
2   2   c       1       0   2
3   2   d       2       0   2

Add column to dataframe with constant value

Question: Add column to dataframe with constant value

I have an existing dataframe to which I need to add an additional column that will contain the same value for every row.

Existing df:

Date, Open, High, Low, Close
01-01-2015, 565, 600, 400, 450

New df:

Name, Date, Open, High, Low, Close
abc, 01-01-2015, 565, 600, 400, 450

I know how to append an existing series / dataframe column. But this is a different situation, because all I need is to add the ‘Name’ column and set every row to the same value, in this case ‘abc’.


Answer 0

df['Name']='abc' will add the new column and set all rows to that value:

In [79]:

df
Out[79]:
         Date, Open, High,  Low,  Close
0  01-01-2015,  565,  600,  400,    450
In [80]:

df['Name'] = 'abc'
df
Out[80]:
         Date, Open, High,  Low,  Close Name
0  01-01-2015,  565,  600,  400,    450  abc

Answer 1

You can use insert to specify where you want the new column to be. In this case, I use 0 to place the new column at the left.

df.insert(0, 'Name', 'abc')

  Name        Date  Open  High  Low  Close
0  abc  01-01-2015   565   600  400    450

Answer 2

A one-liner works:

df['Name'] = 'abc'

This creates a Name column and sets all rows to the value abc.


Answer 3

Summing up what the others have suggested, and adding a third way

You can:

  • assign(**kwargs):

    df.assign(Name='abc')
    
  • access the new column series (it will be created) and set it:

    df['Name'] = 'abc'
    
  • insert(loc, column, value, allow_duplicates=False)

    df.insert(0, 'Name', 'abc')
    

    where the argument loc (0 <= loc <= len(columns)) allows you to insert the column where you want.

    ‘loc’ gives you the index that your column will be at after the insertion. For example, the code above inserts the column Name as the 0-th column, i.e. it will be inserted before the first column, becoming the new first column. (Indexing starts from 0).

All these methods allow you to add a new column from a Series as well (just substitute the ‘abc’ default argument above with the series).
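For instance, a minimal sketch of all three with a Series (the frame, series, and column name here are illustrative; each variant works on its own copy so the lines run independently):

import pandas as pd

df = pd.DataFrame({'B': [1, 2, 3], 'C': [4, 5, 6]})
s = pd.Series([7, 8, 9])   # aligned with df on the index

df1 = df.assign(A=s)       # returns a new DataFrame with column A added
df2 = df.copy()
df2['A'] = s               # adds column A in place
df3 = df.copy()
df3.insert(0, 'A', s)      # inserts A in place as the first column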


How do you retrieve the number of columns in a Pandas dataframe?

Question: How do you retrieve the number of columns in a Pandas dataframe?

How do you programmatically retrieve the number of columns in a pandas dataframe? I was hoping for something like:

df.num_columns

Answer 0

Like so:

import pandas as pd
df = pd.DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})

len(df.columns)
3

Answer 1

Alternative:

df.shape[1]

(df.shape[0] is the number of rows)


Answer 2

If the variable holding the dataframe is called df, then:

len(df.columns)

gives the number of columns.

And for those who want the number of rows:

len(df.index)

For a tuple containing the number of both rows and columns:

df.shape
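A quick sketch tying these together (the frame is illustrative):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.shape                                       # (3, 2)
df.shape == (len(df.index), len(df.columns))   # True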

Answer 3

This worked for me: len(list(df)) (iterating over a DataFrame yields its column labels).


Answer 4

The df.info() function will give you a result like the one below, e.g. for a CSV loaded with pandas’ read_csv using the default sep=','.

raw_data = pd.read_csv("a1:\aa2/aaa3/data.csv")
raw_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5144 entries, 0 to 5143
Columns: 145 entries, R_fighter to R_age

Answer 5

There are multiple options to get the column count and column information; let's check them.

local_df = pd.DataFrame(np.random.randint(1, 12, size=(2, 6)),
                        columns=['a', 'b', 'c', 'd', 'e', 'f'])

  1. local_df.shape[1] –> the shape attribute returns a tuple of (rows, columns), so indexing it with 1 gives the column count.

  2. local_df.info() –> the info method returns detailed information about the data frame and its columns, such as the column count, the data type of each column, the non-null value count, and the memory usage of the data frame.

  3. len(local_df.columns) –> the columns attribute returns the index object of the data frame's columns, and the len function gives the total number of columns.

  4. local_df.head(0) –> the head method with parameter 0 returns the first row of df, which is actually nothing but the header.

For loop fun (assuming there are not too many columns):

li_count = 0
for x in local_df:
    li_count = li_count + 1
print(li_count)


Pandas resample documentation

Question: Pandas resample documentation

So I completely understand how to use resample, but the documentation does not do a good job explaining the options.

So most options in the resample function are pretty straightforward except for these two:

  • rule : the offset string or object representing target conversion
  • how : string, method for down- or re-sampling, default to ‘mean’

So from looking at as many examples as I found online I can see for rule you can do 'D' for day, 'xMin' for minutes, 'xL' for milliseconds, but that is all I could find.

For how I have seen the following: 'first', np.max, 'last', 'mean', and 'n1n2n3n4...nx' where nx is the first letter of each column index.

So is there somewhere in the documentation that I am missing that displays every option for pandas.resample's rule and how inputs? If yes, where? Because I could not find it. If no, what are all the options for them?


Answer 0

B         business day frequency
C         custom business day frequency (experimental)
D         calendar day frequency
W         weekly frequency
M         month end frequency
SM        semi-month end frequency (15th and end of month)
BM        business month end frequency
CBM       custom business month end frequency
MS        month start frequency
SMS       semi-month start frequency (1st and 15th)
BMS       business month start frequency
CBMS      custom business month start frequency
Q         quarter end frequency
BQ        business quarter end frequency
QS        quarter start frequency
BQS       business quarter start frequency
A         year end frequency
BA, BY    business year end frequency
AS, YS    year start frequency
BAS, BYS  business year start frequency
BH        business hour frequency
H         hourly frequency
T, min    minutely frequency
S         secondly frequency
L, ms     milliseconds
U, us     microseconds
N         nanoseconds

See the timeseries documentation. It includes a list of offsets (and ‘anchored’ offsets), and a section about resampling.

Note that there isn’t a list of all the different how options, because how can be any NumPy array function, and any function that is available via groupby dispatching can be passed to how by name.
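For illustration, a minimal sketch (note: in recent pandas versions the how argument is deprecated, and you call the aggregation method on the resampler instead):

import numpy as np
import pandas as pd

rng = pd.date_range('2015-01-01', periods=100, freq='T')  # minutely timestamps
ts = pd.Series(np.arange(100), index=rng)

# downsample to 5-minute bins, aggregating each bin
ts.resample('5T').mean()   # equivalent to the old how='mean'
ts.resample('5T').max()    # equivalent to how=np.max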


Answer 1

There’s more to it than this, but you’re probably looking for this list:

B   business day frequency
C   custom business day frequency (experimental)
D   calendar day frequency
W   weekly frequency
M   month end frequency
BM  business month end frequency
MS  month start frequency
BMS business month start frequency
Q   quarter end frequency
BQ  business quarter end frequency
QS  quarter start frequency
BQS business quarter start frequency
A   year end frequency
BA  business year end frequency
AS  year start frequency
BAS business year start frequency
H   hourly frequency
T   minutely frequency
S   secondly frequency
L   milliseconds
U   microseconds

Source: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
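These aliases are the same strings accepted elsewhere in pandas’ time-series API, for example by pd.date_range; a quick sketch:

import pandas as pd

pd.date_range('2015-01-01', periods=4, freq='Q')   # quarter-end dates
pd.date_range('2015-01-01', periods=4, freq='BM')  # business month-end dates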


Get a list of pandas dataframe columns based on data type

Question: Get a list of pandas dataframe columns based on data type

If I have a dataframe with the following columns:

1. NAME                                     object
2. On_Time                                      object
3. On_Budget                                    object
4. %actual_hr                                  float64
5. Baseline Start Date                  datetime64[ns]
6. Forecast Start Date                  datetime64[ns] 

I would like to be able to say: here is a dataframe, give me a list of the columns which are of type Object or of type DateTime?

I have a function which converts numbers (Float64) to two decimal places, and I would like to use this list of dataframe columns, of a particular type, and run it through this function to convert them all to 2dp.

Maybe:

my_list = []
for c in df.columns:
    if df[c].dtype == "Something":
        my_list.append(c)

Answer 0

If you want a list of columns of a certain type, you can use groupby:

>>> df = pd.DataFrame([[1, 2.3456, 'c', 'd', 78]], columns=list("ABCDE"))
>>> df
   A       B  C  D   E
0  1  2.3456  c  d  78

[1 rows x 5 columns]
>>> df.dtypes
A      int64
B    float64
C     object
D     object
E      int64
dtype: object
>>> g = df.columns.to_series().groupby(df.dtypes).groups
>>> g
{dtype('int64'): ['A', 'E'], dtype('float64'): ['B'], dtype('O'): ['C', 'D']}
>>> {k.name: v for k, v in g.items()}
{'object': ['C', 'D'], 'int64': ['A', 'E'], 'float64': ['B']}

Answer 1

As of pandas v0.14.1, you can utilize select_dtypes() to select columns by dtype

In [2]: df = pd.DataFrame({'NAME': list('abcdef'),
    'On_Time': [True, False] * 3,
    'On_Budget': [False, True] * 3})

In [3]: df.select_dtypes(include=['bool'])
Out[3]:
  On_Budget On_Time
0     False    True
1      True   False
2     False    True
3      True   False
4     False    True
5      True   False

In [4]: mylist = list(df.select_dtypes(include=['bool']).columns)

In [5]: mylist
Out[5]: ['On_Budget', 'On_Time']
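select_dtypes also accepts an exclude argument; with the same df as above, NAME is the only non-bool column:

In [6]: list(df.select_dtypes(exclude=['bool']).columns)
Out[6]: ['NAME']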

Answer 2

Using dtype will give you the desired column’s data type:

dataframe['column1'].dtype

If you want to know the data types of all the columns at once, use the plural, dtypes:

dataframe.dtypes
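A minimal sketch (the frame and column names are illustrative):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5]})
df['a'].dtype   # dtype('int64')
df.dtypes       # a Series mapping each column name to its dtype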

Answer 3

You can use a boolean mask on the dtypes attribute (assuming numpy is imported as np):

In [11]: df = pd.DataFrame([[1, 2.3456, 'c']])

In [12]: df.dtypes
Out[12]: 
0      int64
1    float64
2     object
dtype: object

In [13]: msk = df.dtypes == np.float64  # or object, etc.

In [14]: msk
Out[14]: 
0    False
1     True
2    False
dtype: bool

You can look at just those columns with the desired dtype:

In [15]: df.loc[:, msk]
Out[15]: 
        1
0  2.3456

Now you can use round (or whatever) and assign it back:

In [16]: np.round(df.loc[:, msk], 2)
Out[16]: 
      1
0  2.35

In [17]: df.loc[:, msk] = np.round(df.loc[:, msk], 2)

In [18]: df
Out[18]: 
   0     1  2
0  1  2.35  c

Answer 4

list(df.select_dtypes(['object']).columns)

This should do the trick


Answer 5

Use df.info(verbose=True), where df is a pandas dataframe; by default, verbose=False.


Answer 6

The most direct way to get a list of columns of a certain dtype, e.g. ‘object’:

df.select_dtypes(include='object').columns

For example:

>>> df = pd.DataFrame([[1, 2.3456, 'c', 'd', 78]], columns=list("ABCDE"))
>>> df.dtypes

A      int64
B    float64
C     object
D     object
E      int64
dtype: object

To get all ‘object’ dtype columns:

>>> df.select_dtypes(include='object').columns

Index(['C', 'D'], dtype='object')

For just the list:

>>> list(df.select_dtypes(include='object').columns)

['C', 'D']   

Answer 7

If you want a list of only the object columns you could do:

non_numerics = [x for x in df.columns \
                if not (df[x].dtype == np.float64 \
                        or df[x].dtype == np.int64)]

and then if you want to get another list of only the numerics:

numerics = [x for x in df.columns if x not in non_numerics]
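A more robust variant of the same idea, again assuming numpy is imported as np (np.number covers all numeric dtypes, not just float64 and int64):

non_numerics = [x for x in df.columns
                if not np.issubdtype(df[x].dtype, np.number)]

numerics = [x for x in df.columns if x not in non_numerics]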

Answer 8

I came up with this three-liner.

Essentially, here’s what it does:

  1. Fetch the column names and their respective data types.
  2. Optionally, output them to a csv.

inp = pd.read_csv('filename.csv')  # read input; add read_csv arguments as needed
columns = pd.DataFrame({'column_names': inp.columns, 'datatypes': inp.dtypes})
columns.to_csv('columns_list.csv', encoding='utf-8')  # encoding is optional

This made my life much easier when trying to generate schemas on the fly. Hope this helps.


Answer 9

For yoshiserry:

def col_types(x):
    # map each column name to its dtype
    dtypes = x.dtypes
    dtypes_col = dtypes.index
    dtypes_type = dtypes.values
    column_types = dict(zip(dtypes_col, dtypes_type))
    return column_types
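Usage sketch:

df = pd.DataFrame({'l': ['a', 'b'], 'v': [1, 2]})
col_types(df)   # {'l': dtype('O'), 'v': dtype('int64')}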

Answer 10

I use infer_objects()

Docstring: Attempt to infer better dtypes for object columns.

Attempts soft conversion of object-dtyped columns, leaving non-object and unconvertible columns unchanged. The inference rules are the same as during normal Series/DataFrame construction.

df.infer_objects().dtypes
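A quick sketch, adapted from the example in the pandas docs (an object column that actually holds integers):

import pandas as pd

df = pd.DataFrame({"A": ["a", 1, 2, 3]})
df = df.iloc[1:]             # column A now holds only ints, but its dtype is still object
df.dtypes                    # A    object
df.infer_objects().dtypes    # A    int64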