from interval import interval
a = interval[1,5]
# interval([1.0, 5.0])
print(3 in a)
# True
You can also construct a multi-interval:
from interval import interval
a = interval([0, 1], [2, 3], [10, 15])
print(2.5 in a)
# True
The interval.hull method merges multiple intervals, taking the overall minimum and maximum as the new boundaries:
from interval import interval
a = interval.hull((interval[1, 3], interval[10, 15], interval[16, 2222]))
# interval([1.0, 2222.0])
print(1231 in a)
# True
Computing the union of intervals:
from interval import interval
a = interval.union([interval([1, 3], [4, 6]), interval([2, 5], 9)])
# interval([1.0, 6.0], [9.0])
print(5 in a)
# True
print(8 in a)
# False
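pyinterval also supports intersection via the & operator; a small sketch, assuming the same interval package as above:
from interval import interval
a = interval[1, 4] & interval[2, 5]
# interval([2.0, 4.0])
print(3 in a)
# True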
import pandas as pd
import time
import math
import numpy as np
from pandarallel import pandarallel
# Initialize
pandarallel.initialize()
df_size = int(3e7)
df = pd.DataFrame(dict(a=np.random.randint(1, 1000, df_size),
                       b=np.random.rand(df_size)))

def func(df):
    dum = 0
    for item in df.b:
        dum += math.log10(math.sqrt(math.exp(item**2)))
    return dum / len(df.b)
# Normal processing
res = df.groupby("a").apply(func)
# Parallel processing
res_parallel = df.groupby("a").parallel_apply(func)
res.equals(res_parallel)
Another example is DataFrame.groupby.rolling.apply:
import pandas as pd
import time
import math
import numpy as np
from pandarallel import pandarallel
# Initialize
pandarallel.initialize()
df_size = int(1e6)
df = pd.DataFrame(dict(a=np.random.randint(1, 300, df_size),
                       b=np.random.rand(df_size)))

def func(x):
    return x.iloc[0] + x.iloc[1] ** 2 + x.iloc[2] ** 3 + x.iloc[3] ** 4
# Normal processing
res = df.groupby('a').b.rolling(4).apply(func, raw=False)
# Parallel processing
res_parallel = df.groupby('a').b.rolling(4).parallel_apply(func, raw=False)
res.equals(res_parallel)
This is because of using integer indices: with an integer index, ix selects -3 by label rather than by position, and this is by design (see the integer-indexing section of the pandas "gotchas" documentation*).
*In newer versions of pandas, prefer loc or iloc, which remove the ambiguity of whether you are indexing by position or by label:
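A minimal sketch of the difference, using a hypothetical Series with a non-sequential integer index:
import pandas as pd
s = pd.Series(['a', 'b', 'c'], index=[10, -3, 7])
print(s.loc[-3])   # label-based: returns 'b' (the element labeled -3)
print(s.iloc[-3])  # position-based: returns 'a' (third element from the end)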
I read data from a .csv file to a Pandas dataframe as below. For one of the columns, namely id, I want to specify the column type as int. The problem is the id series has missing/empty values.
When I try to cast the id column to integer while reading the .csv, I get:
df= pd.read_csv("data.csv", dtype={'id': int})
error: Integer column has NA values
Alternatively, I tried to convert the column type after reading as below, but this time I get:
df= pd.read_csv("data.csv")
df[['id']] = df[['id']].astype(int)
error: Cannot convert NA to integer
Pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension type implemented within pandas. It is not the default dtype for integers and will not be inferred; you must explicitly pass the dtype into array() or Series:
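For example (a sketch; the nullable 'Int64' dtype requires pandas 0.24 or later, and data.csv is the hypothetical file from the question):
import pandas as pd
# A nullable integer array: the missing value stays missing instead of forcing float
arr = pd.array([1, 2, None], dtype='Int64')
s = pd.Series(arr)
# The same dtype can be passed directly to read_csv:
df = pd.read_csv("data.csv", dtype={'id': 'Int64'})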
If you can modify your stored data, use a sentinel value for missing id. A common use case (suggested by the column name) is that id is an integer strictly greater than zero, so you could use 0 as a sentinel value, letting you write:
if row['id']:
    regular_process(row)
else:
    special_process(row)
You could use .dropna() if it is OK to drop the rows with the NaN values.
df = df.dropna(subset=['id'])
Alternatively,
use .fillna() and .astype() to replace the NaN with values and convert them to int.
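A minimal sketch of that approach (the fill value 0 is an arbitrary choice):
df = pd.read_csv("data.csv")
df['id'] = df['id'].fillna(0).astype(int)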
I ran into this problem when processing a CSV file with large integers, while some of them were missing (NaN). Using float as the type was not an option, because I might lose precision.
My solution was to use str as the intermediate type.
Then you can convert the string to int as you please later in the code. I replaced NaN with 0, but you could choose any value.
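A minimal sketch of the str-intermediate approach (the file name and fill value are placeholders):
df = pd.read_csv("data.csv", dtype={'id': str})  # ids read as strings, NaN preserved
df['id'] = df['id'].fillna('0').astype('int64')  # fill and convert later as needed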
Most solutions here tell you how to use a placeholder integer to represent nulls. That approach isn't helpful if you're uncertain whether that integer will show up in your source data, though. My method formats floats without their decimal values and converts nulls to None. The result is an object dtype column that looks like an integer field with null values when loaded into a CSV.
keep_df[col] = keep_df[col].apply(lambda x: None if pandas.isnull(x) else '{0:.0f}'.format(pandas.to_numeric(x)))
I ran into this issue working with pyspark. As this is a python frontend for code running on a jvm, it requires type safety and using float instead of int is not an option. I worked around the issue by wrapping the pandas pd.read_csv in a function that will fill user-defined columns with user-defined fill values before casting them to the required type. Here is what I ended up using:
def custom_read_csv(file_path, custom_dtype=None, fill_values=None, **kwargs):
    if custom_dtype is None:
        return pd.read_csv(file_path, **kwargs)
    else:
        assert 'dtype' not in kwargs.keys()
        df = pd.read_csv(file_path, dtype={}, **kwargs)
        for col, typ in custom_dtype.items():
            if fill_values is None or col not in fill_values.keys():
                fill_val = -1
            else:
                fill_val = fill_values[col]
            df[col] = df[col].fillna(fill_val).astype(typ)
        return df
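A hypothetical usage, assuming a file data.csv whose id column may contain missing values:
df = custom_read_csv("data.csv", custom_dtype={'id': 'int64'}, fill_values={'id': 0})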
A B C D
0 foo one 0.162003 0.087469
1 bar one -1.156319 -1.526272
2 foo two 0.833892 -1.666304
3 bar three -2.026673 -0.322057
4 foo two 0.411452 -0.954371
5 bar two 0.765878 -0.095968
6 foo one -0.654890 0.678091
7 foo three -1.789842 -1.130922
> df.groupby('A').transform(lambda x: (x['C'] - x['D']))
ValueError: could not broadcast input array from shape (5) into shape (5,3)
> df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
TypeError: cannot concatenate a non-NDFrame object
Why? The example in the documentation seems to suggest that calling transform on a group allows one to do row-wise processing:
# Note that the following suggests row-wise operation (x.mean is the column mean)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)
In other words, I thought that transform is essentially a specific type of apply (the one that does not aggregate). Where am I wrong?
For reference, below is the construction of the original dataframe above:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
There are two major differences between the transform and apply groupby methods.
Input:
apply implicitly passes all the columns for each group as a DataFrame to the custom function.
while transform passes each column for each group individually as a Series to the custom function.
Output:
The custom function passed to apply can return a scalar, or a Series or DataFrame (or numpy array or even list).
The custom function passed to transform must return a sequence (a one dimensional Series, array or list) the same length as the group.
So, transform works on just one Series at a time and apply works on the entire DataFrame at once.
Inspecting the custom function
It can help quite a bit to inspect the input to your custom function passed to apply or transform.
Examples
Let’s create some sample data and inspect the groups so that you can see what I am talking about:
import pandas as pd
import numpy as np
df = pd.DataFrame({'State':['Texas', 'Texas', 'Florida', 'Florida'],
'a':[4,5,1,3], 'b':[6,10,3,11]})
State a b
0 Texas 4 6
1 Texas 5 10
2 Florida 1 3
3 Florida 3 11
Let’s create a simple custom function that prints out the type of the implicitly passed object and then raises an error so that execution stops.
def inspect(x):
    print(type(x))
    raise
Now let’s pass this function to both the groupby apply and transform methods to see what object is passed to it:
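A sketch of the two calls (the bare raise halts execution right after the type is printed; output as described in the explanation below):
df.groupby('State').apply(inspect)
# <class 'pandas.core.frame.DataFrame'>
# <class 'pandas.core.frame.DataFrame'>

df.groupby('State').transform(inspect)
# <class 'pandas.core.series.Series'>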
As you can see, a DataFrame is passed into the inspect function. You might be wondering why the type, DataFrame, got printed out twice. Pandas runs the first group twice. It does this to determine if there is a fast way to complete the computation or not. This is a minor detail that you shouldn’t worry about.
When the same function is passed to transform, it is passed a Series – a totally different pandas object.
So, transform is only allowed to work with a single Series at a time. It is impossible for it to act on two columns at the same time. So, if we try and subtract column a from b inside of our custom function we would get an error with transform. See below:
def subtract_two(x):
    return x['a'] - x['b']

df.groupby('State').transform(subtract_two)
KeyError: ('a', 'occurred at index a')
We get a KeyError as pandas is attempting to find the Series index a which does not exist. You can complete this operation with apply as it has the entire DataFrame:
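A sketch of the apply version (output for the sample data above; the exact index layout varies by pandas version):
df.groupby('State').apply(subtract_two)
# State
# Florida  2   -2
#          3   -8
# Texas    0   -2
#          1   -5
# dtype: int64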
The output is a Series and a little confusing as the original index is kept, but we have access to all columns.
Displaying the passed pandas object
It can help even more to display the entire pandas object within the custom function, so you can see exactly what you are operating with. You can use print statements, but I like to use the display function from the IPython.display module so that the DataFrames get nicely rendered as HTML in a Jupyter notebook:
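For instance, a variant of the inspect function from above:
from IPython.display import display

def inspect(x):
    display(x)  # renders the passed DataFrame/Series as HTML in the notebook
    raise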
Transform must return a single dimensional sequence the same size as the group
The other difference is that transform must return a single dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not then an error is raised:
def return_three(x):
    return np.array([1, 2, 3])

df.groupby('State').transform(return_three)
ValueError: transform must return a scalar value for each group
The error message is not really descriptive of the problem. You must return a sequence the same length as the group. So, a function like this would work:
def rand_group_len(x):
    return np.random.rand(len(x))

df.groupby('State').transform(rand_group_len)
a b
0 0.962070 0.151440
1 0.440956 0.782176
2 0.642218 0.483257
3 0.056047 0.238208
Returning a single scalar object also works for transform
If you return just a single scalar from your custom function, then transform will use it for each of the rows in the group:
def group_sum(x):
    return x.sum()

df.groupby('State').transform(group_sum)
a b
0 9 16
1 9 16
2 4 14
3 4 14
As I felt similarly confused about the .transform operation vs. .apply, I found a few answers shedding some light on the issue. This answer, for example, was very helpful.
My takeaway so far is that .transform works on (or deals with) Series (columns) in isolation from each other. What this means is that in your last two calls:
You asked .transform to take values from two columns, but it does not actually 'see' both of them at the same time (so to speak). transform looks at the dataframe columns one by one and returns a Series (or group of Series) 'made' of scalars which are repeated len(input_column) times.
So this scalar, which .transform uses to make the Series, is the result of some reduction function applied to an input Series (and only to ONE Series/column at a time).
Consider this example (on your dataframe):
zscore = lambda x: (x - x.mean()) / x.std() # Note that it does not reference anything outside of 'x' and for transform 'x' is one column.
df.groupby('A').transform(zscore)
Note that .apply in the last example (df.groupby('A')['C'].apply(zscore)) would work in exactly the same way, but it would fail if you tried using it on a dataframe:
df.groupby('A').apply(zscore)
gives error:
ValueError: operands could not be broadcast together with shapes (6,) (2,)
So where else is .transform useful? The simplest case is trying to assign results of reduction function back to original dataframe.
df['sum_C'] = df.groupby('A')['C'].transform(sum)
df.sort_values('A') # to clearly see that the scalar ('sum') applies to the whole column of the group (df.sort was removed in newer pandas versions)
yielding:
A B C D sum_C
1 bar one 1.998 0.593 3.973
3 bar three 1.287 -0.639 3.973
5 bar two 0.687 -1.027 3.973
4 foo two 0.205 1.274 4.373
2 foo two 0.128 0.924 4.373
6 foo one 2.113 -0.516 4.373
7 foo three 0.657 -1.179 4.373
0 foo one 1.270 0.201 4.373
Trying the same with .apply would give NaNs in sum_C.
Because .apply would return a reduced Series, which it does not know how to broadcast back:
df.groupby('A')['C'].apply(sum)
giving:
A
bar 3.973
foo 4.373
There are also cases when .transform is used to filter the data:
df[df.groupby(['B'])['D'].transform(sum) < -1]
A B C D
3 bar three 1.287 -0.639
7 foo three 0.657 -1.179
I hope this adds a bit more clarity.
Answer 2
I will use a very simple snippet to illustrate the difference:
test = pd.DataFrame({'id':[1,2,3,1,2,3,1,2,3],'price':[1,2,3,2,3,1,3,1,2]})
grouping = test.groupby('id')['price']
The DataFrame looks like this:
   id  price
0   1      1
1   2      2
2   3      3
3   1      2
4   2      3
5   3      1
6   1      3
7   2      1
8   3      2
There are 3 customer IDs in this table, each customer made three transactions and paid 1,2,3 dollars each time.
Now, I want to find the minimum payment made by each customer. There are two ways of doing it:
Using apply:
grouping.min()
The return looks like this:
id
1 1
2 1
3 1
Name: price, dtype: int64
pandas.core.series.Series # return type
Int64Index([1, 2, 3], dtype='int64', name='id') #The returned Series' index
# length is 3
Using transform:
grouping.transform(min)
The return looks like this:
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
Name: price, dtype: int64
pandas.core.series.Series # return type
RangeIndex(start=0, stop=9, step=1) # The returned Series' index
# length is 9
Both methods return a Series object, but the length of the first one is 3 and the length of the second one is 9.
If you want to answer What is the minimum price paid by each customer, then the apply method is the more suitable one to choose.
If you want to answer What is the difference between the amount paid for each transaction vs the minimum payment, then you want to use transform, because:
test['minimum'] = grouping.transform(min) # creates an extra column filled with the minimum payment
test.price - test.minimum # returns the difference for each row
Apply does not work here simply because it returns a Series of size 3, but the original df’s length is 9. You cannot integrate it back to the original df easily.
I recently began branching out from my safe place (R) into Python and am a bit confused by the cell localization/selection in Pandas. I've read the documentation but I'm struggling to understand the practical implications of the various localization/selection options.
Is there a reason why I should ever use .loc or .iloc over the most general option .ix?
I understand that .loc, iloc, at, and iat may provide some guaranteed correctness that .ix can’t offer, but I’ve also read where .ix tends to be the fastest solution across the board.
Please explain the real-world, best-practices reasoning behind utilizing anything other than .ix?
loc: only works on the index
iloc: works on position
ix: you can get data from the dataframe without it being in the index
at: get scalar values; it's a very fast loc
iat: get scalar values; it's a very fast iloc
Updated for pandas 0.20, given that ix is deprecated. This demonstrates not only how to use loc, iloc, at, iat and set_value, but also how to accomplish mixed positional/label-based indexing.
loc – label based
Allows you to pass 1-D arrays as indexers. Arrays can be either slices (subsets) of the index or column, or they can be boolean arrays which are equal in length to the index or columns.
Special Note: when a scalar indexer is passed, loc can assign a new index or column value that didn’t exist before.
# label based, but we can use position values
# to get the labels from the index object
df.loc[df.index[2], 'ColName'] = 3
df.loc[df.index[1:3], 'ColName'] = 3
iloc – position based
Similar to loc except with positions rather than index values. However, you cannot assign new columns or indices.
# position based, but we can get the position
# from the columns object via the `get_loc` method
df.iloc[2, df.columns.get_loc('ColName')] = 3
df.iloc[2, 4] = 3
df.iloc[:3, 2:4] = 3
at – label based
Works very similar to loc for scalar indexers. Cannot operate on array indexers. Can! assign new indices and columns.
Advantage over loc is that this is faster. Disadvantage is that you can’t use arrays for indexers.
# label based, but we can use position values
# to get the labels from the index object
df.at[df.index[2], 'ColName'] = 3
df.at['C', 'ColName'] = 3
iat – position based
Works similarly to iloc. Cannot work on array indexers. Cannot! assign new indices and columns.
Advantage over iloc is that this is faster. Disadvantage is that you can’t use arrays for indexers.
# position based, but we can get the position
# from the columns object via the `get_loc` method
df.iat[2, df.columns.get_loc('ColName')] = 3
set_value – label based (note: set_value was deprecated in pandas 0.21 and later removed)
Works very similar to loc for scalar indexers. Cannot operate on array indexers. Can! assign new indices and columns
Advantage Super fast, because there is very little overhead! Disadvantage There is very little overhead because pandas is not doing a bunch of safety checks. Use at your own risk. Also, this is not intended for public use.
# label based, but we can use position values
# to get the labels from the index object
df.set_value(df.index[2], 'ColName', 3)
set_value with takable=True – position based
Works similarly to iloc. Cannot work in array indexers. Cannot! assign new indices and columns.
Advantage Super fast, because there is very little overhead! Disadvantage There is very little overhead because pandas is not doing a bunch of safety checks. Use at your own risk. Also, this is not intended for public use.
# position based, but we can get the position
# from the columns object via the `get_loc` method
df.set_value(2, df.columns.get_loc('ColName'), 3, takable=True)
There are two primary ways that pandas makes selections from a DataFrame.
By Label
By Integer Location
The documentation uses the term position for referring to integer location. I do not like this terminology as I feel it is confusing. Integer location is more descriptive and is exactly what .iloc stands for. The key word here is INTEGER – you must use integers when selecting by integer location.
Before showing the summary let’s all make sure that …
.ix is deprecated and ambiguous and should never be used
There are three primary indexers for pandas. We have the indexing operator itself (the brackets []), .loc, and .iloc. Let’s summarize them:
[] – Primarily selects subsets of columns, but can select rows as well. Cannot simultaneously select rows and columns.
.loc – selects subsets of rows and columns by label only
.iloc – selects subsets of rows and columns by integer location only
I almost never use .at or .iat as they add no additional functionality and offer only a small performance increase. I would discourage their use unless you have a very time-sensitive application. Regardless, we have their summary:
.at selects a single scalar value in the DataFrame by label only
.iat selects a single scalar value in the DataFrame by integer location only
In addition to selection by label and integer location, boolean selection also known as boolean indexing exists.
Examples explaining .loc, .iloc, boolean selection and .at and .iat are shown below
We will first focus on the differences between .loc and .iloc. Before we talk about the differences, it is important to understand that DataFrames have labels that help identify each column and each row. Let’s take a look at a sample DataFrame:
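The DataFrame display in the original answer was an image; a construction consistent with the outputs shown below would be (values reconstructed from those outputs, so treat the exact numbers as illustrative):
df = pd.DataFrame({'age': [30, 2, 12, 4, 32, 33, 69],
                   'color': ['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
                   'food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
                   'height': [165, 70, 120, 80, 180, 172, 150],
                   'score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   'state': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
                         'Christina', 'Cornelia'])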
The labels age, color, food, height, score and state are used for the columns. The other labels, Jane, Nick, Aaron, Penelope, Dean, Christina and Cornelia, are used as labels for the rows. Collectively, these row labels are known as the index.
The primary ways to select particular rows in a DataFrame are with the .loc and .iloc indexers. Each of these indexers can also be used to simultaneously select columns but it is easier to just focus on rows for now. Also, each of the indexers use a set of brackets that immediately follow their name to make their selections.
.loc selects data only by labels
We will first talk about the .loc indexer, which only selects data by the index or column labels. In our sample DataFrame, we have provided meaningful names as values for the index. Many DataFrames will not have any meaningful names and will instead default to just the integers from 0 to n-1, where n is the length (number of rows) of the DataFrame.
There are three different inputs you can use for .loc: a single string, a list of strings, or slice notation using strings as the start and stop values.
Selecting a single row with .loc with a string
To select a single row of data, place the index label inside of the brackets following .loc.
df.loc['Penelope']
This returns the row of data as a Series
age 4
color white
food Apple
height 80
score 3.3
state AL
Name: Penelope, dtype: object
Selecting multiple rows with .loc with a list of strings
df.loc[['Cornelia', 'Jane', 'Dean']]
This returns a DataFrame with the rows in the order specified in the list:
Selecting multiple rows with .loc with slice notation
Slice notation is defined by start, stop and step values. When slicing by label, pandas includes the stop value in the return. The following slices from Aaron to Dean, inclusive. Its step size is not explicitly defined, so it defaults to 1.
df.loc['Aaron':'Dean']
Complex slices can be taken in the same manner as Python lists.
.iloc selects data only by integer location
Let’s now turn to .iloc. Every row and column of data in a DataFrame has an integer location that defines it. This is in addition to the label that is visually displayed in the output. The integer location is simply the number of rows/columns from the top/left beginning at 0.
There are three different inputs you can use for .iloc: a single integer, a list of integers, or slice notation using integers as the start and stop values.
Selecting a single row with .iloc with an integer
df.iloc[4]
This returns the 5th row (integer location 4) as a Series
age 32
color gray
food Cheese
height 180
score 1.8
state AK
Name: Dean, dtype: object
Selecting multiple rows with .iloc with a list of integers
df.iloc[[2, -2]]
This returns a DataFrame of the third and second to last rows:
Selecting multiple rows with .iloc with slice notation
df.iloc[:5:3]
Simultaneous selection of rows and columns with .loc and .iloc
One excellent ability of both .loc/.iloc is their ability to select both rows and columns simultaneously. In the examples above, all the columns were returned from each selection. We can choose columns with the same types of inputs as we do for rows. We simply need to separate the row and column selection with a comma.
For example, we can select rows Jane, and Dean with just the columns height, score and state like this:
df.loc[['Jane', 'Dean'], 'height':]
This uses a list of labels for the rows and slice notation for the columns
We can naturally do similar operations with .iloc using only integers.
df.iloc[[1,4], 2]
Nick Lamb
Dean Cheese
Name: food, dtype: object
Simultaneous selection with labels and integer location
.ix was used to make selections simultaneously by label and integer location. While useful, it was confusing and ambiguous at times, and thankfully it has been deprecated. In the event that you need to make a selection with a mix of labels and integer locations, you will have to make both of your selections labels or integer locations.
For instance, if we want to select rows Nick and Cornelia along with columns 2 and 4, we could use .loc by converting the integers to labels with the following:
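A sketch of that conversion (using the sample DataFrame above):
col_names = df.columns[[2, 4]]
df.loc[['Nick', 'Cornelia'], col_names]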
Or alternatively, convert the index labels to integers with the get_loc index method.
labels = ['Nick', 'Cornelia']
index_ints = [df.index.get_loc(label) for label in labels]
df.iloc[index_ints, [2, 4]]
Boolean Selection
The .loc indexer can also do boolean selection. For instance, if we are interested in finding all the rows where age is above 30 and return just the food and score columns we can do the following:
df.loc[df['age'] > 30, ['food', 'score']]
You can replicate this with .iloc but you cannot pass it a boolean series. You must convert the boolean Series into a numpy array like this:
df.iloc[(df['age'] > 30).values, [2, 4]]
Selecting all rows
It is possible to use .loc/.iloc for just column selection. You can select all the rows by using a colon like this:
df.loc[:, 'color':'score':2]
The indexing operator, [], can select rows and columns too, but not simultaneously.
Most people are familiar with the primary purpose of the DataFrame indexing operator, which is to select columns. A string selects a single column as a Series and a list of strings selects multiple columns as a DataFrame.
df['food']
Jane Steak
Nick Lamb
Aaron Mango
Penelope Apple
Dean Cheese
Christina Melon
Cornelia Beans
Name: food, dtype: object
Using a list selects multiple columns
df[['food', 'score']]
What people are less familiar with is that, when slice notation is used, selection happens by row labels or by integer location. This is very confusing and something that I almost never use, but it does work.
df['Penelope':'Christina'] # slice rows by label
df[2:6:2] # slice rows by integer location
The explicitness of .loc/.iloc for selecting rows is highly preferred. The indexing operator alone is unable to select rows and columns simultaneously.
Selection with .at is nearly identical to .loc but it only selects a single ‘cell’ in your DataFrame. We usually refer to this cell as a scalar value. To use .at, pass it both a row and column label separated by a comma.
df.at['Christina', 'color']
'black'
Selection with .iat is nearly identical to .iloc but it only selects a single scalar value. You must pass it an integer for both the row and column locations
df.iat[2, 5]
'FL'
Answer 3
df = pd.DataFrame({'A':['a', 'b', 'c'], 'B':[54, 67, 89]}, index=[100, 200, 300])
df
A B
100 a 54
200 b 67
300 c 89
In [19]:
df.loc[100]
Out[19]:
A a
B 54
Name: 100, dtype: object
In [20]:
df.iloc[0]
Out[20]:
A a
B 54
Name: 100, dtype: object
In [24]:
df2 = df.set_index([df.index,'A'])
df2
Out[24]:
B
A
100 a 54
200 b 67
300 c 89
In [25]:
df2.ix[100, 'a']
Out[25]:
B 54
Name: (100, a), dtype: int64
Answer 4
Let's start with this small df:
import pandas as pd
import time as tm
import numpy as np
n = 10
a = np.arange(0, n**2)
df = pd.DataFrame(a.reshape(n, n))
df.iloc[3,3]
Out[33]: 33
df.iat[3,3]
Out[34]: 33
df.iloc[:3,:3]
Out[35]:
0 1 2 3
0 0 1 2 3
1 10 11 12 13
2 20 21 22 23
3 30 31 32 33
df.iat[:3,:3]
Traceback (most recent call last):
... omissis ...
ValueError: At based indexing on an integer index can only have integer indexers
Thus we cannot use .iat for subsets; there we must use .iloc only.
But let’s try both to select from a larger df and let’s check the speed …
# -*- coding: utf-8 -*-
"""
Created on Wed Feb 7 09:58:39 2018
@author: Fabio Pomi
"""
import pandas as pd
import time as tm
import numpy as np
n = 1000
a = np.arange(0, n**2)
df = pd.DataFrame(a.reshape(n, n))
t1 = tm.time()
for j in df.index:
    for i in df.columns:
        a = df.iloc[j, i]
t2 = tm.time()
for j in df.index:
    for i in df.columns:
        a = df.iat[j, i]
t3 = tm.time()
loc = t2 - t1
at = t3 - t2
prc = loc / at * 100
print('\nloc:%f at:%f prc:%f' % (loc, at, prc))
loc:10.485600 at:7.395423 prc:141.784987
So with .iloc we can manage subsets and with .iat only a single scalar, but .iat is faster than .iloc.
Traceback (most recent call last):
  File "test_searborn.py", line 11, in <module>
    fig = sns_plot.get_figure()
AttributeError: 'PairGrid' object has no attribute 'get_figure'
The suggested solutions are incompatible with Seaborn 0.8.1
giving the following errors because the Seaborn interface has changed:
AttributeError: 'AxesSubplot' object has no attribute 'fig'
When trying to access the figure
AttributeError: 'AxesSubplot' object has no attribute 'savefig'
when trying to use the savefig directly as a function
The following calls allow you to access the figure (Seaborn 0.8.1 compatible):
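A sketch of the working pattern (swarmplot is just an arbitrary example of a plot that returns an AxesSubplot; the tips dataset ships with seaborn):
import seaborn as sns

tips = sns.load_dataset('tips')
sns_plot = sns.swarmplot(x='day', y='total_bill', data=tips)
fig = sns_plot.get_figure()  # AxesSubplot -> matplotlib Figure
fig.savefig('output.png')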
UPDATE:
I have recently used PairGrid object from seaborn to generate a plot similar to the one in this example.
In this case, since PairGrid is not a plot object like, for example, sns.swarmplot, it has no get_figure() function.
It is possible to directly access the matplotlib figure by
fig = myGridPlotObject.fig
Like previously suggested in other posts in this thread.
Some of the above solutions did not work for me. The .fig attribute was not found when I tried that and I was unable to use .savefig() directly. However, what did work was:
sns_plot.figure.savefig("output.png")
I am a newer Python user, so I do not know if this is due to an update. I wanted to mention it in case anybody else runs into the same issues as I did.
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.factorplot(x='holiday',data=data,kind='count',size=5,aspect=1)
plt.savefig('holiday-vs-count.png')
Answer 7
It's also possible to just create a matplotlib figure object and then use plt.savefig(...):
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
df = sns.load_dataset('iris')
plt.figure() # Push new figure on stack
sns_plot = sns.pairplot(df, hue='species', size=2.5)
plt.savefig('output.png') # Save that figure
As I commented, you need to use a StringIO object and decode, i.e. c = pd.read_csv(io.StringIO(s.decode("utf-8"))), if using requests; you need to decode because .content returns bytes. If you used .text, you would just need to pass s as is: s = requests.get(url).text; c = pd.read_csv(StringIO(s)).
A simpler approach is to pass the correct URL of the raw data directly to read_csv. You don't have to pass a file-like object; you can pass a URL, so you don't need requests at all:
c = pd.read_csv("https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv")
print(c)
Output:
Country Region
0 Algeria AFRICA
1 Angola AFRICA
2 Benin AFRICA
3 Botswana AFRICA
4 Burkina AFRICA
5 Burundi AFRICA
6 Cameroon AFRICA
..................................
string or file handle / StringIO
The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.csv
The problem you’re having is that the output you get into the variable ‘s’ is not a csv, but a html file.
In order to get the raw csv, you have to modify the url to: https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv
Your second problem is that read_csv expects a file name, we can solve this by using StringIO from io module.
Third problem is that request.get(url).content delivers a byte stream, we can solve this using the request.get(url).text instead.
End result is this code:
from io import StringIO
import pandas as pd
import requests
url='https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv'
s=requests.get(url).text
c=pd.read_csv(StringIO(s))
output:
>>> c.head()
Country Region
0 Algeria AFRICA
1 Angola AFRICA
2 Benin AFRICA
3 Botswana AFRICA
4 Burkina AFRICA
Answer 4
url = "https://github.com/cs109/2014_data/blob/master/countries.csv"
c = pd.read_csv(url, sep="\t")