Sven has shown how to use the class gaussian_kde from SciPy, but you will notice that it doesn't look quite like what you generated with R. This is because gaussian_kde tries to infer the bandwidth automatically. You can play with the bandwidth by changing the function covariance_factor of the gaussian_kde class. First, here is what you get without changing that function:

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
density = gaussian_kde(data)  # bandwidth is inferred automatically
xs = np.linspace(0, 8, 200)
plt.plot(xs, density(xs))
plt.show()
However, if I use the following code:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
density = gaussian_kde(data)
xs = np.linspace(0, 8, 200)
density.covariance_factor = lambda: .25  # override the automatic bandwidth
density._compute_covariance()            # recompute internals with the new factor
plt.plot(xs, density(xs))
plt.show()
I get a density plot which is pretty close to what you are getting from R. What have I done? gaussian_kde uses a changeable function, covariance_factor, to calculate its bandwidth. Before changing the function, the value returned by covariance_factor for this data was about .5. Lowering this lowered the bandwidth. I had to call _compute_covariance after changing that function so that all of the factors would be calculated correctly. It isn't an exact correspondence with the bw parameter from R, but hopefully it helps you head in the right direction.
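As a side note, SciPy 0.11 and later expose a public set_bandwidth method on gaussian_kde that recomputes the covariance for you, so poking the private _compute_covariance is no longer necessary. A minimal sketch:

import numpy as np
from scipy.stats import gaussian_kde

data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
density = gaussian_kde(data)
density.set_bandwidth(bw_method=0.25)  # equivalent to overriding covariance_factor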
import numpy as np
import seaborn as sns

data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
sns.set_style('whitegrid')
sns.kdeplot(np.array(data), bw=0.5)  # on seaborn >= 0.11, pass bw_method=0.5 instead
Option 1:
Use pandas DataFrame.plot (built on top of matplotlib):
import pandas as pd

data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
pd.DataFrame(data).plot(kind='density')  # or pd.Series(data).plot(kind='density')
Option 2:
Use distplot of seaborn:
import seaborn as sns
data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
sns.distplot(data, hist=False)
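Note that distplot has since been deprecated (and removed in recent seaborn releases). On seaborn 0.11 or newer, an equivalent KDE-only plot would be:

import seaborn as sns

data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
sns.kdeplot(data)              # axes-level KDE curve
sns.displot(data, kind='kde')  # figure-level equivalent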
Answer 3
Maybe try something like:
import matplotlib.pyplot as plt
import numpy
from scipy import stats
data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
density = stats.gaussian_kde(data)  # the public name; the stats.kde module is deprecated in newer SciPy
x = numpy.arange(0., 8, .1)
plt.plot(x, density(x))
plt.show()
The density plot can also be created by using matplotlib:
The function plt.hist(data) returns the y and x values necessary for the density plot (see the documentation https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.hist.html).
As a result, the following code creates a density plot using the matplotlib library:
import matplotlib.pyplot as plt

dat = [-1, 2, 1, 4, -5, 3, 6, 1, 2, 1, 2, 5, 6, 5, 6, 2, 2, 2]
a = plt.hist(dat, density=True)  # a[0] holds the densities, a[1] the bin edges
plt.close()                      # discard the histogram figure itself
plt.figure()
plt.plot(a[1][1:], a[0])         # plot density against the right-hand bin edges
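A variant that avoids creating and closing a throwaway figure is np.histogram, which returns the same densities and bin edges without drawing anything:

import matplotlib.pyplot as plt
import numpy as np

dat = [-1, 2, 1, 4, -5, 3, 6, 1, 2, 1, 2, 5, 6, 5, 6, 2, 2, 2]
densities, edges = np.histogram(dat, density=True)
plt.plot(edges[1:], densities)
plt.show()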
I recently came across the pandas library for python, which according to this benchmark performs very fast in-memory merges. It’s even faster than the data.table package in R (my language of choice for analysis).
Why is pandas so much faster than data.table? Is it because of an inherent speed advantage python has over R, or is there some tradeoff I’m not aware of? Is there a way to perform inner and outer joins in data.table without resorting to merge(X, Y, all=FALSE) and merge(X, Y, all=TRUE)?
Here’s the R code and the Python code used to benchmark the various packages.
It looks like Wes may have discovered a known issue in data.table when the number of unique strings (levels) is large: 10,000.
Does Rprof() reveal most of the time spent in the call sortedmatch(levels(i[[lc]]), levels(x[[rc]]))? This isn't really the join itself (the algorithm), but a preliminary step.
Recent efforts have gone into allowing character columns in keys, which should resolve that issue by integrating more closely with R’s own global string hash table. Some benchmark results are already reported by test.data.table() but that code isn’t hooked up yet to replace the levels to levels match.
Are pandas merges faster than data.table for regular integer columns? That should be a way to isolate the algorithm itself vs factor issues.
Also, data.table has time series merge in mind. Two aspects to that: i) multi column ordered keys such as (id,datetime) ii) fast prevailing join (roll=TRUE) a.k.a. last observation carried forward.
I’ll need some time to confirm as it’s the first I’ve seen of the comparison to data.table as presented.
UPDATE from data.table v1.8.0 released July 2012
Internal function sortedmatch() removed and replaced with chmatch()
when matching i levels to x levels for columns of type ‘factor’. This
preliminary step was causing a (known) significant slowdown when the number
of levels of a factor column was large (e.g. >10,000). Exacerbated in
tests of joining four such columns, as demonstrated by Wes McKinney
(author of Python package Pandas). Matching 1 million strings, of which
600,000 are unique, is now reduced from 16s to 0.5s, for example.
Also in that release:
character columns are now allowed in keys and are preferred to
factor. data.table() and setkey() no longer coerce character to
factor. Factors are still supported. Implements FR#1493, FR#1224
and (partially) FR#951.
New functions chmatch() and %chin%, faster versions of match()
and %in% for character vectors. R’s internal string cache is
utilised (no hash table is built). They are about 4 times faster
than match() on the example in ?chmatch.
As of Sep 2013 data.table is v1.8.10 on CRAN and we’re working on v1.9.0. NEWS is updated live.
But as I wrote originally, above:
data.table has time series merge in mind. Two aspects to that: i)
multi column ordered keys such as (id,datetime) ii) fast prevailing
join (roll=TRUE) a.k.a. last observation carried forward.
So the Pandas equi-join of two character columns is probably still faster than data.table, since it sounds like pandas hashes the combined two columns. data.table doesn't hash the key because it has prevailing ordered joins in mind. A "key" in data.table is literally just the sort order (similar to a clustered index in SQL; i.e., that's how the data is ordered in RAM). Adding secondary keys, for example, is on the list.
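For readers coming from pandas, the closest analogue of this prevailing join (roll=TRUE, last observation carried forward) is pd.merge_asof, which matches each left row with the most recent right row on an ordered key. A minimal sketch with hypothetical trade/quote tables:

import pandas as pd

trades = pd.DataFrame({
    'time': pd.to_datetime(['2019-01-01 09:00:01', '2019-01-01 09:00:03']),
    'id': ['A', 'A'],
    'price': [100.0, 101.0],
})
quotes = pd.DataFrame({
    'time': pd.to_datetime(['2019-01-01 09:00:00', '2019-01-01 09:00:02']),
    'id': ['A', 'A'],
    'bid': [99.5, 100.5],
})

# for each trade, take the prevailing quote within each id group
# (both frames must already be sorted on the 'on' column)
print(pd.merge_asof(trades, quotes, on='time', by='id'))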
In summary, the glaring speed difference highlighted by this particular two-character-column test with over 10,000 unique strings shouldn’t be as bad now, since the known problem has been fixed.
The comparison with data.table is actually a bit interesting because the whole point of R's data.table is that it contains pre-computed indexes for various columns to accelerate operations like data selection and merges. In this case (database joins) pandas' DataFrame contains no pre-computed information that is being used for the merge, so it's a "cold" merge, so to speak. If I had stored the factorized versions of the join keys, the join would be significantly faster, as factorizing is the biggest bottleneck for this algorithm.
I should also add that the internal design of pandas’ DataFrame is much more amenable to these kinds of operations than R’s data.frame (which is just a list of arrays internally).
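To illustrate the factorization step Wes mentions, pandas exposes it as pd.factorize, which converts the string keys into integer codes that the merge algorithm can work on directly. A small sketch:

import pandas as pd

keys = ['b', 'b', 'a', 'c', 'a', 'b']
codes, uniques = pd.factorize(keys)
print(codes)    # [0 0 1 2 1 0] -- one integer code per original value
print(uniques)  # ['b' 'a' 'c'] -- distinct values in order of appearance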
It would be interesting to know if Wes and/or Matt (who, by the way, are the creators of Pandas and data.table respectively and have both commented above) have any news to add here as well.
This graph depicts the average times of aggregation and join operations for different technologies (lower = faster; comparison last updated in Sept 2016). It was really educational for me.
Going back to the question, R DT key and R DT refer to the keyed/unkeyed flavors of R’s data.table and happen to be faster in this benchmark than Python’s Pandas (Py pandas).
There are already great answers here, notably from the authors of both tools the question asks about.
Matt's answer explains the case reported in the question: it was caused by a bug, not by the merge algorithm itself. The bug was fixed the next day, more than 7 years ago now.
In my answer I will provide some up-to-date timings of the merge operation for data.table and pandas. Note that plyr and base R merge are not included.
The timings I am presenting come from the db-benchmark project, a continuously run, reproducible benchmark. It upgrades tools to their recent versions and re-runs the benchmark scripts. It also covers many other software solutions; if you are interested in Spark, Dask and a few others, be sure to check the link.
As of now (still to be implemented: one more data size and 5 more questions):
We test 2 different data sizes of the LHS table.
For each of those data sizes we run 5 different merge questions (a pandas sketch of a q1-style query appears below):
q1: LHS inner join RHS-small on integer
q2: LHS inner join RHS-medium on integer
q3: LHS outer join RHS-medium on integer
q4: LHS inner join RHS-medium on factor (categorical)
q5: LHS inner join RHS-big on integer
The RHS table comes in 3 different sizes:
small translates to size of LHS/1e6
medium translates to size of LHS/1e3
big translates to size of LHS
In all cases around 90% of LHS rows match a row in RHS, and there are no duplicates in the RHS join column (so no cartesian product).
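Purely as an illustration (the real benchmark scripts live in the db-benchmark repository, and the sizes below are made up for the sketch), a q1-style inner join on an integer key in pandas might look like:

import numpy as np
import pandas as pd

n = 1_000_000  # stand-in for the LHS size; the benchmark uses far larger tables
lhs = pd.DataFrame({'id': np.random.randint(0, n, n), 'v1': np.random.rand(n)})
rhs_small = pd.DataFrame({'id': np.random.choice(n, n // 100, replace=False),
                          'v2': np.random.rand(n // 100)})

# q1: LHS inner join RHS-small on integer
ans = lhs.merge(rhs_small, on='id', how='inner')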
As of now (run on 2nd Nov 2019)
pandas 0.25.3 released on 1st Nov 2019
data.table 1.12.7 (92abb70) released on 2nd Nov 2019
The timings below are in seconds, for the two different LHS data sizes. The column pd2dt stores the ratio of how many times pandas is slower than data.table.
Haxe

cd impls/haxe
# Neko
make all-neko
neko ./stepX_YYY.n
# Python
make all-python
python3 ./stepX_YYY.py
# C++
make all-cpp
./cpp/stepX_YYY
# JavaScript
make all-js
node ./stepX_YYY.js
Hy

The Hy implementation of mal has been tested with Hy 0.13.0.
cd impls/hy
./stepX_YYY.hy
Io

The Io implementation of mal has been tested with Io version 20110905.
cd impls/io
io ./stepX_YYY.io
Janet

The Janet implementation of mal has been tested with Janet version 1.12.2.
cd impls/janet
janet ./stepX_YYY.janet
Java 1.7
The Java implementation of mal requires maven2 to build.
cd impls/java
mvn compile
mvn -quiet exec:java -Dexec.mainClass=mal.stepX_YYY
# OR
mvn -quiet exec:java -Dexec.mainClass=mal.stepX_YYY -Dexec.args="CMDLINE_ARGS"
Java, using Truffle for GraalVM

This Java implementation runs on OpenJDK, but thanks to the Truffle framework it can run up to 30x faster on GraalVM. It has been tested with OpenJDK 11, GraalVM CE 20.1.0 and GraalVM CE 21.1.0.
cd impls/java-truffle
./gradlew build
STEP=stepX_YYY ./run
JavaScript/Node
cd impls/js
npm install
node stepX_YYY.js
Julia

The Julia implementation of mal requires Julia 0.4.
cd impls/julia
julia stepX_YYY.jl
JQ
Tested against version 1.6, with a lot of cheating in the IO department.
cd impls/jq
STEP=stepA_YYY ./run
# with Debug
DEBUG=true STEP=stepA_YYY ./run
Kotlin

The Kotlin implementation of mal has been tested with Kotlin 1.0.
cd impls/kotlin
make
java -jar stepX_YYY.jar
LiveScript
The LiveScript implementation of mal has been tested with LiveScript 1.5.
cd impls/livescript
make
node_modules/.bin/lsc stepX_YYY.ls
Logo

The Logo implementation of mal has been tested with UCBLogo 6.0.
cd impls/logo
logo stepX_YYY.lg
Lua

The Lua implementation of mal has been tested with Lua 5.3.5. The implementation requires luarocks to be installed.
cd impls/lua
make # to build and link linenoise.so and rex_pcre.so
./stepX_YYY.lua
miniMAL

cd impls/miniMAL
# Download miniMAL and dependencies
npm install
export PATH=`pwd`/node_modules/minimal-lisp/:$PATH
# Now run mal implementation in miniMAL
miniMAL ./stepX_YYY
To run the tests of the self-hosted mal implementation on a host implementation IMPL:

make MAL_IMPL=IMPL "test^mal^step2"
# e.g.
make "test^mal^step2" # js is default
make MAL_IMPL=ruby "test^mal^step2"
make MAL_IMPL=python "test^mal^step2"
Starting the REPL

To start the REPL of an implementation at a specific step:
make "repl^IMPL^stepX"
# e.g
make "repl^ruby^step3"
make "repl^ps^step4"
If you omit the step, then stepA is used:
make "repl^IMPL"
# e.g
make "repl^ruby"
make "repl^ps"
# We recommend running this in a fresh R session or restarting your current session
install.packages(c("cmdstanr", "posterior"), repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
# If you haven't installed cmdstan before, run: cmdstanr::install_cmdstan()
# Otherwise, you can point cmdstanr to your cmdstan path: cmdstanr::set_cmdstan_path(path = <yourexistingcmdstan>)
# Set the R_STAN_BACKEND environment variable
Sys.setenv(R_STAN_BACKEND="CMDSTANR")