I regularly perform pandas operations on data frames in excess of 15 million or so rows and I’d love to have access to a progress indicator for particular operations.
Does a text based progress indicator for pandas split-apply-combine operations exist?
where feature_rollup is a somewhat involved function that takes many DF columns and creates new user columns through various methods. These operations can take a while for large data frames, so I’d like to know if it is possible to have text-based output in an IPython notebook that updates me on the progress.
So far, I’ve tried canonical loop progress indicators for Python but they don’t interact with pandas in any meaningful way.
I’m hoping there’s something I’ve overlooked in the pandas library/documentation that allows one to know the progress of a split-apply-combine. A simple implementation would maybe look at the total number of data frame subsets upon which the apply function is working and report progress as the completed fraction of those subsets.
Is this perhaps something that needs to be added to the library?
Due to popular demand, tqdm has added support for pandas. Unlike the other answers, this will not noticeably slow pandas down — here’s an example for DataFrameGroupBy.progress_apply:
import pandas as pd
import numpy as np
from tqdm import tqdm
# from tqdm.auto import tqdm # for notebooks
df = pd.DataFrame(np.random.randint(0, int(1e8), (10000, 1000)))
# Create and register a new `tqdm` instance with `pandas`
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas()
# Now you can use `progress_apply` instead of `apply`
df.groupby(0).progress_apply(lambda x: x**2)
In case you’re interested in how this works (and how to modify it for your own callbacks), see the examples on GitHub, the full documentation on PyPI, or import the module and run help(tqdm). Other supported functions include map, applymap, aggregate, and transform.
EDIT
To directly answer the original question, replace:
Note: the apply progress percentage updates inline. If your function writes to stdout, this won’t work.
In [11]: g = df_users.groupby(['userID', 'requestDate'])
In [12]: f = feature_rollup
In [13]: logged_apply(g, f)
apply progress: 100%
Out[13]:
...
As usual you can add this to your groupby objects as a method:
from pandas.core.groupby import DataFrameGroupBy
DataFrameGroupBy.logged_apply = logged_apply
In [21]: g.logged_apply(f)
apply progress: 100%
Out[21]:
...
As mentioned in the comments, this isn’t a feature that core pandas would be interested in implementing. But Python allows you to create these for many pandas objects/methods (doing so would be quite a bit of work, although you should be able to generalise this approach).
For anyone who’s looking to apply tqdm on their custom parallel pandas-apply code.
(I tried some of the libraries for parallelization over the years, but I never found a 100% parallelization solution, mainly for the apply function, and I always had to come back for my “manual” code.)
df_multi_core – this is the one you call. It accepts:
Your df object
The function name you’d like to call
The subset of columns the function can be performed upon (helps reducing time / memory)
The number of jobs to run in parallel (-1 or omit for all cores)
Any other kwargs the df’s function accepts (like “axis”)
_df_split – this is an internal helper function that has to be defined at module level (Pool.map is “placement dependent”); otherwise I’d define it inside df_multi_core.
Here’s the code from my gist (I’ll add more pandas function tests there):
import pandas as pd
import numpy as np
import multiprocessing
from functools import partial

def _df_split(tup_arg, **kwargs):
    # Runs in a worker process: call the named DataFrame method on one split.
    split_ind, df_split, df_f_name = tup_arg
    return (split_ind, getattr(df_split, df_f_name)(**kwargs))

def df_multi_core(df, df_f_name, subset=None, njobs=-1, **kwargs):
    if njobs == -1:
        njobs = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=njobs)
    try:
        splits = np.array_split(df[subset], njobs)
    except (KeyError, ValueError):
        # subset=None (or an unusable subset): fall back to the full frame
        splits = np.array_split(df, njobs)
    pool_data = [(split_ind, df_split, df_f_name) for split_ind, df_split in enumerate(splits)]
    results = pool.map(partial(_df_split, **kwargs), pool_data)
    pool.close()
    pool.join()
    # Restore the original row order before concatenating.
    results = sorted(results, key=lambda x: x[0])
    results = pd.concat([split[1] for split in results])
    return results
Below is test code for a parallelized apply with tqdm “progress_apply”.
from time import time
from tqdm import tqdm
tqdm.pandas()

if __name__ == '__main__':
    sep = '-' * 50

    # tqdm progress_apply test
    def apply_f(row):
        return row['c1'] + 0.1

    N = 1000000
    np.random.seed(0)
    df = pd.DataFrame({'c1': np.arange(N), 'c2': np.arange(N)})
    print('testing pandas apply on {}\n{}'.format(df.shape, sep))
    t1 = time()
    res = df.progress_apply(apply_f, axis=1)
    t2 = time()
    print('result random sample\n{}'.format(res.sample(n=3, random_state=0)))
    print('time for native implementation {}\n{}'.format(round(t2 - t1, 2), sep))
    t3 = time()
    # res = df_multi_core(df=df, df_f_name='apply', subset=['c1'], njobs=-1, func=apply_f, axis=1)
    res = df_multi_core(df=df, df_f_name='progress_apply', subset=['c1'], njobs=-1, func=apply_f, axis=1)
    t4 = time()
    print('result random sample\n{}'.format(res.sample(n=3, random_state=0)))
    print('time for multi core implementation {}\n{}'.format(round(t4 - t3, 2), sep))
In the output you can see one progress bar for the run without parallelization, and per-core progress bars for the run with parallelization.
There is a slight hiccup and sometimes the bars for the rest of the cores appear all at once, but even then I think it’s useful, since you get the progress stats per core (it/sec and total records, for example).
Thank you @abcdaa for this great library!
Answer 4
You can do this easily with a decorator:
from functools import wraps

def logging_decorator(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.count += 1
        print("The function I modify has been called {0} time(s).".format(
            wrapper.count))
        # Return the result so apply() still receives the function's output.
        return func(*args, **kwargs)
    wrapper.count = 0
    return wrapper

modified_function = logging_decorator(feature_rollup)
then just use the modified_function (and change when you want it to print)
I’ve changed Jeff’s answer, to include a total, so that you can track progress and a variable to just print every X iterations (this actually improves the performance by a lot, if the “print_at” is reasonably high)
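The modified code itself is not shown here, so what follows is only a minimal sketch of the idea described above, not the original implementation. The names with_progress, total, print_at and out are hypothetical. It wraps any function so that every print_at-th call rewrites a single progress line with the completed fraction.

```python
import sys


def with_progress(func, total, print_at=1, out=sys.stdout):
    """Wrap func so that every print_at-th call reports count / total."""
    state = {"count": 0}

    def wrapper(*args, **kwargs):
        state["count"] += 1
        count = state["count"]
        # Printing only every print_at calls keeps the reporting overhead low.
        if count % print_at == 0 or count == total:
            out.write("\rapply progress: {:.0%}".format(count / total))
            out.flush()
        return func(*args, **kwargs)

    return wrapper


# Plain-Python demonstration: four "groups", reporting on every second call.
groups = [[1, 2], [3], [4, 5, 6], [7]]
tracked_sum = with_progress(sum, total=len(groups), print_at=2)
results = [tracked_sum(group) for group in groups]
print()
print(results)
```

With pandas, the same wrapper could be passed to a groupby, for example g.apply(with_progress(feature_rollup, total=g.ngroups, print_at=1000)), where ngroups gives the number of groups. Older pandas versions may call the function one extra time on the first group, so the count can be off by one.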
Sometimes I rerun a script within the same ipython session and I get bad surprises when variables haven’t been cleared. How do I clear all variables?
And is it possible to force this somehow every time I invoke the magic command %run?
I’m trying to run a script that launches, amongst other things, a python script. I get an ImportError: No module named …, however, if I launch ipython and import the same module in the same way through the interpreter, the module is accepted.
What’s going on, and how can I fix it? I’ve tried to understand how python uses PYTHONPATH but I’m thoroughly confused. Any help would be greatly appreciated.
This issue arises due to the ways in which the command line IPython interpreter uses your current path vs. the way a separate process does (be it an IPython notebook, external process, etc). IPython will look for modules to import that are not only found in your sys.path, but also on your current working directory. When starting an interpreter from the command line, the current directory you’re operating in is the same one you started ipython in. If you run
import os
os.getcwd()
you’ll see this is true.
However, say you’re using an IPython notebook: run os.getcwd() there and your current working directory is instead the folder you told the notebook to operate from in your ipython_notebook_config.py file (typically via the c.NotebookManager.notebook_dir setting).
The solution is to provide the python interpreter with the path-to-your-module. The simplest solution is to append that path to your sys.path list. In your notebook, first try:
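As a minimal sketch of that approach (the folder and the module name my_helper_module are hypothetical stand-ins created only for this demonstration; in practice you would append the real folder that holds your module):

```python
import importlib
import os
import sys
import tempfile

# Hypothetical module folder, created here only so the example is runnable.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "my_helper_module.py"), "w") as fh:
    fh.write("GREETING = 'hello from my_helper_module'\n")

# Make the folder visible to the import system for this session only.
if module_dir not in sys.path:
    sys.path.append(module_dir)
importlib.invalidate_caches()

my_helper_module = importlib.import_module("my_helper_module")
print(my_helper_module.GREETING)
```

The change lasts only for the current interpreter session, which is why a more permanent setting is usually preferable.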
If that doesn’t work, you’ve got a different problem on your hands unrelated to path-to-import and you should provide more info about your problem.
The better (and more permanent) way to solve this is to set your PYTHONPATH, which provides the interpreter with additional directories to look in for python packages/modules. Editing or setting the PYTHONPATH as a global variable is OS dependent, and is discussed in detail here for Unix or Windows.
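To see the effect of PYTHONPATH without touching any shell configuration, here is a small sketch that starts a child interpreter with the variable set and checks that the directory shows up in the child's sys.path. The temporary directory is a hypothetical stand-in for your module folder.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical module folder standing in for "my-path-to-module-folder".
module_dir = tempfile.mkdtemp()

# Copy the current environment and set PYTHONPATH for the child process only.
env = dict(os.environ, PYTHONPATH=module_dir)
child_code = "import sys; print('\\n'.join(sys.path))"
completed = subprocess.run(
    [sys.executable, "-c", child_code],
    env=env, capture_output=True, text=True, check=True,
)

# Compare resolved paths so symlinked temp directories still match.
child_paths = [os.path.realpath(p) for p in completed.stdout.splitlines() if p]
found = os.path.realpath(module_dir) in child_paths
print("PYTHONPATH entry visible in child sys.path:", found)
```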
ipython profile create
... check the path it prints, and edit the config file it reports; in my case:
vi /home/osboxes/.ipython/profile_default/ipython_kernel_config.py
Doing sys.path.append('my-path-to-module-folder') will work, but to avoid having to do this in IPython every time you want to use the module, you can add export PYTHONPATH="my-path-to-module-folder:$PYTHONPATH" to your ~/.bash_profile file.
I have ipython installed both locally and, commonly, in virtualenvs. My problem was that, inside a newly made virtualenv with ipython, the system ipython was picked up, which was a different version than the python and ipython in the virtualenv (a 2.7.x vs. a 3.5.x), and hilarity ensued.
I think the smart thing to do whenever installing something that will have a binary in yourvirtualenv/bin is to immediately run rehash or similar for whatever shell you are using so that the correct python/ipython gets picked up. (Gotta check if there are suitable pip post-install hooks…)
Answer 11
A solution without scripts:
Open Spyder -> Tools -> PYTHONPATH manager
Add the Python path by clicking “Add path”. For example: “C:\Users\User\AppData\Local\Programs\Python\Python37\Lib\site-packages”
This kind of error most probably occurs due to Python version conflicts. For example, if your application runs only on Python 3 and you have Python 2 installed as well, then it’s better to specify which version to use.
For example use
I am using ipython Jupyter notebook. Let’s say I defined a function that occupies a lot of space on my screen. Is there a way to collapse the cell?
I want the function to remain executed and callable, yet I want to hide / collapse the cell in order to better visualize the notebook. How can I do this?
The jupyter contrib nbextensions Python package contains a code-folding extension that can be enabled within the notebook. Follow the link (Github) for documentation.
To make life easier in managing them, I’d also recommend the jupyter nbextensions configurator package. This provides an extra tab in your Notebook interface from where you can easily (de)activate all installed extensions.
I had a similar issue and the “nbextensions” pointed out by @Energya worked very well and effortlessly. The install instructions are straightforward (I tried with Anaconda on Windows) for the notebook extensions and for their configurator.
That said, I would like to add that the following extensions should be of interest.
Hide Input | Allows hiding of an individual code cell in a notebook, by clicking on a toolbar button
Collapsible Headings | Allows the notebook to have collapsible sections, separated by headings
Codefolding | This has been mentioned but I add it for completeness
Answer 4
Create a custom.js file inside ~/.jupyter/custom/ with the following content:
$("<style type='text/css'> .cell.code_cell.collapse { max-height:30px; overflow:hidden;} </style>").appendTo("head");
$('.prompt.input_prompt').on('click', function(event){
    console.log("CLICKED", arguments)
    var c = $(event.target.closest('.cell.code_cell'))
    if(c.hasClass('collapse')){
        c.removeClass('collapse');
    }else{
        c.addClass('collapse');
    }
});
Second is the key: after opening Jupyter Notebook, click the Nbextensions tab. Now search for “colla” using the search tool provided by Nbextensions (not the web browser’s), and you will find something called “Collapsible Headings”.
As others have mentioned, you can do this via nbextensions. I wanted to give the brief explanation of what I did, which was quick and easy:
To enable collapsible headings:
In your terminal, enable/install Jupyter Notebook Extensions by first entering:
pip install jupyter_contrib_nbextensions
Then, enter:
jupyter contrib nbextension install
Re-open Jupyter Notebook. Go to “Edit” tab, and select “nbextensions config”.
Un-check box directly under title “Configurable nbextensions”, then select “collapsible headings”.
There are many answers to this question, none of which I find satisfactory (some more than others). Of the many extensions (code folding, folding by headings, and so on), none does what I want in a simple and effective way. I am amazed that a solution has not been implemented (as it has for Jupyter Lab).
In fact, I was so dissatisfied that I have developed a very simple notebook extension that can expand/collapse the code in a notebook cell, while keeping it executable.
# Run me to hide code cells
from IPython.display import display, HTML
display(HTML(r"""<style id=hide>div.input{display:none;}</style><button type="button" onclick="var myStyle = document.getElementById('hide').sheet;myStyle.insertRule('div.input{display:inherit !important;}', 0);">Show inputs</button>"""))
Error reports from most language kernels running in IPython/Jupyter Notebooks indicate the line on which the error occurred; but (at least by default) no line numbers are indicated in Notebooks.
Is it possible to add the line numbers to IPython/Jupyter Notebooks?
CTRL-M L toggles line numbers in the CodeMirror area. See the QuickHelp for other keyboard shortcuts.
In more detail: CTRL-M (or ESC) brings you to command mode; pressing the L key should then toggle the visibility of the current cell’s line numbers. In more recent notebook versions, Shift-L should toggle them for all cells.
If you can’t remember the shortcut, bring up the command palette with Ctrl-Shift-P (Cmd+Shift+P on Mac) and search for “line numbers”; it should let you toggle them and show you the shortcut.
For me, ctrl + m is used to save the webpage as a PNG, so it does not work properly. But I found another way.
On the toolbar, there is a button named “open the command palette”; you can click it and type “line”, and you will see the toggle for cell line numbers there.
You can also find Toggle Line Numbers under View on the top toolbar of the Jupyter notebook in your browser.
This adds/removes the lines numbers in all notebook cells.
For me, Esc+l only added/removed the line numbers of the active cell.
Just for reference, as it was something I was looking for: you can test for presence within the values or the index by appending the “.values” attribute, e.g.
g in df.<your selected field>.values
g in df.index.values
I find that adding the “.values” to get a simple list or ndarray out makes exist or “in” checks run more smoothly with the other python tools. Just thought I’d toss that out there for people.
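A small sketch of why “.values” matters, using a made-up frame (the column name score and the index labels are assumptions for illustration): the in operator on a pandas Series checks the index labels, not the stored values.

```python
import pandas as pd

# Hypothetical data: three rows labelled 'a', 'b', 'c'.
df = pd.DataFrame({"score": [10, 20, 30]}, index=["a", "b", "c"])

# `in` on a Series tests the index labels, not the stored values.
value_in_series = 10 in df.score
# Going through .values tests the underlying array of values.
value_in_values = 10 in df.score.values
# Index membership works directly, or via .values as above.
label_in_index = "a" in df.index
label_in_index_values = "a" in df.index.values

print(value_in_series, value_in_values, label_in_index, label_in_index_values)
```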
Code below does not print boolean, but allows for dataframe subsetting by index… I understand this is likely not the most efficient way to solve the problem, but I (1) like the way this reads and (2) you can easily subset where df1 index exists in df2:
What exactly is the difference between Python and IPython?
If I write code in Python, will it run in IPython as is or does it need to be modified?
I know IPython is supposed to be an interactive shell for Python, but is that all? Or is there a language called IPython? If I write something under IPython, will it run in Python, and vice-versa? If there are differences, how do I know what they are? Will all packages used by Python work as is in IPython?
ipython is an interactive shell built with python.
From the project website:
IPython provides a rich toolkit to help you make the most out of using Python, with:
Powerful Python shells (terminal and Qt-based).
A web-based notebook with the same core features but support for code, text, mathematical expressions, inline plots and other rich media.
Support for interactive data visualization and use of GUI toolkits.
Flexible, embeddable interpreters to load into your own projects.
Easy to use, high performance tools for parallel computing.
Note that the first 2 lines tell you it helps you make the most of using Python. Thus, you don’t need to alter your code, the IPython shell runs your python code just like the normal python shell does, only with more features.
I recommend reading the IPython tutorial to get a sense of what features you gain when using IPython.
Even after viewing this thread, I had thought that ipython was a synonym for the python shell, in other words that typing python at the command line put one into ipython mode.
It is in fact, as referenced above, a very cool interactive shell (command line program) that can be installed from iPython.org or simply by running
IPython is a powerful interactive Python interpreter that is more interactive than the standard interpreter.
To get the standard Python interpreter you type python and you will get the >>> prompt from where you can work.
To get IPython interpreter, you need to install it first. pip install ipython.
You type ipython and you get In [1]: as a prompt and you get In [2]: for the next command. You can call history to check the list of previous commands, and write %recall 1 to recall the command.
Even though you are in Python, you can run shell commands directly, like !ping www.google.com.
It looks like a command-line Jupyter notebook, if you have used that before.
You can use [Tab] to autocomplete.
There are a few differences between Python and IPython, but they only concern the interpretation of some syntax, like those mentioned by @Ryan Chase; underneath, the true flavor of Python is maintained even in IPython.
The best part of the IPython is the IPython notebook. You can put all your work into a notebook like script, image files, etc. But with base Python, you can only make the script in a file and execute it.
At start, you need to understand that the IPython is developed with the intention of supporting rich media and Python script in a single integrated container.
Compared to Python, IPython (created by Fernando Perez in 2001) can do everything Python can do. IPython also provides extra features like tab completion, testing, debugging, system calls and many others. You can think of IPython as a powerful interface to the Python language.
You can install Ipython using pip – pip install ipython
You can run Ipython by typing ipython in your terminal window.
From my experience I’ve found that some commands which run in IPython do not run in base Python. For example, pwd and ls don’t work alone in base Python. However they will work if prefaced with a % such as: %pwd and %ls.
Also, in IPython, you can run the cd command like: cd C:\Users\… This doesn’t seem to work in base python, even when prefaced with a % however.
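In base Python, the equivalents of these shell-style commands live in the standard-library os module. A minimal sketch (the temporary directory is just a hypothetical target for the cd equivalent):

```python
import os
import tempfile

start_dir = os.getcwd()            # roughly what %pwd shows
entries = os.listdir(start_dir)    # roughly what %ls lists
print(start_dir, len(entries))

target = tempfile.mkdtemp()        # hypothetical directory to change into
os.chdir(target)                   # the plain-Python counterpart of cd
moved_to = os.getcwd()
print(moved_to)

os.chdir(start_dir)                # return to where we started
```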
I am starting to depend heavily on the IPython notebook app to develop and document algorithms. It is awesome; but there is something that seems like it should be possible, but I can’t figure out how to do it:
I would like to insert a local image into my (local) IPython notebook markdown to aid in documenting an algorithm. I know enough to add something like <img src="image.png"> to the markdown, but that is about as far as my knowledge goes. I assume I could put the image in the directory represented by 127.0.0.1:8888 (or some subdirectory) to be able to access it, but I can’t figure out where that directory is. (I’m working on a mac.) So, is it possible to do what I’m trying to do without too much trouble?
Files inside the notebook dir are available under a “files/” url. So if it’s in the base path, it would be <img src="files/image.png">, and subdirs etc. are also available: <img src="files/subdir/image.png">, etc.
Update: starting with IPython 2.0, the files/ prefix is no longer needed (cf. release notes). So now the solution <img src="image.png"> simply works as expected.
Most of the answers given so far go in the wrong direction, suggesting to load additional libraries and use code instead of markup. In IPython/Jupyter Notebooks it is very simple. Make sure the cell is indeed a Markdown cell, and to display an image use:
![alt text](imagename.png "Title")
Further advantage compared to the other methods proposed is that you can display all common file formats including jpg, png, and gif (animations).
However, I found that the images appeared broken in Print View (on my Windows machine running the Anaconda distribution of IPython version 0.13.2 in a Chrome browser)
The workaround for this was to use <img src="../files/image.png"> instead.
This made the image appear correctly in both Print View and the normal iPython editing view.
UPDATE: as of my upgrade to iPython v1.1.0 there is no more need for this workaround since the print view no longer exists. In fact, you must avoid this workaround since it prevents the nbconvert tool from finding the files.
I would like to use an IPython notebook as a way to interactively analyze some genome charts I am making with Biopython’s GenomeDiagram module. While there is extensive documentation on how to use matplotlib to get graphs inline in IPython notebook, GenomeDiagram uses the ReportLab toolkit which I don’t think is supported for inline graphing in IPython.
I was thinking, however, that a way around this would be to write out the plot/genome diagram to a file and then open the image inline which would have the same result with something like this:
If you are trying to display an Image in this way inside a loop, then you need to wrap the Image constructor in a display method.
from IPython.display import Image, display
listOfImageNames = ['/path/to/images/1.png',
                    '/path/to/images/2.png']
for imageName in listOfImageNames:
    display(Image(filename=imageName))
Note, until now posted solutions only work for png and jpg!
If you want it even easier, without importing further libraries, or you want to display an animated or non-animated GIF file in your IPython Notebook: change the cell where you want to display it to Markdown and use this nice short hack!
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np
pil_im = Image.open('image.png')  # Takes jpg + png
## Uncomment to open from URL
# import requests
# from io import BytesIO
# r = requests.get('https://www.vegvesen.no/public/webkamera/kamera?id=131206')
# pil_im = Image.open(BytesIO(r.content))
im_array = np.asarray(pil_im)
plt.imshow(im_array)
plt.show()
There are some other useful functions in that package where you can display images in interactive tabs (separate tab for each label/class) which is very helpful for all the ML classification tasks.
When using GenomeDiagram with Jupyter (iPython), the easiest way to display images is by converting the GenomeDiagram to a PNG image. This can be wrapped using an IPython.display.Image object to make it display in the notebook.
I would like to get the time spent on the cell execution in addition to the original output from cell.
To this end, I tried %%timeit -r1 -n1 but it doesn’t expose the variable defined within cell.
%%time works for a cell which only contains 1 statement.
In[1]: %%time
1
CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 5.96 µs
Out[1]: 1
In[2]: %%time
# Notice there is no out result in this case.
x = 1
x
CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.96 µs
The only way I found to overcome this problem is by executing the last statement with print.
Do not forget that cell magic starts with %% and line magic starts with %.
%%time
clf = tree.DecisionTreeRegressor().fit(X_train, y_train)
res = clf.predict(X_test)
print(res)
Notice that any changes performed inside the cell are not taken into consideration in the next cells, something that is counter intuitive when there is a pipeline:
I simply added %%time at the beginning of the cell and got the time. The same works on a Jupyter Spark cluster or in a virtual environment: just add %%time at the top of the cell. On a Spark cluster using Jupyter, I got output like the following:
[1] %%time
import pandas as pd
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
import numpy as np
.... code ....
Output :-
CPU times: user 59.8 s, sys: 4.97 s, total: 1min 4s
Wall time: 1min 18s
Answer 5
import time
start = time.time()
"the code you want to test stays here"
end = time.time()
print(end - start)
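The same manual pattern can be wrapped in a small context manager so it is reusable. This is a sketch under assumptions: the helper name timed and the timings dictionary are hypothetical, and time.perf_counter is swapped in for time.time because it is the standard-library clock intended for measuring short intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("sum of squares", timings):
    total = sum(i * i for i in range(100000))

print("sum of squares: {:.6f} s".format(timings["sum of squares"]))
```

Unlike %%time, variables assigned inside the with-block (here total) remain ordinary variables of the surrounding code, and nothing about the cell's last expression is changed.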
Sometimes the formatting is different in a cell when using print(res), but jupyter/ipython comes with a display. See an example of the formatting difference using pandas below.
%%time
import pandas as pd
from IPython.display import display
df = pd.DataFrame({"col0":{"a":0,"b":0}
,"col1":{"a":1,"b":1}
,"col2":{"a":2,"b":2}
})
#compare the following
print(df)
display(df)
The display statement can preserve the formatting.
Answer 9
You may also want to look at Python’s profiling magic command %prun, which gives output like the following:
def sum_of_lists(N):
    total = 0
    for i in range(5):
        L = [j ^ (j >> i) for j in range(N)]
        total += sum(L)
    return total
Then
%prun sum_of_lists(1000000)
will return
14 function calls in 0.714 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5    0.599    0.120    0.599    0.120 <ipython-input-19>:4(<listcomp>)
     5    0.064    0.013    0.064    0.013 {built-in method sum}
     1    0.036    0.036    0.699    0.699 <ipython-input-19>:1(sum_of_lists)
     1    0.014    0.014    0.714    0.714 <string>:1(<module>)
     1    0.000    0.000    0.714    0.714 {built-in method exec}
I find it useful when working with large chunks of code.
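%prun only exists inside IPython. Outside it, a similar report can be produced with the standard-library cProfile and pstats modules. A minimal sketch, using a smaller N than above so it runs quickly; the exact rows in the report vary by Python version.

```python
import cProfile
import io
import pstats


def sum_of_lists(N):
    total = 0
    for i in range(5):
        L = [j ^ (j >> i) for j in range(N)]
        total += sum(L)
    return total


profiler = cProfile.Profile()
profiler.enable()
result = sum_of_lists(10000)
profiler.disable()

# Sort by internal time, the same ordering as the %prun output shown above.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats()
report = stream.getvalue()
print(report)
```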
Answer 10
When in trouble, ask what it means:
?%timeit or ??timeit
To get the details:
Usage, in line mode:
%timeit [-n<N> -r<R> [-t|-c] -q -p<P> -o] statement
or in cell mode:
%%timeit [-n<N> -r<R> [-t|-c] -q -p<P> -o] setup_code
code
code...
Time execution of a Python statement or expression using the timeit
module. This function can be used both as a line and cell magic:
- In line mode you can time a single-line statement (though multiple
ones can be chained with using semicolons).
- In cell mode, the statement in the first line is used as setup code
(executed but not timed) and the body of the cell is timed. The cell
body has access to any variables created in the setup code.
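The setup/body split described in this help text maps onto the standard-library timeit module, which can be called directly outside IPython. A small sketch with made-up statements: setup plays the role of the first line in cell mode, number corresponds to -n and repeat to -r.

```python
import timeit

# Setup code is executed once per repeat but is not included in the timing.
setup_code = "data = list(range(1000))"
loops = 1000

# Three repeats of 1000 loops each; each entry is the total time of one repeat.
timings = timeit.repeat("sum(data)", setup=setup_code, repeat=3, number=loops)
best_per_loop = min(timings) / loops

print("best of {}: {:.3e} s per loop".format(len(timings), best_per_loop))
```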