class: center background-image: url(watch.jpg) # Python Performance Tuning # Coverage / Profiling / Benchmarking Michael Droettboom Space Telescope Science Institute --- class: right background-image: url(umbrella.jpg) # Coverage --- # Coverage ## coverage.py http://coverage.readthedocs.org/en/latest/ Install: ``` $ conda install coverage $ pip install coverage ``` --- ## Example ```python for i in range(10): if i > 0: print('Positive') elif i < 0: print('Negative!') ``` ## Collect results ``` $ coverage run coverage_example.py ``` ## Display results ``` $ coverage html ``` Open `htmlcov/index.html` in browser. --- ![center-aligned image](screenshot.png) --- # Integrating coverage with unit tests ### py.test Install pytest-cov plugin ``` $ py.test --cov-report=html projectname ``` ### nosetests Built-in coverage plugin ``` $ nosetests --cover-html projectname ``` ### astropy-helpers / package-template ``` $ setup.py test --coverage ``` --- # Coveralls https://coveralls.io/r/astropy/astropy --- class: center, bottom background-image: url(morningsky.jpg) # Profiling ## High-level profiling ## Functional profiling ## Line-by-line profiling --- # Background -- CPU speed -- Amount of memory -- Memory bus -- Cache -- Storage -- Network -- Platform -- Python -- Operating System --- # Background TL;DR: There's no such thing as tuning in general --- # Keepin' it simple ### bash ``` # time python psrecord_example.py real 0m1.828s user 0m1.377s sys 0m0.452s ``` ### zsh ``` $ time python example.py python example.py 1.39s user 0.43s system 100% cpu 1.816 total ``` ### time ``` $ /usr/bin/time python example.py 1.37user 0.44system 0:01.81elapsed 100%CPU (0avgtext+0avgdata 1658904maxresident)k 0inputs+0outputs (0major+417235minor)pagefaults 0swaps ``` --- # Problems? -- ### Not repeatable - Caching - State of machine etc. -- ### Includes too much... --- ``` $ strace python -c "" % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 19.74 0.000315 5 64 57 execve *14.91 0.000238 0 1001 901 open 11.28 0.000180 1 139 mmap 8.83 0.000141 10 14 7 wait4 8.65 0.000138 0 636 350 stat 6.52 0.000104 2 50 mprotect 4.39 0.000070 0 154 read 4.14 0.000066 1 120 close 3.45 0.000055 0 136 fstat 2.76 0.000044 1 47 brk 2.19 0.000035 1 48 munmap 2.07 0.000033 5 7 clone 2.01 0.000032 0 122 rt_sigaction 1.88 0.000030 2 19 7 access 1.00 0.000016 1 16 geteuid 0.88 0.000014 1 16 getuid 0.88 0.000014 1 16 getgid 0.81 0.000013 1 16 getegid 0.75 0.000012 2 7 arch_prctl 0.50 0.000008 0 64 rt_sigprocmask 0.50 0.000008 1 14 3 fcntl 0.44 0.000007 0 75 1 lseek 0.44 0.000007 1 7 getrlimit 0.19 0.000003 0 9 5 ioctl 0.19 0.000003 1 5 dup2 0.19 0.000003 1 3 uname 0.19 0.000003 1 3 getppid 0.13 0.000002 1 3 getpid 0.13 0.000002 1 3 getpgrp 0.00 0.000000 0 1 write 0.00 0.000000 0 42 8 lstat 0.00 0.000000 0 7 rt_sigreturn 0.00 0.000000 0 7 pipe 0.00 0.000000 0 10 getdents 0.00 0.000000 0 1 getcwd 0.00 0.000000 0 3 2 readlink 0.00 0.000000 0 1 set_tid_address 0.00 0.000000 0 5 openat 0.00 0.000000 0 1 1 faccessat 0.00 0.000000 0 1 set_robust_list ------ ----------- ----------- --------- --------- ---------------- 100.00 0.001596 2893 1342 total ``` --- # Simple, but more precise ``` $ ipython In [2]: import example In [3]: %timeit example.combine() 1 loops, best of 3: 1.74 s per loop ``` --- # ⚠ Detour 1: CPU or I/O bound? ```python import numpy as np x = np.ones((1024, 1024)) for i in range(100): y = np.random.rand(1024, 1024) x *= y ``` ``` python example.py 1.27s user 0.05s system 100% cpu 1.316 total ``` ```python import numpy as np from astropy.io import fits x = np.ones((1024, 1024)) for i in range(100): y = fits.open( 'http://stsdas.stsci.edu/download/mdroe/test.fits')[0].data x *= y ``` ``` python io_bound.py 0.51s user 0.18s system 2% cpu 26.045 total ``` --- # I: High-level profiling ## psrecord https://github.com/astrofrog/psrecord ``` $ pip install psrecord $ pip install psutil ``` Also requires matplotlib. --- ```python import numpy as np data_arrays = [] for i in range(100): data = np.random.rand(1024, 1024) data_arrays.append(data) combined = np.prod(data_arrays) ``` ## Record activity and plot ``` psrecord "python example.py" --log activity.txt --plot psrecord.png ``` --- ```python import numpy as np data_arrays = [] for i in range(100): data = np.random.rand(1024, 1024) data_arrays.append(data) combined = np.prod(data_arrays) ``` ![Default-aligned image](activity.png) --- ```python import numpy as np combined = np.ones((1024, 1024)) for i in range(100): data = np.random.rand(1024, 1024) combined *= data ``` ![Default-aligned image](activity2.png) --- # Alternatives ## valgrind's memcheck - Can answer *where* memory is being used at the C level ## memory_profiler - Line-by-line memory usage - Doesn't play nice with Numpy arrays --- # II. Functional profiling ### Gather data Use the `cProfile` module in the standard library. ### Visualize - Snakeviz ``` pip install snakeviz ``` - RunSnakeRun - kcachegrind --- ```python import numpy as np from astropy.io import fits def read_data(): return fits.open("test.fits")[0].data def write_data(data): fits.writeto("output.fits", data, clobber=True) def calculate(data): for i in range(5): data *= data def main(): data = read_data() calculate(data) write_data(data) if __name__ == '__main__': main() ``` --- ``` $ python -m cProfile -o profile.dat profile_example.py ``` ``` $ snakeviz profile.dat ``` --- profile_example.py ```python import numpy as np from astropy.io import fits def read_data(): return fits.open("test.fits")[0].data def write_data(data): fits.writeto("output.fits", data, clobber=True) def calculate(data): for i in range(5): data *= data def main(): data = read_data() calculate(data) write_data(data) if __name__ == '__main__': * import cProfile * import re * cProfile.run('main()', 'profile.dat') ``` --- profile_example2.py ```python import numpy as np from astropy.io import fits def read_data(): return fits.open("test.fits")[0].data def write_data(data): fits.writeto("output.fits", data, clobber=True) def calculate(data): * data **= (2 ** 5) def main(): data = read_data() calculate(data) write_data(data) if __name__ == '__main__': import cProfile import re cProfile.run('main()', 'profile.dat') ``` --- # ⚠ Detour 2: Things are not always what they seem -- Premature optimization is the root of all evil. — Donald Knuth --- # ⚠ Detour 2: Things are not always what they seem ~~Premature~~ blind optimization is the root of all evil. — Donald Knuth --- ```python def normalize(T): pass def intersection(A, B, C, D): pass def main(): pass if __name__ == '__main__': import cProfile cProfile.run('main()', 'profile.dat') ``` --- ```python import numpy as np from numpy.core.umath_tests import inner1d def normalize(T): l = np.sqrt(np.sum(T ** 2, axis=-1)) # Might get some divide-by-zeros, but we don't care with np.errstate(invalid='ignore'): TN = T / l return TN ``` --- ```python def intersection(A, B, C, D): ABX = np.cross(A, B) CDX = np.cross(C, D) T = np.cross(ABX, CDX) T = normalize(T) s = 0 s += np.sign(inner1d(np.cross(ABX, A), T)) s += np.sign(inner1d(np.cross(B, ABX), T)) s += np.sign(inner1d(np.cross(CDX, C), T)) s += np.sign(inner1d(np.cross(D, CDX), T)) if s == -4: return -T elif s == 4: return T else: return [np.nan, np.nan, np.nan] ``` --- ```python def main(): A = np.array([ 0.4246086 , 0.04379716, 0.90431706]) B = np.array([ 0.41771239, 0.08700592, 0.90440386]) C = np.array([ 0.35406757, 0.06995283, 0.9326 ]) D = np.array([ 0.34465183, 0.10538573, 0.93279631]) for x in range(500): intersection(A, B, C, D) if __name__ == '__main__': import cProfile cProfile.run('main()', 'profile.dat') ``` --- ```python import numpy as np from numpy.core.umath_tests import inner1d def cross3d(a, b): cp = np.empty(np.broadcast(a, b).shape) aT = a.T bT = b.T cpT = cp.T cpT[0] = aT[1]*bT[2] - aT[2]*bT[1] cpT[1] = aT[2]*bT[0] - aT[0]*bT[2] cpT[2] = aT[0]*bT[1] - aT[1]*bT[0] return cp ``` --- # Alternative GUIs ## RunSnakeRun ## kcachegrind --- # Line-by-line profiling ## kernprof ``` pip install line_profiler ``` --- ```python @profile def normalize(T): l = np.sqrt(np.sum(T ** 2, axis=-1)) # Might get some divide-by-zeros, but we don't care with np.errstate(invalid='ignore'): TN = T / l return TN ``` --- ``` $ kernprof -lv kernprof_example.py ``` ``` Wrote profile results to kernprof_example.py.lprof Timer unit: 1e-06 s Total time: 0.018569 s File: kernprof_example.py Function: normalize at line 18 Line # Hits Time Per Hit % Time Line Contents ============================================================== 18 @profile 19 def normalize(T): 20 500 4944 9.9 26.6 l = np.sqrt(np.sum(T ** 2, axis=-1)) 21 # Might get some divide-by-zeros, but we don't care 22 500 6783 13.6 36.5 with np.errstate(invalid='ignore'): 23 500 6615 13.2 35.6 TN = T / l 24 500 227 0.5 1.2 return TN ``` --- ```python @profile def normalize(T): l = np.sqrt(np.sum(T ** 2, axis=-1)) # Might get some divide-by-zeros, but we don't care with np.errstate(invalid='ignore'): TN = T / l return TN ``` ```python @profile def normalize(T): l = np.sqrt(np.sum(T ** 2, axis=-1)) if l == 0.0: return np.array([0, 0, 0]) else: return T / l ``` --- background-image: url(finish.jpg) # Benchmarking --- # Airspeed Velocity ## Tracking performance of a project over time http://www.astropy.org/astropy-benchmarks/ https://pv.github.io/scipy-bench/ ---