Jupyter Notebook is the primary tool we use for data science tasks at Translucent. It allows for quick exploration of data and quick feedback from experimentations. Often the exploration and experimentations lead to long-running jobs in the cloud. We have vast amounts of computational resources in the cloud; locally, not so much. Most of the time, we start the data science process locally on our computers, so performance is essential when dealing with big data sets and complicated algorithms. Cloud resources and performance not withstanding, writing efficient code from the start does improve your whole processing pipeline.
There are 3 topics I’d like to cover in a series of posts:
When writing code, in the notebooks, there is a tendency to throw away code design patterns. You get into a rhythm where you segment your code into cells, and you forget about basic coding principals like encapsulation. You can make an argument for structuring the code outside the notebook and not using the notebook as a driver for development. I acknowledge that point, and for these posts, the notebook is the driver for code development.
I always encourage developers to start with a method instead of writing code directly in a cell. Create a method in the cell and encapsulate that code in the method. Also, in the spirit of TDD, the code writing process is flexible. You should always go back and refactor the code once your spidey sense detects code smell. Having code that is well structured helps to reason about the code.
Code Smell – https://refactoring.guru/refactoring/smells
I strongly disagree with the sentiment that you should not focus on performance or code quality initially, for the simple reason that we should not ignore computer science when doing data science. We should inject computer science back into data science. I’m getting too ranty. Good code structure helps with performance analysis using magic commands.
I define magic commands as a preprocessor. There are:
They do work before a line of code is executed in a cell or before a cell is executed. When it comes to performance, these are the commands available by default:
The next two commands, which I mostly use, require installation and loading into the notebook with a magic command:
You install both modules with conda or pip.
conda install -c conda-forge line_profiler
conda install -c conda-forge memory_profiler
You load the modules into the notebook with a load_ext magic command.
Here is my starting notebook.
Both cpu and memory line profilers are loaded, also autoreload module and others which I’ll get to in other posts. The autoreload module is there to help with code structure. The module reloads the code before each execution. Once you get locked into a TDD loop and you start refactoring code from the notebook into additional files, this module will reload the code in the additional files.
Our test example is a generation of synthetic data to be used with a machine learning algorithm. The first cell defines our method to generate some data. The second cell executes and profiles the method. The third cell is there to help visualize the data.
Every single line in the reallySlowGenerateTimeSeriesData() test method has to be re-coded.
Encapsulating the code in the reallySlowGenerateTimeSeriesData() method allows us to run the line profiler on the whole method. You can profile individual lines in the method. Doing that does not tell you how that line fits in the scope of the method, which might have other issues before the profiled line that leads to the slowdown of the given line.
The lprun magic command before the method in the second cell profiles the method and outputs the profile results. It gives the total execution time and a breakdown byline of the whole method. Each line has % of execution time and the number of times it was executed. With all this feedback, we can start to reason about the performance or the lack of performance in this implementation.
Full Code with memory profile – https://www.translucentcomputing.com/2019/12/slow-performance-test-notebook/
Do not ignore performance. We should use the profiler tools from the first line of code that we write, and structure the code to facilitate performance testing. During the initial phase, you most likely are re-running cells in your notebook many, many times. It all adds up, and you can find yourself in a position where you spend most of your day just waiting for slow code.
The total execution time of the really, really slow method was about 12 seconds. In the next post, I’ll transform the method line by line to get it executed in a fraction of a second.
December 31st, 2019
by Patryk Golabek in Applied Machine Learning
⟵ Back
See more:
December 10th, 2021
Cloud Composer – Terraform Deploymentby Patryk Golabek in Data-Driven, Technology
December 2nd, 2021
Provision Kubernetes: Securing Virtual MachinesAugust 6th, 2023
The Critical Need for Application Modernization in SMEs