January 2nd, 2020
Over the years, SciPy has emerged as one of the best frameworks for data science. SciPy defines itself as an ecosystem of open-source software for mathematics, science, and engineering. Its six core packages are NumPy, the SciPy library, Matplotlib, IPython, SymPy, and pandas. In this post, I'll use NumPy and pandas to optimize the slow implementation from the last post.
Test Notebook – https://www.translucentcomputing.com/2020/01/pandas-and-numpy-performance-test-notebook/
Additional Notebook with restructured code – https://www.translucentcomputing.com/2020/01/performance-waveform-generator-starter-notebook/
The data science task is to generate synthetic time-series data. In the previous post, we created the first naive implementation.
import math
import random

def reallySlowGenerateTimeSeriesData(seconds, samples_per_second):
    """Generate synthetic data"""
    time = []
    signal = []
    # generate signal
    sample_time = 0
    for s in range(seconds):
        for sps in range(samples_per_second):
            sample_time += 1/samples_per_second
            noise = random.random()
            scaled_noise = -1 + (noise * 2)
            sample = math.sin(2*math.pi*10*sample_time) + scaled_noise
            time.append(sample_time)
            signal.append(sample)
    # return time and signal
    return [time, signal]
Running it locally, it takes about 14 seconds on average, which is really slow. It might not be clear from the profiler output alone where you should start the refactoring.
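For reference, the line-by-line profile below was captured with the line_profiler IPython extension; a minimal sketch of the invocation (this assumes line_profiler is installed, and the 3600-second, 1000-samples-per-second arguments are inferred from the hit counts in the output):

%load_ext line_profiler
%lprun -f reallySlowGenerateTimeSeriesData reallySlowGenerateTimeSeriesData(3600, 1000)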
Total time: 14.0719 s
File: <ipython-input-5-ad467de0be46>
Function: reallySlowGenerateTimeSeriesData at line 2
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2 def reallySlowGenerateTimeSeriesData(seconds,samples_per_second):
3 """Generate synthetic data"""
4
5 1 4.0 4.0 0.0 time = []
6 1 1.0 1.0 0.0 signal = []
7
8 # generate signal
9 1 1.0 1.0 0.0 sample_time = 0
10 3601 1889.0 0.5 0.0 for s in range(seconds):
11 3603600 1589537.0 0.4 11.3 for sps in range(samples_per_second):
12 3600000 2019769.0 0.6 14.4 sample_time += 1/samples_per_second
13 3600000 2079579.0 0.6 14.8 noise = random.random()
14 3600000 1921001.0 0.5 13.7 scaled_noise = -1 + (noise * 2)
15 3600000 2801165.0 0.8 19.9 sample = math.sin(2*math.pi*10*sample_time) + scaled_noise
16 3600000 1904107.0 0.5 13.5 time.append(sample_time)
17 3600000 1754810.0 0.5 12.5 signal.append(sample)
18
19 # return time and signal
20 1 1.0 1.0 0.0 return [time,signal]
The time is evenly distributed across the lines inside the nested for loops; there is no smoking gun such as a single obviously slow statement. Your software engineering spidey sense should instead be drawn to the loops themselves: each of the 3.6 million samples pays the overhead of interpreted Python for every arithmetic operation, function call, and list append. Let's start there.
The Python community, and the scientific software development community in general, has adopted the term "vectorization" to mean array programming (array-oriented computing): executing your "business logic" directly on whole arrays without writing explicit loops. In Python, NumPy is the go-to library for vectorization.
NumPy is an extension package to Python for array programming. It provides "closer to the hardware" optimization, which in Python means the heavy lifting happens in compiled C code.
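As a quick illustration of the difference, here is a minimal sketch (the array size and names are illustrative only) that computes the same sine expression with a Python loop and with a single vectorized NumPy call:

import math
import numpy as np

x = np.linspace(0, 1, 1_000_000)

# loop version: one interpreted math.sin call per element
loop_result = [math.sin(2 * math.pi * 10 * t) for t in x]

# vectorized version: one call, the loop runs in compiled C code
vector_result = np.sin(2 * np.pi * 10 * x)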
Looking at the first implementation, we see that the nested loop builds the time list one sample at a time. Let's start by refactoring the time list to use a NumPy array.
import numpy as np

def slightlyFasterGenerateTimeSeriesData(seconds, samples_per_second):
    """Generate synthetic data"""
    # generate time
    time = np.arange(0, seconds, 1/samples_per_second)
    # generate signal
    signal = []
    for t in time:
        noise = random.random()
        scaled_noise = -1 + (noise * 2)
        sample = math.sin(2*math.pi*10*t) + scaled_noise
        signal.append(sample)
    # return time and signal
    return [time, signal]
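For clarity, here is a small sketch of the time axis that np.arange builds, using a deliberately tiny illustrative sample rate. Note that np.arange starts at 0, while the loop version's first sample time is 1/samples_per_second; for this synthetic signal the one-sample offset makes no practical difference:

# illustrative values only: 2 seconds at 4 samples per second
np.arange(0, 2, 1/4)
# array([0.  , 0.25, 0.5 , 0.75, 1.  , 1.25, 1.5 , 1.75])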
This optimization reduced the average run time to about 10 seconds. Since it looks like we are going in the right direction, let's try a full NumPy implementation.
def reallyFastGenerateTimeSeriesData(seconds, samples_per_second):
    """Generate synthetic data"""
    # generate time
    time = np.arange(0, seconds, 1/samples_per_second)
    # generate signal: uniform noise scaled to roughly (-1, 1)
    noise = -2 * np.random.random(len(time)) + 1
    signal = np.sin(2*np.pi*10*time) + noise
    # return time and signal
    return [time, signal]
WOW! This implementation executes in about 0.1 seconds, roughly two orders of magnitude faster than the original.
The NumPy library comes with vectorized versions of most of the mathematical functions in the Python core, of the random module, and a lot more. In this implementation, the Python math and random functions were replaced with their NumPy equivalents, and the signal is generated directly on NumPy arrays without any loops.
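In the notebook the three implementations can be compared side by side with %timeit; a minimal sketch using the same 3600-second, 1000-samples-per-second workload as the profile above (the commented times are the averages reported in this post, not fresh measurements):

%timeit reallySlowGenerateTimeSeriesData(3600, 1000)      # ~14 s
%timeit slightlyFasterGenerateTimeSeriesData(3600, 1000)  # ~10 s
%timeit reallyFastGenerateTimeSeriesData(3600, 1000)      # ~0.1 s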
When it comes to data manipulation and analysis, the actual data science work, you reach for another library. While NumPy is the workhorse, pandas is the tool for manipulating and analyzing data.
pandas is usually the starting point for your data science tasks. It lets you read and write data from and to multiple sources, handle missing data, align your data, reshape it, merge and join it with other data, search it, group it, and slice it; it really is a Swiss Army knife.
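As a tiny, purely illustrative sketch of that Swiss Army knife role (the column names and values here are made up and are not part of the waveform example):

import numpy as np
import pandas as pd

# a small frame with one missing value
df = pd.DataFrame({'time': [0.0, 0.1, 0.2, 0.3],
                   'signal': [0.5, np.nan, -0.2, 0.7]})

df['signal'] = df['signal'].fillna(0)            # handle missing data
positive = df[df['signal'] > 0]                  # slice / filter rows
summary = df.groupby(df['signal'] > 0).mean()    # group and aggregate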
Since you will most likely start with pandas, let's refactor the code to use it, beginning with the original implementation and adding pandas to it.
import pandas as pd

def pandasReallySlowGenerateTimeSeriesData(seconds, samples_per_second):
    """Generate synthetic data"""
    # generate time
    time = np.arange(0, seconds, 1/samples_per_second)
    # create pandas DataFrame
    df = pd.DataFrame(data=time, columns=['time'])
    def generateSignal(t):
        noise = random.random()
        scaled_noise = -1 + (noise * 2)
        return math.sin(2*math.pi*10*t) + scaled_noise
    # generate signal
    df['signal'] = df['time'].apply(lambda t: generateSignal(t))
    # return time and signal
    return [df['time'], df['signal']]
On average this implementation runs in 5 seconds. The .apply call still invokes the Python function once per row, so it is effectively a hidden loop. pandas, like NumPy, has been designed to work with vectorization, so let's update the code to use it.
def pandasFasterSlowGenerateTimeSeriesData(seconds, samples_per_second):
    """Generate synthetic data"""
    # generate time
    time = np.arange(0, seconds, 1/samples_per_second)
    # create pandas DataFrame
    df = pd.DataFrame(data=time, columns=['time'])
    def generateSignal(t):
        noise = -2 * np.random.random(len(t)) + 1
        return np.sin(2*np.pi*10*t) + noise
    # generate signal
    df['signal'] = generateSignal(df['time'])
    # return time and signal
    return [df['time'], df['signal']]
This implementation runs in about 0.12 seconds, so we are back to reasonable running times. pandas also integrates tightly with NumPy, and you can often squeeze out a bit more performance by handing raw NumPy arrays to your vectorized code. Here is one such implementation.
def pandasNumpyFastSlowGenerateTimeSeriesData(seconds, samples_per_second):
    """Generate synthetic data"""
    # generate time
    time = np.arange(0, seconds, 1/samples_per_second)
    # create pandas DataFrame
    df = pd.DataFrame(data=time, columns=['time'])
    def generateSignal(t):
        noise = -2 * np.random.random(len(t)) + 1
        return np.sin(2*np.pi*10*t) + noise
    # generate signal, passing the underlying NumPy array instead of a Series
    df['signal'] = generateSignal(df['time'].values)
    # return time and signal
    return [df['time'], df['signal']]
The only change is calling .values on df['time'], which hands the vectorized signal function the underlying NumPy ndarray instead of a pandas Series. This implementation is slightly faster under the same test conditions, and it scales nicely with more data and additional processing.
# Data types
type(dataFrame['time'])           # pandas.core.series.Series
type(dataFrame['time'].values)    # numpy.ndarray
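To see the effect, the vectorized expression can be timed against the Series and against its underlying ndarray; a minimal sketch that rebuilds the same frame the functions above create (the gap is small at this data size and depends on hardware):

time = np.arange(0, 3600, 1/1000)
dataFrame = pd.DataFrame(data=time, columns=['time'])

%timeit np.sin(2*np.pi*10*dataFrame['time'])         # pandas Series input
%timeit np.sin(2*np.pi*10*dataFrame['time'].values)  # raw NumPy ndarray input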
Since this is already a long post, I’ll get to SymPy in the next post. Here is a visualization of the generated synthetic time-series data.
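If you want to reproduce the plot yourself, Matplotlib (another core SciPy package) will do it; a minimal sketch that plots just the first second of data:

import matplotlib.pyplot as plt

# time and signal from the fast NumPy implementation above
time, signal = reallyFastGenerateTimeSeriesData(3600, 1000)

plt.figure(figsize=(10, 4))
plt.plot(time[:1000], signal[:1000])   # first second: 1000 samples
plt.xlabel('time (s)')
plt.ylabel('amplitude')
plt.title('Synthetic time-series data')
plt.show()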
As you can see, and by construction, this noisy data carries no information on its own. In the notebook, I show the initial steps of adding information into the data. The data is rendered as a sound wave here, but if you change your perspective a bit, the same process and data structure apply to other sources of time-series data, such as a heartbeat.
The Python ecosystem is full of libraries that have been battle-tested and optimized for data science. Data science tasks are computationally intensive, and writing efficient code from the start, rather than buying into the mantra that premature optimization is the root of all evil, will let you work efficiently in your local environment before moving to the cloud.
by Patryk Golabek in Applied Machine Learning