I programmed a little todo-list tool for the command line, and I'm not entirely satisfied with its "performance". What bothers me is that I can notice a tiny delay between the moment I hit the `todo` command and the moment the output is written to standard output. So I analysed how much time it took, and where that time was lost. My initial intuition was that most of the time was spent querying an SQLite database, but I was very wrong.
Note that all performance numbers in this article come from an entry-level, Intel Pentium-powered computer; the program is faster on my newer laptop. All the measurements are about running the bare `todo` command, which simply prints the list of pending tasks.
To time the program, we can use the UNIX `time` command:

```
$ time todo
[...]

real	0m0.097s
user	0m0.078s
sys	0m0.016s
```
It takes 97 ms for the `todo` command to execute, hence the very small, but noticeable, delay. As a comparison, `cat`-ing a ~200-line file takes about 6 ms. So 97 ms to print a bunch of tasks is really slow in absolute terms.
To investigate further, we need the `cProfile` Python module, which breaks down the time taken by each function call and each module import. `cProfile` can be used as a stand-alone executable thanks to the interpreter's `-m` flag. We're sorting the function calls by cumulative time (meaning that if a function is called multiple times, all those times are summed together), descending:
```
$ python -m cProfile -s cumtime ./todo.py
         34229 function calls (33095 primitive calls) in 0.050 seconds

   Ordered by: cumulative time
[...]
```
The important information here is that cProfile tells us the code executed in 0.050 seconds. We are missing 47 ms to be consistent with the 97 ms total. Well, surprise: those 47 ms are actually the time the Python interpreter takes to start up, before even beginning to execute the script. This can be verified by timing the interpreter while it executes an empty program:
```
$ time python -c ''

real	0m0.045s
user	0m0.028s
sys	0m0.012s
```
45 ms. Checks out, within a small margin of error.
That is very bad. It means that almost 50% of the total time it takes for the program to run is due to Python starting up, a contextual cost that is out of my control.
Let's analyse the other 50%.
I now look at the listing of function calls, sorted by cumulative time, descending (I've removed lines containing Python internal stuff):
```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.050    0.050 todo.py:3(<module>)
        1    0.000    0.000    0.028    0.028 todocore.py:120(main)
        1    0.000    0.000    0.021    0.022 todocore.py:36(<module>)
        1    0.000    0.000    0.012    0.012 docopt.py:493(docopt)
       35    0.000    0.000    0.010    0.000 re.py:277(_compile)
        1    0.000    0.000    0.010    0.010 todo.py:527(feedback_todo)
       29    0.000    0.000    0.010    0.000 todo.py:676(safe_print)
        1    0.000    0.000    0.010    0.010 subprocess.py:356(<module>)
       22    0.000    0.000    0.010    0.000 sre_compile.py:563(compile)
       23    0.000    0.000    0.009    0.000 re.py:221(compile)
        1    0.000    0.000    0.008    0.008 docopt.py:40(fix)
        2    0.000    0.000    0.007    0.003 todo.py:595(get_basic_task_string)
        2    0.000    0.000    0.007    0.003 todo.py:612(get_task_string_components)
        1    0.000    0.000    0.006    0.006 configparser.py:119(<module>)
        2    0.000    0.000    0.006    0.003 todo.py:683(get_datetime)
     1660    0.001    0.000    0.006    0.000 docopt.py:34(__eq__)
     42/1    0.001    0.000    0.006    0.006 docopt.py:45(fix_identities)
       22    0.000    0.000    0.006    0.000 sre_parse.py:750(parse)
    53/22    0.000    0.000    0.006    0.000 sre_parse.py:363(_parse_sub)
        1    0.000    0.000    0.004    0.004 dbapi2.py:23(<module>)
        1    0.000    0.000    0.002    0.002 datetime.py:5(<module>)
[...]
```
A bit of guidance here:

- `cumtime` is the column we're interested in.
- `filename:lineno(function)` is the column giving us the performance culprits. In this last column, module imports are marked with the conventional name `<module>`. A module import is simply the source code of the module being executed from top to bottom (with the side effect of its top-level variables being bound as attributes of the object the module is imported as).
- The first line is always the execution of the overall script (no different than importing it: execution from top to bottom) and therefore accounts for the entire measured time.
- The only thing `todo.py` does is import `todocore.py` and execute its `main` function. The only thing `todocore.py` does is import libraries and define functions.
The important information here is that the `main` function takes 28 ms to execute. We are missing 22 ms to be consistent with the overall 50 ms execution time. Those 22 ms are spent importing `todocore.py`, which is not satisfying, because `todocore.py` does nothing: it imports libraries and defines functions. Well, it turns out that in Python, importing libraries isn't nothing at all.
If we grep the previous listing for `<module>` to keep imports only, we end up with:

```
        1    0.000    0.000    0.021    0.022 todocore.py:36(<module>)
        1    0.000    0.000    0.010    0.010 subprocess.py:356(<module>)
        1    0.000    0.000    0.006    0.006 configparser.py:119(<module>)
        1    0.000    0.000    0.004    0.004 dbapi2.py:23(<module>)
        1    0.000    0.000    0.002    0.002 datetime.py:5(<module>)
```
The first line is the root import of `todocore.py` (22 ms) and the remaining lines are modules from the standard library, whose import times together sum up to those root 22 ms.
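You can reproduce this kind of per-import measurement without cProfile by timing the import statements directly. A minimal sketch (caveat: a module already present in `sys.modules` costs almost nothing to "import" again, so each number is only meaningful in a fresh interpreter):

```python
import sys
import time

for name in ("subprocess", "configparser", "sqlite3", "datetime"):
    already_loaded = name in sys.modules  # cached imports cost ~nothing
    start = time.perf_counter()
    __import__(name)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"import {name}: {elapsed_ms:.2f} ms (was cached: {already_loaded})")
```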
So, 75% of the execution time has nothing to do with my program: the first 50% is the interpreter starting up, and another 25% is spent importing things from the standard library. Any optimization I can do on my own code is bounded by the remaining ~25% of the overall execution, that is, 28 ms.
In fact, I found that imports can be somewhat optimized. In my case, one import is different from the others: `subprocess` is only actually used for one very specific use case, which is opening a text editor when the user wants to edit a task. In Python, imports don't have to sit at the top of the script; you can do them inside a function. This means I could avoid importing `subprocess` systematically, and only import it when necessary. This saved me 10 ms.
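The pattern is just moving the import into the function body. A sketch (the `open_editor` helper and its command line are invented for illustration, and note that `subprocess` may already be loaded by other code in some environments):

```python
import sys

def open_editor(path):
    # Deferred import: the cost is only paid the first time a task is
    # edited, not on every plain `todo` invocation.
    import subprocess
    # Illustrative only; a real implementation would run the user's $EDITOR:
    return subprocess.list2cmdline(["editor", path])

print("subprocess loaded before call:", "subprocess" in sys.modules)
print(open_editor("/tmp/task.txt"))
print("subprocess loaded after call:", "subprocess" in sys.modules)
```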
Business Logic time
Near the top of the listing, there is an important function call that takes quite some time:

```
        1    0.000    0.000    0.012    0.012 docopt.py:493(docopt)
```

`docopt` is the function that parses the command-line arguments. It takes 12 ms out of the 28 ms total of `main`. The command line in question being simply: `todo`. In absolute terms, such performance is quite ridiculous.
So, finally, after the interpreter has started up, after the imports are done, after the command line has been parsed, my code, at last, executes in 16 ms.
The point of docopt is that you write the documentation of your command-line interface, and it automatically generates a parser from it (the opposite of e.g. the standard `argparse`, which generates documentation from a parser you build with its API). I studied docopt's performance a bit and realized that a big performance hit was parsing the documentation text and building the parser from it. Once the parser is built, parsing the command line itself is fast. This is sad, because the internal Python object representing the parser could literally be hard-coded in the source code, instead of being regenerated from the text every time the program is invoked.
I thought I would win some time by using the standard `argparse`, since it builds the parser programmatically and doesn't have to parse text, but after experimenting I found out I was wrong. With `argparse`, you declare each command your parser can handle with the `add_argument` method, and this method took 1 ms to execute. Since my program has more than 12 commands, it was slower than docopt's 12 ms.
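Here's roughly how that `add_argument` cost can be measured (the command names below are invented stand-ins for a CLI with a dozen commands, and absolute numbers obviously depend on the machine):

```python
import argparse
import time

parser = argparse.ArgumentParser(prog="todo")

start = time.perf_counter()
for command in ("add", "done", "task", "edit", "rm", "ctx",
                "contexts", "history", "purge", "mv", "rmctx", "future"):
    # One flag per command, mimicking a CLI with 12+ commands.
    parser.add_argument(f"--{command}", action="store_true")
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"12 add_argument calls took {elapsed_ms:.2f} ms")
```

On the slow machine from this article, each call costing ~1 ms adds up fast; the point is that the parser is rebuilt on every invocation either way.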
Looks like parsing command lines is hard. How the hell does git even manage?
I didn't investigate the business logic much further, because it's all broken down into little pieces of a few milliseconds or less that together sum up to the 16 ms. There is the notable 7 ms contribution of the function `get_task_string_components`, which, I found out, was so slow because some tasks in todo's output carry a deadline alongside their label, and Python is slow to convert a datetime object to a string representation when using the `strftime` function and a date format. You can make it an order of magnitude faster by doing it yourself with basic string formatting.
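For instance, here's a quick comparison sketch (the exact speed-up varies by platform and Python version; both functions produce the same string here):

```python
import datetime
import timeit

deadline = datetime.datetime(2018, 3, 14, 15, 9)

def with_strftime():
    return deadline.strftime("%Y-%m-%d %H:%M")

def with_plain_formatting():
    d = deadline
    return f"{d.year}-{d.month:02d}-{d.day:02d} {d.hour:02d}:{d.minute:02d}"

assert with_strftime() == with_plain_formatting()  # same result either way

n = 10_000
print("strftime:        ", timeit.timeit(with_strftime, number=n))
print("plain formatting:", timeit.timeit(with_plain_formatting, number=n))
```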
I searched for the `fetchall` function, which is the sqlite3 function that actually retrieves the query's results from the SQLite database. This was my initial intuition for the bottleneck. It was deep down the listing, with a cumtime of "0.000", which means it executes in under half a millisecond.
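This is easy to sanity-check by timing a comparable query in isolation. A sketch with a made-up schema (todo's real table layout is different, and 200 rows is just a plausible task count):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT, done INTEGER)")
conn.executemany(
    "INSERT INTO tasks (title, done) VALUES (?, 0)",
    [(f"task {i}",) for i in range(200)],
)

start = time.perf_counter()
rows = conn.execute("SELECT title FROM tasks WHERE done = 0").fetchall()
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"fetched {len(rows)} rows in {elapsed_ms:.3f} ms")
```

Even on modest hardware, a query like this over a couple hundred rows is nowhere near the bottleneck.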
This was a toy analysis of my toy program, but I take away some valuable lessons from it:

- You can be very wrong about the bottleneck. You should measure.
- If you want it to be snappy, you should use a compiled language.
- If you really want to be serious about optimization, then you'll most likely need to abandon standard stuff and implement the stack yourself along some specific paths, all the way down to the bottom.
No. 3 makes me wonder: when you do a Google Search, from your browser all the way to Google hardware, what is the percentage of it that is standard, and what is the percentage of it that is custom-made?