by Valentin, August 5 2018, in tech

Timing a simple Python command-line tool

I programmed a little todo-list tool for the command line, and I'm not entirely satisfied with its "performance". What bothers me is that I can notice a tiny delay between the moment I hit the todo command and the moment the output is written to standard output. So I analysed how much time it took, and where the time was lost. My initial intuition was that most of the time was spent querying an SQLite database, but I was very wrong.

Note that any performance number given in this article comes from an entry-level, Intel Pentium-powered computer. The program is faster on my newer laptop. All the measurements are about using the bare todo command, which simply prints the list of pending tasks.

Total time

To time the program, we can use the UNIX time program:

$ time todo
real    0m0.097s
user    0m0.078s
sys 0m0.016s

It takes 97 ms for the todo command to execute, hence the small but noticeable delay. As a comparison, cat-ing a file of about 200 lines takes about 6 ms. So 97 ms to print a bunch of tasks is really slow in absolute terms.

Script time

To investigate further, we can use the cProfile Python module, which breaks down the time taken by each function call and each module import. cProfile can be run as a stand-alone executable thanks to the Python interpreter's -m flag. We sort the function calls by cumulative time, descending (cumulative time includes the time spent in sub-calls, and if a function is called multiple times, all those times are summed together):

$ python -m cProfile -s cumtime ./

     34229 function calls (33095 primitive calls) in 0.050 seconds

 Ordered by: cumulative time


The important information here is that cProfile tells us the code executed in 0.050 seconds. We are missing 47 ms to be consistent with the 97 ms total time. Well, surprise: those 47 ms are actually the time the Python interpreter takes to start up, before it even starts executing the script. This can be verified by timing the interpreter while it executes an empty program:

$ time python -c ''
real    0m0.045s
user    0m0.028s
sys 0m0.012s

45 ms. That checks out, within a small margin of error.

That is very bad: it means that almost 50% of the total run time is due to Python starting up, a contextual cost that is out of my control.

Let's analyse the other 50%.

Main time

I now look at the listing of function calls, sorted by cumulative time, descending (I've removed the lines containing Python internals):

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      1    0.000    0.000    0.050    0.050
      1    0.000    0.000    0.028    0.028
      1    0.000    0.000    0.021    0.022
      1    0.000    0.000    0.012    0.012
     35    0.000    0.000    0.010    0.000
      1    0.000    0.000    0.010    0.010
     29    0.000    0.000    0.010    0.000
      1    0.000    0.000    0.010    0.010
     22    0.000    0.000    0.010    0.000
     23    0.000    0.000    0.009    0.000
      1    0.000    0.000    0.008    0.008
      2    0.000    0.000    0.007    0.003
      2    0.000    0.000    0.007    0.003
      1    0.000    0.000    0.006    0.006
      2    0.000    0.000    0.006    0.003
   1660    0.001    0.000    0.006    0.000
   42/1    0.001    0.000    0.006    0.006
     22    0.000    0.000    0.006    0.000
  53/22    0.000    0.000    0.006    0.000
      1    0.000    0.000    0.004    0.004
      1    0.000    0.000    0.002    0.002

A bit of guidance here:

  • The cumtime column is the one we're interested in
  • The filename:lineno(function) column tells us who the performance culprits are. In this column, module imports are marked with the conventional name <module>. A module import is simply the module's source code being executed from top to bottom (with the side effect of its top-level variables being bound as attributes of the object the module is imported as)
  • The first line is always the execution of the overall script (no different from importing it: execution from top to bottom), so it accounts for the entire measured time
  • The only thing the script itself does is import the main module and execute its main function; the only thing that module does at import time is import libraries and define functions
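To make the "a module import is just execution" point concrete, here is a self-contained sketch that writes a throwaway module to a temporary directory and imports it (the name demo_module is made up for the demo):

```python
import os
import sys
import tempfile

# A throwaway module whose top level does real work: a print statement
# and a variable assignment, both executed at import time.
source = 'print("side effect at import time")\nVALUE = 42\n'

with tempfile.TemporaryDirectory() as directory:
    with open(os.path.join(directory, "demo_module.py"), "w") as f:
        f.write(source)
    sys.path.insert(0, directory)
    import demo_module  # runs demo_module.py from top to bottom
    print(demo_module.VALUE)  # top-level names became attributes
```

The print statement fires during the import itself, which is exactly the kind of hidden work cProfile attributes to the <module> lines.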

The important information here is that the main function takes 28 ms to execute. We are missing 22 ms to be consistent with the overall 50 ms execution time. Those 22 ms are spent on the import itself, which is not satisfying, because at import time the module seemingly does nothing: it imports libraries and defines functions. Well, it turns out that in Python, importing libraries isn't nothing at all.

If we grep the previous listing for <module> to keep imports only, we end up with:

    1    0.000    0.000    0.021    0.022
    1    0.000    0.000    0.010    0.010
    1    0.000    0.000    0.006    0.006
    1    0.000    0.000    0.004    0.004
    1    0.000    0.000    0.002    0.002

The first line is the root import (22 ms) and the remaining lines are modules from the standard library, whose individual import times together sum up to those root 22 ms.

So, roughly three quarters of the execution time has nothing to do with my program: about half is the interpreter starting up, and another quarter is importing things from the standard library. Any optimization I can do on my own code is bounded by the remaining quarter of the overall execution, that is, 28 ms.

In fact, I found that imports can be somewhat optimized. In my case, one import is different from the others: subprocess is only used for one very specific use case, opening a text editor when the user wants to edit a task. In Python, imports don't have to sit at the top of the script; you can do them inside a function. This means I could avoid importing subprocess unconditionally, and only import it when necessary. This saved me 10 ms.
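The deferred import pattern looks like this (a sketch; print_tasks and edit_task are simplified stand-ins for the real functions):

```python
import os

def print_tasks(tasks):
    # Hot path, runs on every invocation: no subprocess needed here,
    # so the module is never imported just to print the task list.
    for task in tasks:
        print(task)

def edit_task(path):
    # Cold path: subprocess is only imported when the user actually
    # asks to open a task in their text editor.
    import subprocess
    editor = os.environ.get("EDITOR", "vi")
    subprocess.call([editor, path])
```

Python caches imported modules in sys.modules, so even if edit_task is called several times in one run, only the first call pays the import cost.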

Business Logic time

Near the top of the listing, there is an important function call that takes quite some time:

    1    0.000    0.000    0.012    0.012

docopt is the function that parses the command-line arguments. It takes 12 ms out of the 28 ms total of main, and the command line in question is simply: todo, with no arguments at all. In absolute terms, such performance is quite ridiculous.

So, finally, after the interpreter has started up, the imports are done, and the command line is parsed, my code, at last, executes in 16 ms.

The point of docopt is that you write the documentation of your command-line interface, and it automatically generates a parser from it (the opposite of, e.g., the standard argparse, which generates documentation from a parser you build with its API). I studied docopt's performance a bit and realized that a big performance hit was parsing the documentation text and building the parser from it. Once the parser is built, parsing the command line is fast. This is sad, because the internal Python object representing the parser could literally be hard-coded in the source code, instead of being regenerated from the text every time the program is invoked.
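The build-once-reuse idea that docopt misses can be illustrated with a regular expression standing in as an analogy, where the textual pattern plays the role of the usage string (the pattern and timings here are made up for the demo):

```python
import re
import timeit

# The textual definition plays the role of docopt's usage string.
PATTERN = r"(\d{4})-(\d{2})-(\d{2})"

def build_then_parse():
    # Rebuild the "parser" from its textual definition on every call,
    # as docopt does with the documentation. re.purge() defeats the
    # re module's internal cache so the compilation really happens.
    re.purge()
    return re.compile(PATTERN).match("2018-08-05")

PREBUILT = re.compile(PATTERN)

def reuse_prebuilt():
    # Reuse the already-built parser object: only the matching remains.
    return PREBUILT.match("2018-08-05")

for f in (build_then_parse, reuse_prebuilt):
    us = timeit.timeit(f, number=2000) / 2000 * 1e6
    print(f"{f.__name__}: {us:.2f} µs per call")
```

The exact numbers vary by machine, but rebuilding from text on every call is reliably the slower of the two.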

I thought I would save some time by using the standard argparse, since it builds the parser programmatically and doesn't have to parse text, but after experimenting I found out I was wrong. With argparse, you add each command your parser is able to parse with the add_argument method, and this method took about 1 ms to execute. Since my program has more than 12 commands, it was slower than docopt's 12 ms.
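A sketch of what the argparse experiment could look like; the subcommand names are hypothetical, and the timing printed at the end will vary by machine:

```python
import argparse
import timeit

def build_parser():
    # Every add_parser / add_argument call does real work at run time,
    # so a CLI with many subcommands pays for all of them on every
    # single invocation, before parsing even starts.
    parser = argparse.ArgumentParser(prog="todo")
    subparsers = parser.add_subparsers(dest="command")
    add = subparsers.add_parser("add")
    add.add_argument("title")
    done = subparsers.add_parser("done")
    done.add_argument("id", type=int)
    subparsers.add_parser("purge")
    return parser

ms = timeit.timeit(build_parser, number=100) / 100 * 1000
print(f"building the parser: {ms:.3f} ms")
```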

Looks like parsing command lines is hard. How the hell does git even manage.

I didn't investigate much further into the business logic, because it's all broken down into little pieces of a few milliseconds or less that together sum up to the 16 ms. There is a notable 7 ms contribution from the function get_task_string_components, which, I found out, was so slow because some tasks in todo's output carry a deadline alongside their label, and Python is slow to convert an internal datetime object to a string representation when using the strftime function and a date format. You can make it an order of magnitude faster by doing it yourself with basic string formatting.
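The comparison can be sketched as a standalone snippet (the format string is an assumption, and the exact speedup depends on the platform and format, so don't expect precisely an order of magnitude):

```python
from datetime import datetime
import timeit

deadline = datetime(2018, 8, 5, 12, 30)

def with_strftime():
    return deadline.strftime("%Y-%m-%d %H:%M")

def by_hand():
    # Plain string formatting of the datetime's attributes.
    return (f"{deadline.year:04d}-{deadline.month:02d}-{deadline.day:02d}"
            f" {deadline.hour:02d}:{deadline.minute:02d}")

# Both produce the same text; only the cost differs.
assert with_strftime() == by_hand()

for f in (with_strftime, by_hand):
    us = timeit.timeit(f, number=10_000) / 10_000 * 1e6
    print(f"{f.__name__}: {us:.2f} µs per call")
```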

I searched for the fetchall function, the sqlite3 function that actually pulls the query's results from the SQLite database. This was my initial suspect for the bottleneck. It was deep down the listing, with a cumtime of "0.000", which means it executes in under half a millisecond.
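To double-check that intuition, the query can be timed in isolation; the schema below is a hypothetical stand-in for todo's actual database:

```python
import sqlite3
import timeit

# In-memory database standing in for todo's task database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO tasks (title) VALUES (?)",
                 [(f"task {i}",) for i in range(200)])

def query():
    # execute() runs the SELECT, fetchall() pulls all the rows out.
    return conn.execute("SELECT id, title FROM tasks").fetchall()

ms = timeit.timeit(query, number=1000) / 1000 * 1000
print(f"query + fetchall over 200 rows: {ms:.3f} ms")
```

On any recent machine this comes out well under a millisecond, consistent with the profiler's "0.000".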


This was a toy analysis of my toy program, but I take away some valuable lessons from it:

  1. You can be very wrong about the bottleneck. You should measure.
  2. If you want a program to be snappy, you should use a compiled language.
  3. If you really want to be serious about optimization, you'll most likely need to abandon the standard stuff and implement some specific paths yourself, all the way down.

No. 3 makes me wonder: when you do a Google search, from your browser all the way to Google's hardware, what percentage of the stack is standard, and what percentage is custom-made?