GSoC Week 2 - The Quest for Performance

Posted on May 23, 2018

Hey. This is me again, in the second week of my coding period with Google Summer of Code :)

Take a seat, and enjoy!

A word about feelings

First of all, I realize this blog is starting to grow on me. I feel that by writing down the contents of the mind’s buffer, one gets the opportunity to face all that possibly unordered content. I will try to serve quality content here, along with improving the pleasantness of the portal itself.

Enough, let’s squeeze the juice.

The work so far

So, where were we?

In the previous post, gsoc-week1, I walked through the development of a mitmproxy addon to measure the execution time of the function in charge of dumping a flow state into a Protocol Buffers string.

Some initial results showed that moving to protobuf serialization could indeed have a good impact:

[figure: initial serialization benchmark results]

But hey, dumping to a bytes string is only half of the process. My mentor suggested I extend the performance analysis to include writing the blob to disk.

So I did.

0x10-> Lite Databases and stuff

Writing to disk could be as easy as a call to open, plus a few lines of code.
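
In the flat-file world, that would be a hypothetical helper like this (pbuf_blob being whatever bytes the protobuf dump produced):

    def write_blob(pbuf_blob: bytes) -> None:
        # Append the serialized flow to a flat file; that's all the "storage".
        with open("flows.dump", "ab") as f:
            f.write(pbuf_blob)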

Still, in order to make this testing process meaningful, I want my system to be as close as possible to the target I am tasked to implement this summer. The whole reason behind the serialization revamp is, basically, changing how mitmproxy stores and retrieves flows.

It should be dynamic. Flows should be stored on disk, retrieved by index, ordered in bunches, possibly through user-defined filters. And this will happen interactively, in a transparent and flexible way. Using file handles just doesn’t click.

Database systems come to the rescue. Using a DBMS, I can easily implement all the functionality listed right above. And since I will just store blobs, along with some utility columns, SQLite seems just the right choice. In particular, quoting from sqlite.org:

SQLite does not compete with client/server databases. SQLite competes with fopen().

0x11-> A Dummy Schema

(MID INTEGER PRIMARY KEY, PBUF_BLOB BLOB)

This is it. I suppose that every piece of work’s name, in this phase of coding, starts with Dummy. Take it as a contract, an insurance between me and you. Down the road, things should be a bit more complete :)

As you can see, there’s not much! The only thing I need to build a functioning system here is…storing the blobs and marking them with a good ol’ numeric index!
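
Spelled out as an actual statement (with flows.db and FLOWS as names I am making up just for this sketch), that is:

    import sqlite3

    # Illustrative only: the file and table names here are placeholders.
    conn = sqlite3.connect("flows.db")
    conn.execute("CREATE TABLE IF NOT EXISTS FLOWS (MID INTEGER PRIMARY KEY, PBUF_BLOB BLOB)")
    conn.commit()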

0x20-> Verba volant, IDs manent

With the sqlite3 API ready, and a barebones schema, let’s connect our storage with the previously implemented protobuf layer. This is, practically, how dump and load interact with our DB:

  • store takes a blob, stamps it with the next mid (current maximum plus one) and inserts the tuple into the DB. It returns that mid to the application, which can then use it as a ticket.
  • collect takes that ticket mid, and uses it to retrieve the blob from the DB. (A sketch of both follows below.)
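
A minimal sketch of the pair, reusing the placeholder FLOWS table from above (my illustration, not the actual addon code):

    import sqlite3

    conn = sqlite3.connect("flows.db")

    def store(blob: bytes) -> int:
        # An INTEGER PRIMARY KEY makes SQLite assign the next rowid itself,
        # so the "current maximum plus one" bookkeeping comes for free.
        with conn:
            cur = conn.execute("INSERT INTO FLOWS (PBUF_BLOB) VALUES (?)", (blob,))
        return cur.lastrowid  # the ticket handed back to the application

    def collect(mid: int) -> bytes:
        # Spend the ticket to get the blob back.
        row = conn.execute(
            "SELECT PBUF_BLOB FROM FLOWS WHERE MID = ?", (mid,)
        ).fetchone()
        return row[0]

Leaning on the rowid auto-assignment also spares a SELECT MAX(MID) round trip on every store.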

0x21-> Transactions, and how to avoid (too many of) them

Dumping the same 4 MB body as before, this time including the DB insert, yields these results:

[figure: dump/load timings including the DB insert]

0.05 seconds, for a single flow, is far from what we should be getting.

But something is worth noting: while inserting into the DB takes much more time than a tnetstring dump to file, read performance is still superior. That suggests something about how the DBMS handles updates: the loss in write performance, as I pointed out in the GitHub PR discussion, is likely caused by every insert committing in its own isolated transaction, implicit in the with sqlite3.connect() context.
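
To make the diagnosis concrete: each with conn block below commits (and syncs) on exit, so the first variant pays one full transaction per flow, while the second groups everything into a single commit. Names are, again, from my sketch above:

    from typing import Iterable

    def store_each(conn, blobs: Iterable[bytes]) -> None:
        # One implicit transaction per insert: every flow pays its own commit.
        for blob in blobs:
            with conn:
                conn.execute("INSERT INTO FLOWS (PBUF_BLOB) VALUES (?)", (blob,))

    def store_batched(conn, blobs: Iterable[bytes]) -> None:
        # Many inserts grouped in a single transaction: one commit in total.
        with conn:
            conn.executemany(
                "INSERT INTO FLOWS (PBUF_BLOB) VALUES (?)",
                ((blob,) for blob in blobs),
            )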

What’s next

The way I am approaching this “testing” period is truly helping me shape ideas about how to implement all the rest. The next steps will be:

  • Implement lighter INSERTs to disk, grouping many of them in single transactions.
  • Explore asyncio concurrency, to make use of time “wasted” in disk I/O (see the sketch after this list).
  • Improve what is implemented, fashioning less dummy code.
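
As for the asyncio point, one possible shape of it (purely an idea sketch, building on the store function above) is to push the blocking insert onto an executor, so the event loop keeps doing useful work during the disk I/O:

    import asyncio

    async def store_async(blob: bytes) -> int:
        # Off-load the blocking sqlite3 insert to the default thread pool;
        # the connection would need check_same_thread=False (or a single
        # dedicated writer thread) to be safely used from there.
        loop = asyncio.get_event_loop()
        return await loop.run_in_executor(None, store, blob)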

Til next week! Enjoy :)