Installing WxPython and RunSnakeRun on Mac OSX 10.9

I just posted about my Ctrl-C strategy for profiling and now I’m going to completely flip-flop and explain how I installed RunSnakeRun, a way to visualize the output of Python’s cProfile. The Ctrl-C way of profiling worked really well for optimizing append performance of my time series storage library, but doesn’t work so great for profiling things that are already very fast (on the order of milliseconds) and need to be faster.

For that, RunSnakeRun worked really well. RunSnakeRun gives you a nice rectangle chart showing in which functions your program spends most of its time.

RunSnakeRun’s rectangle plot of function cumulative run time

To install RunSnakeRun on Mac OSX, you’ll need Homebrew and PIP. The install looks something like this (the Homebrew formula and PyPI package names are the ones I’d expect; double-check them if anything has been renamed):
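    # wxPython provides the GUI toolkit that RunSnakeRun needs
    brew install wxpython

    # SquareMap draws the rectangle chart; RunSnakeRun is the viewer itself
    pip install SquareMap RunSnakeRun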

Next, you’ll need to export a pstats database with your profiler information. Use cProfile to do this. For TsTables, you can run the benchmark under cProfile with something like this (the benchmark script path is a placeholder; point it at whatever exercises the append code):
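    python -m cProfile -o tstables.profile benchmark.py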

This will create a tstables.profile file in the current directory, which you can open with RunSnakeRun. Start RunSnakeRun by running runsnake  (assuming that PIP’s bin folder is in your path).
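For example, to open the profile generated above:

    runsnake tstables.profile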

The barebones profiling method that is surprisingly effective

I’m working on profiling my time series database (TsTables) because append performance is not what I want it to be. I know that the issue is a few loops that are written in Python instead of using NumPy’s optimized vector operations. I’m not exactly sure which loop is the slowest.

I started trying to get cProfile to work, but ended up with way too much data to be useful. So I reverted to my old school, barebones profiling method: Ctrl-C.

How do you use this method, you might ask? Start your program and randomly hit Ctrl-C. Wherever your program stops most frequently is probably the slowest part. Speed that up and repeat!
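For example, interrupting a slow run might drop you into a traceback like this (file names here are made up); if the same frame shows up in most of your interrupts, that loop is your prime suspect:

    ^CTraceback (most recent call last):
      File "benchmark.py", line 30, in <module>
        store.append(rows)
      File "store.py", line 75, in append
        converted.append(convert_row(row))
    KeyboardInterrupt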

Two Podcasts I’ve been Enjoying Recently

Since I’ve moved offices to WeWork Chinatown, I have about a 20-minute commute on Metro. The nice thing about commutes is that you have some downtime, which recently I’ve been using to listen to podcasts. If you’re looking for podcast recommendations, two good ones are Monocle’s The Entrepreneurs and the Tim Ferriss Show.

The Entrepreneurs kind of makes you feel like you are listening to BBC’s World Service or NPR, but the content is focused on business and entrepreneurship. It has a refreshing international and non-technology bent, which is great for getting out of the U.S. technology world and realizing that not every new business needs to be a website started in San Francisco!

The Tim Ferriss Show is by the author of the Four Hour Body and Four Hour Workweek, both NY Times bestsellers. The format is much longer than The Entrepreneurs, and is more of a conversation than a highly-produced magazine-style show. There are only a few episodes of this one, but so far the guests have been very interesting and Tim does a good job interviewing them in depth.

The Ultimate Guide to Dealing with High Frequency Data

I have no idea if this is actually the ultimate guide to high frequency data, but hopefully it is at least a useful guide!

I’m currently working on a project to replace the Time Series Database (TSDB) that I wrote a few years ago. By re-writing it, I’m learning a lot about what works and what doesn’t when dealing with high frequency data. High frequency in this case means billions of records, with timestamp precision down to the millisecond. This data is being used for economic research and analytics, not live trading. It is a pain to deal with because it is simply too large to fit in memory or process with a standard relational database. These are some rules that I’ve developed while working on this project.

Partition your data!

The biggest issue with this much time series data is that you simply cannot index the timestamp column with any normal indexing scheme. In DB-speak, the timestamp column will be a “high cardinality” column, which means it does not lend itself well to indexing. This is a problem because most queries on this kind of high frequency data are to fetch a subset by timestamp, and you do NOT want to make a table scan of a billion plus records to find a few minutes of data.

TSDB attempts to fix this problem by keeping rows in order and creating a sparse index (an index entry for every 20,000 rows). This should work in theory, but you must ensure that your rows are always sequential. That makes inserting or updating difficult, because you potentially need to shift rows to maintain the order. Also, I’m not aware of a relational database that lets you create a sparse index, which rules out the most common and best understood data stores.

Another approach is to partition your data. This is my current approach. The way this works is you simply create multiple tables for set time periods (one table per day or month is a good starting point). You then put records that match those time periods in their respective tables and write some logic to union queries across tables.
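As a rough sketch in SQL (table and column names are hypothetical), with one table per month and a query that unions the partitions covering the requested range:

    CREATE TABLE ticks_2014_01 (ts BIGINT NOT NULL, price DOUBLE PRECISION NOT NULL);
    CREATE TABLE ticks_2014_02 (ts BIGINT NOT NULL, price DOUBLE PRECISION NOT NULL);

    -- Fetch 2014-01-31 00:00 UTC through 2014-02-01 12:00 UTC (timestamps in milliseconds)
    SELECT ts, price FROM ticks_2014_01 WHERE ts BETWEEN 1391126400000 AND 1391256000000
    UNION ALL
    SELECT ts, price FROM ticks_2014_02 WHERE ts BETWEEN 1391126400000 AND 1391256000000
    ORDER BY ts;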

Partitioning enables the database to hunt through a subset of rows to find the ones that match your query. Both Postgres and MySQL support partitioning, making them viable options for storing time series data. The library that I’m working on will use PyTables to partition time series data by date.

Store timestamps in UTC

Most of the time, your source data will have timestamps in UTC. If it doesn’t, I suggest you convert to UTC before storing. Most libraries use either UTC or local time internally, and because you can never be sure what time zone your users will be in, using UTC is the least common denominator.

UTC also has the nice property of not having daylight saving time changes. DST causes all sorts of pain when working with 24/7 data. Avoid it by just dealing in UTC internally, and then converting to other timezones for querying or display.
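For example, in Python with pytz, assuming the source data is stamped in US/Eastern:

    from datetime import datetime
    import pytz

    eastern = pytz.timezone("US/Eastern")

    def to_utc(naive_local_dt):
        # Interpret the naive timestamp as US/Eastern, then convert to UTC for storage
        return eastern.localize(naive_local_dt).astimezone(pytz.utc)

    to_utc(datetime(2014, 3, 9, 1, 30))  # 2014-03-09 06:30:00+00:00, just before the DST jump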

Store timestamps as integers, even if your programming language uses floats

MATLAB, Excel, and R all store timestamps internally as floats by default. This gives their timestamp types a large range and high precision, but I don’t think it is appropriate for archiving time series data. Why? Floats are imprecise. You do not know with any accuracy the number of significant digits when using a float, and you cannot make comparisons without worrying about round-off errors. Admittedly, even with microsecond data, a system that uses a 64-bit double with 1970-01-01 as the epoch will not lose precision until 2242-03-16, but why worry about it?

I recommend a 64-bit integer as the timestamp column. With one tick equaling one millisecond, you have a time range of ±292 million years; this is Java’s internal representation. With one tick equaling 100 nanoseconds (0.1 microsecond), you have a time range of ±29,227 years, which is what Win32 does. Either should be plenty!
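A minimal sketch of the conversion in Python, with one tick equal to one millisecond:

    from datetime import datetime, timedelta

    EPOCH = datetime(1970, 1, 1)  # all datetimes here are assumed to be UTC

    def to_millis(dt):
        # Integer milliseconds since the epoch; safe to compare exactly
        delta = dt - EPOCH
        return (delta.days * 86400 + delta.seconds) * 1000 + delta.microseconds // 1000

    def from_millis(ms):
        return EPOCH + timedelta(milliseconds=ms)

    to_millis(datetime(2014, 1, 1))   # 1388534400000
    from_millis(1388534400000)        # datetime(2014, 1, 1, 0, 0)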

Have a solid ETL procedure

ETL means “extract, transform, load” and is the term for taking data out of one format or system and loading it into another. The step to get right when you are dealing with high frequency data is the “load” step. Try to build a process where you can revert a botched load automatically. If you don’t, it is guaranteed that someone or something will screw up an import, and you will be left wading through millions of rows of data to fix it or re-importing everything from your source system.

The most basic way to make the load step revertible is to just make a copy of your time series before writing anything to it. You could devise a more sophisticated process, like using rsync to make diffs of your time series, but be nice to your future self and make backups at the very least!
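The simplest version of that is a straight copy before the import (file names here are hypothetical):

    import shutil
    from datetime import datetime

    store = "ticks.h5"
    snapshot = "{}.pre-import-{:%Y%m%dT%H%M%SZ}".format(store, datetime.utcnow())

    # Snapshot the store before loading anything new
    shutil.copy2(store, snapshot)

    # ...run the import; if it goes wrong, restore with:
    # shutil.copy2(snapshot, store)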

The Business Model Sketch

I was helping a friend work through some business ideas, and realized that just as writing an outline helps you structure an essay, doing a “business model sketch” can help you break apart a business idea and evaluate its viability. Just like an essay needs a thesis, a body, and a conclusion or it falls flat, a business model needs specific components or it is not viable. I identified six components that roughly draw from The Personal MBA and the “lean startup” philosophy. If you are thinking of starting a new business, consider taking one page of paper, drawing six boxes, and filling out these six areas: customer, value proposition, marketing, sales, value delivery, and finance.

The problem with many business ideas is that, while they sometimes hit on a few of these areas, they are weak in others. That can be okay, but as a startup, your goal should be to develop a plan about how to address your idea’s weak points, and quickly test whether that plan is viable. If it is, great! If not, reconsider or proceed with caution!

The template

The customer

The customer is the specific person to whom value is being provided and who is making the purchasing decision. Many times a business’s value proposition will involve several people. For example, if you are selling family vacations, your business is offering value to the father (maybe you bundle in a few rounds of golf), the mother (there’s a reason cruises have spas), and the kids (I don’t think Disney pays actors to dress as Mickey Mouse for dad). This is important to understand, especially when you get to the value delivery portion of the business model sketch. First, however, identify the decision maker!

Value proposition

The value proposition is a description of the value you are providing your customer. Remember that your customer is the person making the purchasing decision, so when crafting your value proposition, you need to understand the wants and needs of that particular person above any secondary party.

Ideally, focus your value proposition on addressing core human drives. The Personal MBA identifies five core human drives: the drives to acquire, bond, learn, defend, and feel. Many consumer products clearly target their value propositions to these core human drives. This can be difficult if you are not selling consumer products or services, but if you look closely, many business-to-business or business-to-government sales are targeted the same way. Living in Washington, DC, I notice this every time I go through the Pentagon metro station. Large defense contractors clearly target their buyers’ core human drive to defend with advertisements that depict the strength and sophistication of their offerings. The advertisements for Northrop Grumman’s unmanned drones are a great example.

Marketing

Marketing is fundamentally your strategy for getting your customer’s attention to deliver your value proposition. When you create a business model sketch, you need to attempt to identify how you will reach enough customers at an economical cost. If you have an excellent value proposition but your customers do not know about it, your business model will fail.

In the marketing portion of your sketch, you should also attempt to identify in rough terms the market size and dynamics: how many potential customers do you have and is that number increasing and/or gaining purchasing power?

Sales

If marketing is your strategy for reaching your target customer, sales is your strategy for negotiating a contract to deliver your value proposition in exchange for money. For an online business, this could be as simple as “sign up for PayPal and put a Buy Now button on my website”. If you are selling enterprise software, this could mean extensive contract negotiations.

Sales and marketing are closely intertwined, but separating them into different boxes in the business model sketch should help you distinguish the act of reaching your customer and delivering your value proposition (marketing) from the mechanics of “closing the deal” (sales).

Value delivery

This is what many people think of as the meat of the business model, but the previous sections are just as key to determining its viability. Value delivery covers the processes that deliver the promised value to your customers. In the family vacation example, this is the operation of your hotel, from hiring staff to procuring food and drinks for the restaurant. If you are an online business, this is the cost of developing your website as well as the operational cost of running your servers and supporting your customers.

Finance

The finance section should answer “what financial resources do I need to support this business model, and what is the return on investment?”. This is hard to answer in specifics in a business model sketch, and should be tackled in more detail in a complete business plan. However, once you complete the other sections, you should be able to answer these questions:

  • Is this a financial capital-intensive business to set up? To run? If yes, what is the risk to capital committed to the business? What return can I offer owners of capital given that risk? How can I source that capital, loans or equity investment?
  • Is this a human capital-intensive business to set up? If yes, what is my recruitment strategy? Can I attract the right talent? Would I compensate them with equity or a salary?

Writing Zero-Downtime Migrations for Rails and Postgres

Let’s suppose you are building an app. It is under heavy development and the dev team is cranking out new features left and right. Developers need to continually change the database schema, but you don’t want to take down the app for database migrations if at all possible. How the heck do you do this with Rails?

We had this problem recently, and have come up with a procedure that solves it for most small database migrations. The goals of this procedure are to:

  • Avoid downtime by running database migrations while the app is live
  • Avoid too many separate deployments to production
  • Keep the application code as clean as possible
  • Balance the cost of additional coding with the benefit of having a zero-downtime migration. If the cost or complexity of coding a migration in this way is too great, then a maintenance window is scheduled and the migration is written in a non-zero downtime fashion.

The first thing to understand when writing a zero-downtime migration is which types of Postgres data definition language (DDL) queries can be run without locking tables. As of Postgres 9.3, the following operations can be run without locking a table:

Postgres can create indexes concurrently (without table locks) in most cases. CREATE INDEX CONCURRENTLY can take significantly longer than CREATE INDEX, but it will allow both reads and writes while the index is being generated.

You can add a column to a table without a table lock if the column being added is nullable and has no default value or other constraints.

If you want to add a column with a constraint or a default value, one option is to add the column first with no default value and no constraint, then, in a separate transaction, backfill the default value (using UPDATE) or use CREATE INDEX CONCURRENTLY to build an index that will back the constraint. Finally, a third transaction can add the constraint or the default to the table. If that third transaction adds a constraint that uses an existing index, no table scan is required.
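Here is a sketch of that pattern (table, column, and index names are hypothetical):

    -- Step 1: add the column with no default and no constraints (metadata-only change)
    ALTER TABLE orders ADD COLUMN status text;

    -- Step 2: backfill existing rows in a separate transaction; this takes row locks,
    -- not a table lock (batch the UPDATE for very large tables)
    UPDATE orders SET status = 'pending' WHERE status IS NULL;

    -- Step 3: attach the default so that new rows get it automatically
    ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';

    -- For a constraint instead of a default: build the index first (outside a
    -- transaction block), then add a constraint that reuses it, avoiding a table scan
    CREATE UNIQUE INDEX CONCURRENTLY orders_external_id_idx ON orders (external_id);
    ALTER TABLE orders ADD CONSTRAINT orders_external_id_key UNIQUE USING INDEX orders_external_id_idx;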

Dropping a column only results in a metadata change, so it is non-blocking; the data is actually removed later, when the table is vacuumed.

Creating a table or a function is obviously safe because no one will have a lock on these objects before they are created.

Process for Coding the Migration

The guidelines I have been using for writing a zero-downtime migration are to:

  • Step 1: Write the database migration in Rails.
  • Step 2: Modify the application code in such a way that it will work both before and after the migration has been applied (more details on this below). This will probably entail writing code that branches depending on the database state.
  • Step 3: Run your test suite with the modified code in step 2 but before you apply the database migration!
  • Step 4: Run your test suite with the modified code in step 2 after applying the database migration. Tests should pass in both cases.
  • Step 5: Create a pull request on Github (or the equivalent in whatever tool you are using). Tag this in such a way that whoever is reviewing your code knows that there is a database migration that needs careful review.
  • Step 6: Create a separate pull request on Github that cleans up the branching code you wrote in step 2. The code you write in this step can assume that the DB is migrated.

When the migration is deployed, you’ll first deploy the code reviewed in step 5. This code will be running against the non-migrated database, but that is a-ok because you tested that case in step 3. Next, you will run the migration “live”. Once the migration is applied, you will still be running the code reviewed in step 5, but against the migrated database. Again, this is fine because you tested that case in step 4.

Finally, once the production database has been migrated, you should merge your pull request from step 6. This eliminates the dead code supporting the unmigrated version of the database. You should write the code for step 6 at the same time you write the rest of this code. Then just leave the pull request open until you are ready to merge. The advantage of this is that you will be “cleaning up” the extraneous code while it is still fresh in your mind.

Branching Application Code to Support Multiple DB States

The key to making this strategy work is that you’ll need to write your application code in step 2 in a way that supports two database states: the pre-migrated state and the post-migrated state. The way to do this is to check the database state in the models and branch accordingly.

Suppose you are dropping a column called “deleted”. Prior to dropping the column, you have a default scope that excludes deleted rows. After dropping the column, you want the default scope to include all rows.

You would code a migration to do that along these lines (a minimal sketch, assuming the table is called posts and the column is a boolean):
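    class RemoveDeletedFromPosts < ActiveRecord::Migration
      def up
        remove_column :posts, :deleted
      end

      def down
        # Restore the column so the migration can be rolled back
        add_column :posts, :deleted, :boolean, default: false
      end
    end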

Then, in your Post model, you’d add branching along these lines (assuming the old default scope simply excluded deleted rows):
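    class Post < ActiveRecord::Base
      # Branch on the current database state: only apply the old default scope
      # while the "deleted" column still exists. Once the column has been
      # dropped, no scope is applied and all rows are returned.
      if column_names.include?("deleted")
        default_scope { where(deleted: false) }
      end
    end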

But doesn’t this get complicated for larger migrations?

Yes, absolutely it does. When branching like this gets too complicated, we either sequence the DB changes over multiple deployments (and multiple sprints, in the Agile sense) or “give up” and schedule a maintenance window (downtime) to make the change.

Writing zero-downtime migrations is not easy, and you’ll need to do a cost-benefit analysis between scheduling downtime and writing lots of hairy branching code to support a zero-downtime deploy. That decision will depend on how downtime impacts your customers and your development schedule.

Hopefully, if you decide to go the zero-downtime route, this procedure will make your life easier!

Querying Inside Postgres JSON Arrays

Postgres JSON support is pretty amazing. I’ve been using it extensively for storing semi-structured data for a project and it has been great for that use case. In Postgres 9.3, the maintainers added the ability to perform some simple queries on JSON structures and a few functions to convert from JSON to Postgres arrays and result sets.

One feature that I couldn’t figure out how to implement using the built-in Postgres functions was the ability to query within a JSON array. This is fairly critical for lots of the reporting queries that I’ve been building over the past few days. Suppose you have some JSON like this (the exact data is just for illustration), stored in two rows in a table called “orders”, in the column “json_field”:
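    -- row 1
    {"id": 1, "customer": "Alice",
     "products": [{"id": 10, "name": "Widget"}, {"id": 11, "name": "Gadget"}]}

    -- row 2
    {"id": 2, "customer": "Bob",
     "products": [{"id": 11, "name": "Gadget"}, {"id": 12, "name": "Gizmo"}]}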

If you want to run a query like “find all distinct IDs in the json_field’s products array”, you can’t do that with the built-in JSON functions that Postgres currently supplies (as far as I’m aware!). This is a fairly common use case, especially for reporting.

To get this to work, I wrote a simple PL/pgSQL function to map a JSON array. A minimal version looks something like this (the function name and signature are illustrative, and this sketch only handles a single key rather than a nested path):
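    CREATE OR REPLACE FUNCTION json_array_map(json_arr json, path text)
    RETURNS json[] AS $$
    DECLARE
      result json[] := '{}';
      elem json;
    BEGIN
      -- Walk the array and pull out the value at "path" from each element
      FOR elem IN SELECT json_array_elements(json_arr) LOOP
        result := array_append(result, elem -> path);
      END LOOP;
      RETURN result;
    END;
    $$ LANGUAGE plpgsql IMMUTABLE;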

Given a JSON array as “json_arr” and a JSON path as “path”, the function loops through all elements of the array, locates the element at the path, and collects the results into a Postgres native array of JSON elements. You can then use other Postgres array functions to aggregate it.

For the query above where we want to find distinct product IDs in the orders table, we could write something like this:
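    SELECT DISTINCT product_id::text
      FROM orders,
           unnest(json_array_map(json_field -> 'products', 'id')) AS product_id;

The cast to text is there because the json type has no equality operator, so DISTINCT needs something it can compare.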

With the sample rows above, that would give you a result along these lines (row order may vary):
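     product_id
    ------------
     10
     11
     12
    (3 rows)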

Pretty cool!

Volatility of Bitcoin Index

A while back when I was a research assistant at the Federal Reserve, I worked on a project to make exchange rate volatility indexes for the major world currencies. We basically had some high frequency data for USD/EUR, USD/CHF, and USD/JPY and wanted to see how the financial crisis affected volatility. With all of the hype and turmoil around Bitcoin, I thought it would be interesting to make a similar index for the Bitcoin/USD exchange rate.

Before Bitcoin is ever able to become a viable “currency”, volatility needs to come down a lot. Low volatility isn’t sufficient for it to take off, but it probably is necessary. If you take the traditional definition of a currency as a “store of value”, a “medium of exchange”, and a “unit of account”, persistently high volatility is absolutely a death knell. This is especially true in Bitcoin’s case, where there is no government backing and there are attractive substitutes in traditional currencies.

One of the cool things about Bitcoin, however, is that lots of the data is fairly open. Most of the rest of the financial market data in the U.S. is behind the copyright lock and seal of the major exchanges and electronic trading networks. Both the NYSE and NASDAQ make lots of money off of selling market data, and they recently won a court case to raise their rates even further. That makes doing this kind of analysis on other securities an expensive endeavor for armchair quarterbacks like myself!

Bitcoin, however, has a rather open ethos and most of the exchanges publish realtime or near-realtime price data for free. CoinDesk aggregates this into their Bitcoin Price Index, which they graciously agreed to send over for this analysis.

The Volatility of Bitcoin Index (VOB Index—I’m open to suggestions on the name) is a simple rolling standard deviation of minutely log-returns. That is not nearly as complicated as it sounds. To create the index, I started with a time series of minutely Bitcoin price data and calculated the log returns (basically percent increase or decrease for each minute).  Then for each period, I took all of the returns within a trailing window and computed the standard deviation. I did this for three window lengths: 30, 60 and 90 days, and then annualized it. The Bitcoin markets are 24/7/365, so there are no holiday or weekend adjustments, which makes things a bit easier.
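In pandas-style Python, the core of the calculation is roughly this (prices is a hypothetical pandas Series of minutely BPI values indexed by UTC timestamp; CoinDesk’s data is not reproduced here):

    import numpy as np

    def rolling_annualized_vol(prices, window_days):
        # Log returns: approximately the minute-over-minute percent change
        log_returns = np.log(prices).diff()
        minutes_per_day = 60 * 24
        window = window_days * minutes_per_day
        # Standard deviation of minutely log-returns over a trailing window,
        # annualized by sqrt(minutes per year); no weekend or holiday adjustment
        return log_returns.rolling(window).std() * np.sqrt(minutes_per_day * 365)

    # vol_30, vol_60, vol_90 = (rolling_annualized_vol(prices, d) for d in (30, 60, 90))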

Here’s what the index looks like for the period August 2011 to January 31, 2014.

Volatility of Bitcoin Index

Here’s the BPI (Bitcoin Price Index) over that period.

BPI Index
Bitcoin volatility looks like it is dropping over time, especially compared with the early days in 2011 and 2012, except for a large bump in April of 2013. That bump was probably sparked by problems at Bitcoin’s largest exchange, Mt. Gox, during that time period. The decrease comes despite an increase in speculative interest in Bitcoin and moves by China to curtail the currency’s use.

Whether Bitcoin takes off is anyone’s guess and predicting if it does is probably a fool’s errand. There are lots of smart people working on it, but big questions about its viability still remain. In either case, it should be exciting to watch!

Tech Predictions for 2014

This is a just a bunch of fun tech predictions for 2014. I can’t claim any insider knowledge, but it will be interesting to look back on them at the end of the year to see what the outcomes are!

Bitcoin will become “legitimate”, but not widespread

In 2014, regulators will likely issue guidance on how exactly companies and people can deal in Bitcoin without running afoul of anti-money laundering and money transmitter regulations, and how it is to be taxed. This will “legitimize” Bitcoin, but because of the difficulty of using the cryptocurrency and the insane amount of volatility of Bitcoin against fiat money, it will not become mainstream in 2014. Instead, Bitcoin will be relegated to particular types of transactions that the mainstream banks are either not accepting (due to the high probability of fraud or high compliance costs) or are charging exorbitant fees to facilitate. These will likely be cross-border transactions, especially in countries with capital controls (Argentina), legally grey-area transactions, or transactions that the major payment processors don’t accept.

Docker plus Deis or Flynn will result in an “open-source Heroku” and lots of dev-ops innovation

Heroku is probably one of the biggest dev-ops innovations in the last five years, making it vastly easier to deploy web applications, freeing development teams to actually build applications instead of focusing on DevOps. Heroku’s 12 Factor App architecture is now widely used and 2014 will see that continuing. However, Heroku has a few problems.

First, there currently isn’t a good way to “build your own Heroku” out of open-source components. If Heroku is too constraining, you are forced to spin up your own servers on Amazon Web Services for part of your application, which eliminates many of the advantages Heroku brings to the table. Last year there was a ton of excitement about Linux containers (LXC) and Docker, which is an abstraction on top of LXC that makes containers easier to manage. Both Heroku and Google Cloud use Linux containers internally, but those are closed-source implementations. Docker will likely begin to change that this year.

However, Docker alone is not a Heroku replacement. Much of the innovation in Heroku lies in the buildpacks and the routing mesh, which Docker does not provide. Two other open-source projects aim to become that layer on top of Docker, and they are the ones I’m most excited to watch. The first is Deis, which seems to be the furthest along in creating an open-source Heroku: it has both a routing mesh and an application layer, as well as scripts to automatically scale your “personal Heroku” on AWS, Digital Ocean, and Rackspace. Flynn has many of the same goals but doesn’t appear to be as far along. Deis has commercial support from OpDemand, while Flynn is raising money Kickstarter-style to build out its platform. In any case, while Heroku is great, it is very exciting to see open-source competitors come onto the scene.

AngularJS and EmberJS will win the front-end wars for web apps

For highly interactive web apps, AngularJS and EmberJS will become the clear choices in 2014. Backbone, Knockout, and other front-end JS frameworks will see declining usage simply because they don’t provide as much of a framework as AngularJS or EmberJS. For new apps, except for “CMS-like” sites, people will stop generating HTML on the server and push page rendering to JavaScript in the browser. Because of this, back-end frameworks will pivot towards being better REST API servers and focus less on their HTML template rendering abilities. The wildcard is “CMS-style” sites that need to be indexed by Google. While Google’s crawler can execute JavaScript, content-heavy sites will still need a mechanism to serve HTML from the server for reliable SEO. This means that full-stack Rails apps will still be important in 2014. I think the writing is on the wall for this kind of app, however.

Mobile will continue to be “write twice” for the highest quality

Unfortunately, while HTML5 is great, it still won’t deliver the highest quality apps on mobile in 2014. As a cost-saving measure, or for apps that don’t need lots of interaction, HTML5 will be a viable choice. However, to create the highest-quality mobile apps in 2014, we’ll still need to write them twice: once for Android and once for iOS.

Wearable tech won’t be mainstream; in fact, society will push back from being “always connected”

Google Glass and the like will remain curiosities and will not go mainstream. In fact, I think people are beginning to push back against being always connected to the Internet. Smartphone usage in many social situations is becoming a faux pas, and the word “glassholes” has already been coined for people who wear Google Glass in public. That being said, we will see the Internet smartly integrated into more consumer products, including continued innovation in automobile technology and home automation. The key for the “Internet of things” in 2014 will be unobtrusive, discreet, large value-add integration, which probably isn’t wearable technology in its current form.