Using the Draft OAuth Assertion Grant with Google+

The IETF has been working on a new OAuth standard for “assertions” which enables OAuth to work with other types of authentication systems. This can be used to allow users to authenticate with your API through Google+ or other third-party identity providers.

For example, let’s say you are developing a single-page Javascript app or a mobile app that uses both Google’s APIs as well as your own APIs. You’d like to have users authenticate with Google to obtain access to Google’s APIs, but then you’d also like your app to authenticate with your server to gain access to some additional resources. You’d like to not reinvent the wheel and use OAuth for your own API. You also implicitly trust Google to verify the user’s identity, so you don’t want the user to need to go through another OAuth flow just to use your API.

Assertion grants allow you to do this in a standards-compliant way. This is a draft standard that was just submitted in July of 2014, but for this simple use-case, it is already fairly usable.

How Google+ handles sign in for “combination” apps (with both a client and a server)

Google has some great documentation on how to authenticate both a client and a server, which is worth reading if you plan on implementing this. The gist of it is that the client first authenticates with Google through an OAuth popup or redirect. This gives the client both an access token and an access code. The code is then passed to the server to authenticate the backend.

This “passing the code to the backend step” is what OAuth assertion grants enable in a standards-compliant way.

OAuth Assertion Grants

The IETF Assertion Grant spec provides a framework for defining new grant types that are assertions of identity from third parties. An assertion grant looks like this (from the example in the spec):
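(The sketch below is modeled on the SAML bearer example in the draft; the parameter values are illustrative.)

    POST /token HTTP/1.1
    Host: server.example.com
    Content-Type: application/x-www-form-urlencoded

    grant_type=urn%3Aietf%3Aparams%3Aoauth%3Agrant-type%3Asaml2-bearer
    &assertion=PHNhbWxwOl...base64url-encoded-assertion...ZT4
    &scope=profile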

Assertions are very similar to Resource Owner Password Credential grants in that they are passed as HTTP POSTs directly to the /token endpoint. The "grant_type" for an assertion must be an absolute URI that defines the assertion type, the "assertion" is a Base64-encoded string (using URL-safe encoding) that contains the actual assertion, and the "scope" is the same as for other OAuth grant types.

An OAuth Assertion Grant for Google+

Since Google has not defined an assertion grant format for Google+ identity, I’ve decided to make one up! You can feel free to steal this format for your own apps.

For my Google+ assertion grant, I've just chosen "urn:googlepluscode" as the URI. This is arbitrary; ideally Google would standardize a value, but until they do we don't have a better option. For the assertion itself, I use a Base64-encoded, URL-safe version of this JSON:
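(The field names in this sketch are my own; only the Google+ user ID and the access code matter for the verification steps below.)

    {
      "google_plus_user_id": "<Google+ user ID reported by the client-side sign-in>",
      "code": "<one-time access code obtained by the client-side sign-in>"
    }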

Verifying the Google+ assertion grant

When the backend receives the Google+ assertion grant, it should take these steps to verify it (a rough Ruby sketch follows the list):

  1. Convert the access code into an access token
  2. Call Google's tokeninfo endpoint with the access token from the previous step
  3. In the response from the tokeninfo endpoint, confirm these things:
    1. The user_id matches the google_plus_user_id in the assertion
    2. The issued_to from the tokeninfo response matches the client_id of your application (both the front-end and back-end share the same client_id).
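Here is a rough Ruby sketch of those three steps. The endpoint URLs and parameter names follow Google's OAuth2 documentation as of this writing, and the assertion fields match the JSON format defined above; treat it as an outline rather than production code.

    require "base64"
    require "json"
    require "net/http"

    def verify_google_plus_assertion(assertion, client_id, client_secret)
      claims = JSON.parse(Base64.urlsafe_decode64(assertion))

      # 1. Convert the access code into an access token
      token_response = Net::HTTP.post_form(
        URI("https://accounts.google.com/o/oauth2/token"),
        "code"          => claims["code"],
        "client_id"     => client_id,
        "client_secret" => client_secret,
        "redirect_uri"  => "postmessage",  # value used by the Google+ client-side flow
        "grant_type"    => "authorization_code"
      )
      access_token = JSON.parse(token_response.body)["access_token"]

      # 2. Ask Google who the token belongs to
      tokeninfo = JSON.parse(Net::HTTP.get(
        URI("https://www.googleapis.com/oauth2/v1/tokeninfo?access_token=#{access_token}")
      ))

      # 3. The token must be for the asserted user and issued to our client_id
      tokeninfo["user_id"] == claims["google_plus_user_id"] &&
        tokeninfo["issued_to"] == client_id
    end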

Stay tuned for a future post on how to implement this with Rails and Doorkeeper!

How secure is the OAuth2 “Resource Owner Password Credential” flow for single-page apps?

I’ve been working on a single-page, browser-based app and I was investigating using the OAuth2 “Resource Owner Password Credential” (ROPC) flow to log users in without needing a normal OAuth popup or redirect. The single-page app is written by the same developers as the backend API, so it is more trusted than a third-party application (which should never touch a user’s password). However, since it is a client-side application in Javascript, it was unclear to me how to take steps to make this as secure as possible, so I did some research. In this post, I’ll describe what I found.

What the OAuth spec says

The OAuth spec is a dense monster, but is worth digging into since so many sites are using OAuth today. The relevant section of the spec says that the ROPC flow can be used when the resource owner (the user) “has a trust relationship with the client, such as the device operating system or a highly privileged application”, which would apply to an application developed by the same developers as the API server. The spec also says that it should only be used when other flows are “not viable”. This isn’t strictly the case for single-page Javascript applications, which can use the Implicit Grant flow or the Authorization Code flow. However, for clients “owned” by the same owner as the authorization server, the OAuth popup or redirect can be a poor user experience and may confuse users since they wouldn’t expect to “authorize” an app that they perceive as one and the same as the service itself. So, assuming you trust the client and are willing to consider “bad user experience” as “not viable”, you could use the ROPC flow for a front-end client.

The other issue is that Javascript clients cannot disguise their client credentials, because the user may just "view source" to retrieve them. This makes client impersonation possible. It also means that the client is a "public" client for the purposes of the OAuth spec, and client authentication is not possible. The OAuth spec states that when client authentication is not possible, the authorization server SHOULD employ other means to validate the client's identity.

How can we “validate the client’s identity” as best as possible with Javascript clients?

First, because this is a public client under the control of the user, we have to accept that it is impossible to completely prevent client impersonation. You could always impersonate a client with cURL or a web scraper, which is out of the control of the API owner. To prevent this, we'd need some kind of trusted computing architecture where we are 100% certain that the client credentials are protected from prying eyes.

Since we can’t completely prevent client impersonation, we need to define what types of impersonation we are trying to prevent. For Javascript clients, I want to prevent two types of impersonation:

  1. Impersonation by another Javascript client running in a standards-compliant browser on a domain other than the official client's domain
  2. Compromised client Javascript or HTML

Both types of impersonation are already well-known and have solutions in other Internet standards that we can use for this case.

Preventing compromised client source code

For this one, we can simply use SSL for the client’s domain. If the source code has been compromised through a man-in-the-middle attack, the user will see an SSL error in the browser. The OAuth spec already requires that communication to the authorization server’s token and authorization endpoints occur over SSL. It is permitted in the OAuth spec to have a client delivered over HTTP, however.

In order to use the ROPC grant type for Javascript clients, we need to be more strict than the spec and absolutely ensure that the client is delivered over SSL. If the Javascript client is not delivered over SSL, a middleman could tamper with the client’s Javascript to intercept either the resource owner’s credentials or the access token. This makes it impossible for the resource owner to trust the client, which breaks the first chain of trust between the resource owner and the authorization server.

Preventing impersonation by other Javascript clients

The other kind of impersonation we’d like to prevent is another Javascript client (on some other domain) using the official client’s credentials to retrieve access tokens. To do this, we can use the browser’s cross-origin security model.

If your client is on the same origin as your authentication server

If you are running a client on the same origin as the authentication server, requests to the authentication server will be permitted through "normal" AJAX. I believe all you need to do is not permit cross-domain requests (i.e. don't enable CORS) on your authentication server, and the ROPC flow will be unavailable to impersonating clients. Here's why:

  • It is possible to submit a form from another domain to kick off the ROPC flow (a POST to your token endpoint); however, it is not possible for Javascript running on that other domain to access the response. This means that the impersonating Javascript may cause your API server to return an access token via a form submission, but it wouldn't be possible for it to read that token. Since we are not using cookie-based authentication, the client needs to parse the token response for it to become authorized.
  • It is not possible for a third-party (an intermediate proxy) to intercept the token in this way because the browser will be communicating with your server over SSL (you are using SSL for your authentication server, right!?).
  • You need to ensure that potentially-impersonated POSTs to your token endpoint are not in any way destructive. Typically, CSRF attacks (of which this technically is one) lead to a compromise by either setting a cookie that is later used to access a protected resource or causing a POST that takes an abusive action (withdrawing money, say). You'll need to ensure that a POST to your token endpoint doesn't do either of these things.

If your client is on a different origin from your authentication server

If you are running your client on “yourdomain.com” and your API server on “api.yourdomain.com”, you will need to implement CORS anyway. In this case, you should leverage CORS to validate the client. Here’s how you can do it:

  • For every ROPC-enabled client, record in your API server’s database the acceptable Javascript origins for that client.
  • When an ROPC grant comes in, require your client to provide a client ID. Look up that client ID in your database and confirm that the CORS "Origin" header matches the expected origin. Browsers do not permit Javascript clients to forge the "Origin" header, making this robust against Javascript client spoofing (a sketch of this check follows).
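As a sketch of that second point, a Rails-style token endpoint might guard the ROPC flow like this. The Client model and its registered_origin column are assumptions here; if you use a library such as Doorkeeper, it has its own hooks where the same check fits.

    class TokensController < ApplicationController
      def create
        client = Client.find_by(uid: params[:client_id])

        # Reject the request unless the browser-supplied Origin header matches
        # the origin registered for this client ID.
        unless client && request.headers["Origin"] == client.registered_origin
          return head :unauthorized
        end

        # ...continue with the normal ROPC flow: verify the resource owner's
        # username and password, then issue an access token.
      end
    end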

Additional considerations

Since IE9 and below don’t implement CORS correctly, many sites implement work-arounds such as iframe proxies or Flash-based work-arounds. I haven’t looked into the implications of using these, but they definitely need careful consideration to make sure they are not exploitable.

You absolutely should implement some kind of rate-limiting on your token endpoint to prevent brute-force attacks.

Finally, you should never issue public clients a refresh token (or any long-lived access token). The reason for this is that, depending on your backend architecture, these could be difficult to revoke should you need to revoke access to a specific client. For example, if you are using JSON Web Tokens instead of database records, you would need to blacklist all of them to revoke them.

Comments welcome!

OAuth2 is still relatively new (as is CORS), so if I’ve missed any ways for this to be exploited, let me know in the comments! Thanks.

Installing WxPython and RunSnakeRun on Mac OSX 10.9

I just posted about my Ctrl-C strategy for profiling and now I’m going to completely flip-flop and explain how I installed RunSnakeRun, a way to visualize the output of Python’s cProfile. The Ctrl-C way of profiling worked really well for optimizing append performance of my time series storage library, but doesn’t work so great for profiling things that are already very fast (on the order of milliseconds) and need to be faster.

For that, RunSnakeRun worked really well. RunSnakeRun gives you a nice rectangle chart showing in which functions your program spends most of its time.

RunSnakeRun’s rectangle plot of function cumulative run time

To install RunSnakeRun on Mac OSX, you’ll need Homebrew and PIP. You can install it like this:
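Something along these lines works (the Homebrew formula and package names may have changed since this was written):

    brew install wxpython
    pip install SquareMap RunSnakeRun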

Next, you’ll need to export a pstats database with your profiler information. Use cProfile to do this. For TsTables, you can run the benchmark with profiling information like this:
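For example, using cProfile's command-line interface (the benchmark script path here is illustrative):

    python -m cProfile -o tstables.profile tstables/benchmark.py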

This will create a tstables.profile file in the current directory, which you can open with RunSnakeRun. Start RunSnakeRun by running runsnake (assuming that PIP's bin folder is in your path).

The barebones profiling method that is surprisingly effective

I’m working on profiling my time series database (TsTables) because append performance is not what I want it to be. I know that the issue is a few loops that are written in Python instead of using NumPy’s optimized vector operations. I’m not exactly sure which loop is the slowest.

I started trying to get cProfile to work, but ended up with way too much data to be useful. So I reverted to my old school, barebones profiling method: Ctrl-C.

How do you use this method, you might ask? Start your program and randomly hit Ctrl-C: Python will raise a KeyboardInterrupt and print a traceback showing exactly where it stopped. Wherever your program stops most frequently is probably the slowest part. Speed that up and repeat!

Two Podcasts I’ve been Enjoying Recently

Since I've moved offices to WeWork Chinatown, I have about a 20-minute commute on Metro. The nice thing about commutes is that you have some downtime, which recently I've been using to listen to podcasts. If you're looking for podcast recommendations, two good ones are Monocle's The Entrepreneurs and the Tim Ferriss Show.

The Entrepreneurs kind of makes you feel like you are listening to BBC’s World Service or NPR, but the content is focused on business and entrepreneurship. It has a refreshing international and non-technology bent, which is great for getting out of the U.S. technology world and realizing that every new business doesn’t need to be a website started in San Francisco!

The Tim Ferriss Show is by the author of the Four Hour Body and Four Hour Workweek, both NY Times bestsellers. The format is much longer than The Entrepreneurs, and is more of a conversation than a highly-produced magazine-style show. There are only a few episodes of this one, but so far the guests have been very interesting and Tim does a good job interviewing them in depth.

The Ultimate Guide to Dealing with High Frequency Data

I have no idea if this is actually the ultimate guide to high frequency data, but hopefully it is at least a useful guide!

I’m currently working on a project to replace the Time Series Database (TSDB) that I wrote a few years ago. By re-writing it, I’m learning a lot about what works and what doesn’t when dealing with high frequency data. High frequency in this case means billions of records, with timestamp precision down to the millisecond. This data is being used for economic research and analytics, not live trading. It is a pain to deal with because it is simply too large to fit in memory or process with a standard relational database. These are some rules that I’ve developed while working on this project.

Partition your data!

The biggest issue with this much time series data is that you simply cannot index the timestamp column with any normal indexing scheme. In DB-speak, the timestamp column will be a “high cardinality” column, which means it does not lend itself well to indexing. This is a problem because most queries on this kind of high frequency data are to fetch a subset by timestamp, and you do NOT want to make a table scan of a billion plus records to find a few minutes of data.

TSDB attempts to fix this problem by keeping rows in order and creating a sparse index (an index entry for every 20,000 rows). This should work in theory, but you must ensure that your rows are always sequential. That makes inserting or updating difficult, because you potentially need to shift rows to maintain the order. Also, I'm not aware of a relational database that lets you create a sparse index, which rules out the most common and best understood data stores.

Another approach is to partition your data. This is my current approach. The way this works is you simply create multiple tables for set time periods (one table per day or month is a good starting point). You then put records that match those time periods in their respective tables and write some logic to union queries across tables.

Partitioning enables the database to hunt through a subset of rows to find the ones that match your query. Both Postgres and MySQL support partitioning, making them viable options for storing time series data. The library that I’m working on will use PyTables to partition time series data by date.
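For example, in Postgres (which at the time of writing implements partitioning through table inheritance and constraint exclusion), a per-month layout might look roughly like this; the table and column names are illustrative:

    -- Parent table that queries are written against
    CREATE TABLE ticks (
        ts    bigint  NOT NULL,  -- milliseconds since the Unix epoch, UTC
        price numeric,
        size  integer
    );

    -- One child table per month; the CHECK constraint lets the planner skip
    -- partitions that cannot contain the queried time range
    CREATE TABLE ticks_2014_07 (
        CHECK (ts >= 1404172800000 AND ts < 1406851200000)  -- July 2014
    ) INHERITS (ticks);

    CREATE INDEX ON ticks_2014_07 (ts);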

Store timestamps in UTC

Most of the time, your source data will have timestamps in UTC. If it doesn’t, I suggest you convert to UTC before storing. Most libraries use either UTC or local time internally, and because you can never be sure what time zone your users will be in, using UTC is the least common denominator.

UTC also has the nice property of not having daylight saving time changes. DST causes all sorts of pain when working with 24/7 data. Avoid it by just dealing in UTC internally, and then converting to other timezones for querying or display.

Store timestamps as integers, even if your programming language uses floats

MATLAB, Excel, and R all store timestamps internally as floats by default. This gives their timestamp types a large range and high precision, but I don't think it is appropriate for archiving time series data. Why? Floats are imprecise. You do not know with any accuracy the number of significant digits when using a float, and you cannot make comparisons without worrying about round-off errors. Admittedly, even with microsecond data, systems that use a 64-bit double and 1970-01-01 as the epoch will not lose precision until 2242-03-16, but why worry about it? I recommend a 64-bit integer as the timestamp column. With one tick equaling one millisecond, you have a time range of ±292 million years. This is Java's internal representation. With one tick equaling 100 nanoseconds (0.1 microsecond), you have a time range of ±29,227 years, which is what Win32 does. Either should be plenty!
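A quick illustration of why float timestamps make exact comparisons risky while integer ticks stay exact:

    # Ten 100 ms steps, once as float seconds and once as integer milliseconds
    t_float = 0.0
    t_int = 0
    for _ in range(10):
        t_float += 0.1
        t_int += 100

    print(t_float == 1.0)  # False: t_float is 0.9999999999999999
    print(t_int == 1000)   # True: integer arithmetic is exact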

Have a solid ETL procedure

ETL means "extract, transform, load" and is the term for taking data out of one format or system and loading it into another. The step to get right when you are dealing with high frequency data is the "load" step. Try to build a process where you can revert a botched load automatically. If you don't, it's guaranteed that someone or something will screw up an import, and you will be left wading through millions of rows of data to fix it or re-importing everything from your source system.

The most basic way to make the load step revertible is to just make a copy of your time series before writing anything to it. You could devise a more sophisticated process, like using rsync to make diffs of your time series, but be nice to your future self and make backups at the very least!

The Business Model Sketch

I was helping a friend work through some business ideas, and realized that, just as writing an outline helps you structure an essay, doing a "business model sketch" can help you break apart a business idea and evaluate its viability. Just like an essay needs a thesis, a body, and a conclusion or it can fall flat, a business model needs specific components or it is not viable. I identified six components that roughly draw from The Personal MBA and the "lean startup" philosophy. If you are thinking of starting a new business, consider taking one page of paper, drawing six boxes, and filling out these six areas: customer, value proposition, marketing, sales, value delivery, and finance.

The problem with many business ideas is that, while they sometimes hit on a few of these areas, they are weak in others. That can be okay, but as a startup, your goal should be to develop a plan about how to address your idea’s weak points, and quickly test whether that plan is viable. If it is, great! If not, reconsider or proceed with caution!

The template

The customer

The specific person to whom value is being provided and who is making the purchasing decision. Many times a business's value proposition will involve many people. For example, if you are selling family vacations, your business is offering value to the father (maybe you bundle in a few rounds of golf), the mother (there's a reason cruises have spas), and the kids (I don't think Disney pays actors to dress as Mickey Mouse for dad). This is important to understand, especially when you get to the value delivery portion of the business model sketch. However, first identify the decision maker!

Value proposition

The value proposition is a description of the value you are providing your customer. Remember that your customer is the person making the purchasing decision, so when crafting your value proposition, you need to understand the wants and needs of that particular person above any secondary party.

Ideally, focus your value proposition on addressing core human drives. The Personal MBA identifies five core human drives: the drives to acquire, bond, learn, defend, and feel. Many consumer products clearly target their value propositions to these core human drives. This can be difficult if you are not selling consumer products or services, but if you look closely, many business-to-business or business-to-government sales are targeted the same way. Living in Washington, DC, I notice this every time I go through the Pentagon metro station. Large defense contractors clearly target their buyers' core human drive to defend with advertisements that depict the strength and sophistication of their offerings. The advertisements for Northrop Grumman's unmanned drones are a great example.

Marketing

Marketing is fundamentally your strategy for getting your customer’s attention to deliver your value proposition. When you create a business model sketch, you need to attempt to identify how you will reach enough customers at an economical cost. If you have an excellent value proposition but your customers do not know about it, your business model will fail.

In the marketing portion of your sketch, you should also attempt to identify in rough terms the market size and dynamics: how many potential customers do you have and is that number increasing and/or gaining purchasing power?

Sales

If marketing is your strategy for reaching your target customer, sales is your strategy for negotiating a contract to deliver your value proposition in exchange for money. For an online business, this could be as simple as "sign up for PayPal and put a Buy Now button on my website". If you are selling enterprise software, this could mean extensive contract negotiations.

Sales and marketing are closely intertwined, but separating them out to different boxes in the business model sketch should help you to separate the act of reaching your customer and delivering your value proposition (marketing) and the mechanics of “closing the deal” (sales).

Value delivery

This is what many people think of as the meat of your business model, but the previous sections are also key to determining its viability. Value delivery comprises the processes that deliver the promised value to your customers. In the family vacation example, this is the operations of your hotel, from hiring staff to procuring food and drinks for the restaurant. If you are an online business, this is the cost of developing your website as well as the operational cost of running your servers and supporting your customers.

Finance

The finance section should answer “what financial resources do I need to support this business model, and what is the return on investment?”. This is hard to answer in specifics in a business model sketch, and should be tackled in more detail in a complete business plan. However, once you complete the other sections, you should be able to answer these questions:

  • Is this a financial capital-intensive business to set up? To run? If yes, what is the risk to capital committed to the business? What return can I offer owners of capital given that risk? How can I source that capital, loans or equity investment?
  • Is this a human capital-intensive business to set up? If yes, what is my recruitment strategy? Can I attract the right talent? Would I compensate them with equity or a salary?

Writing Zero-Downtime Migrations for Rails and Postgres

Let’s suppose you are building an app. It is under heavy development and the dev team is cranking out new features left and right. Developers need to continually change the database schema, but you don’t want to take down the app for database migrations if at all possible. How the heck do you do this with Rails?

We had this problem recently, and have come up with a procedure that solves it for most small database migrations. The goals of this procedure are to:

  • Avoid downtime by running database migrations while the app is live
  • Avoid too many separate deployments to production
  • Keep the application code as clean as possible
  • Balance the cost of additional coding with the benefit of having a zero-downtime migration. If the cost or complexity of coding a migration in this way is too great, then a maintenance window is scheduled and the migration is written in a non-zero downtime fashion.

The first thing to understand when writing a zero downtime migration is what types of Postgres data definition language (DDL) queries can be run without locking tables. As of Postgres 9.3, the following DDL queries can be run without locking a table:

Postgres can create indexes concurrently (without table locks) in most cases. CREATE INDEX CONCURRENTLY can take significantly longer than CREATE INDEX, but it will allow both reads and writes while the index is being generated.

You can add a column to a table without a table lock if the column being added is nullable and has no default value or other constraints.

If you want to add a column with a constraint or a default value, one option is to add the column first with no default value and no constraint, then in a separate transaction backfill the value (using UPDATE) or use CREATE INDEX CONCURRENTLY to build an index that will be used for the constraint. Finally, a third transaction can add the constraint or default to the table. If the third transaction is adding a constraint that uses an existing index, no table scan is required.
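For example, adding a unique, indexed column without long locks might be split into three steps like this (the table, column, and index names are illustrative):

    -- Step 1: nullable column, no default, no constraint: a metadata-only change
    ALTER TABLE users ADD COLUMN email text;

    -- Step 2: build the supporting index without blocking reads or writes
    -- (CREATE INDEX CONCURRENTLY cannot run inside an explicit transaction block)
    CREATE UNIQUE INDEX CONCURRENTLY index_users_on_email ON users (email);

    -- Step 3: attach the constraint to the existing index: no table scan needed
    ALTER TABLE users
      ADD CONSTRAINT users_email_unique UNIQUE USING INDEX index_users_on_email;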

Dropping a column only results in a metadata change, so it is non-blocking. The space is actually reclaimed later, as rows are rewritten and the table is vacuumed.

Creating a table or a function is obviously safe because no one will have a lock on these objects before they are created.

Process for Coding the Migration

The guidelines I have been using for writing a zero-downtime migration are as follows:

  • Step 1: Write the database migration in Rails.
  • Step 2: Modify the application code in such a way that it will work both before and after the migration has been applied (more details on this below). This will probably entail writing code that branches depending on the database state.
  • Step 3: Run your test suite with the modified code in step 2 but before you apply the database migration!
  • Step 4: Run your test suite with the modified code in step 2 after applying the database migration. Tests should pass in both cases.
  • Step 5: Create a pull request on Github (or the equivalent in whatever tool you are using). Tag this in such a way that whoever is reviewing your code knows that there is a database migration that needs careful review.
  • Step 6: Create a separate pull request on Github that cleans up the branching code you wrote in step 2. The code you write in this step can assume that the DB is migrated.

When the migration is deployed, you'll first deploy the code reviewed in step 5. This code will be running against the non-migrated database, but that is a-ok because you have tested that case in step 3. Next, you will run the migration "live". Once the migration is applied, you will still be running the code reviewed in step 5, but against the migrated database. Again, this is fine because you have tested that in step 4.

Finally, once the production database has been migrated, you should merge your pull request from step 6. This eliminates the dead code supporting the unmigrated version of the database. You should write the code for step 6 at the same time you write the rest of this code. Then just leave the pull request open until you are ready to merge. The advantage of this is that you will be “cleaning up” the extraneous code while it is still fresh in your mind.

Branching Application Code to Support Multiple DB States

The key to making this strategy work is that you'll need to write your application code in step 2 in a way that supports two database states: the pre-migrated state and the post-migrated state. The way to do this is to check the database state in the models and branch accordingly.

Suppose you are dropping a column called “deleted”. Prior to dropping the column, you have a default scope that excludes deleted rows. After dropping the column, you want the default scope to include all rows.

You would code a migration to do that like this:
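A migration along these lines would do it (the posts table follows the example above):

    class RemoveDeletedFromPosts < ActiveRecord::Migration
      def up
        remove_column :posts, :deleted
      end

      def down
        add_column :posts, :deleted, :boolean, default: false
      end
    end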

Then, in your Post model, you’d add branching like this:
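One way to branch is to check whether the column still exists in the schema the app is running against; a sketch:

    class Post < ActiveRecord::Base
      # Before the migration runs, the "deleted" column still exists and the old
      # default scope applies; after it runs, all rows are returned.
      if column_names.include?("deleted")
        default_scope { where(deleted: false) }
      end
    end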

But doesn't this get complicated for larger migrations?

Yes, absolutely it does. When branching like this gets too complicated, we either sequence the DB changes over multiple deployments (and multiple sprints, in the Agile sense) or "give up" and schedule a maintenance window (downtime) to make the change.

Writing zero-downtime migrations is not easy, and you’ll need to do a cost-benefit analysis between scheduling downtime and writing lots of hairy branching code to support a zero-downtime deploy. That decision will depend on how downtime impacts your customers and your development schedule.

Hopefully, if you decide to go the zero-downtime route, this procedure will make your life easier!

Querying Inside Postgres JSON Arrays

Postgres JSON support is pretty amazing. I’ve been using it extensively for storing semi-structured data for a project and it has been great for that use case. In Postgres 9.3, the maintainers added the ability to perform some simple queries on JSON structures and a few functions to convert from JSON to Postgres arrays and result sets.

One feature that I couldn't figure out how to implement using the built-in Postgres functions was the ability to query within a JSON array. This is fairly critical for lots of the reporting queries that I've been building over the past few days. Suppose you have some JSON like this, stored in two rows in a table called "orders", in the column "json_field":
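Something like this, where each row's json_field holds an object with a products array (the names other than products and id are illustrative):

    {"products": [{"id": 1, "name": "Widget"}, {"id": 2, "name": "Gadget"}]}
    {"products": [{"id": 2, "name": "Gadget"}, {"id": 3, "name": "Gizmo"}]}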

If you want to run a query like “find all distinct IDs in the json_field’s products array”, you can’t do that with the built in JSON functions that Postgres currently supplies (as far as I’m aware!). This is a fairly common use case, especially for reporting.

To get this to work, I wrote this simple PL/pgSQL function to map a JSON array.
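A minimal version looks like this; the function name and the text[] path argument are one reasonable choice, not the only one:

    CREATE OR REPLACE FUNCTION json_array_map(json_arr json, path text[])
    RETURNS json[] AS $$
    DECLARE
      item   json;
      result json[] := '{}';
    BEGIN
      -- Walk the array, pull out the element at "path" from each entry, and
      -- collect the results into a native Postgres array of json values
      FOR item IN SELECT * FROM json_array_elements(json_arr) LOOP
        result := result || (item #> path);
      END LOOP;
      RETURN result;
    END;
    $$ LANGUAGE plpgsql IMMUTABLE;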

Given a JSON array as "json_arr" and a JSON path as "path", this function loops through all elements of the array, locates the element at the path in each one, and collects them into a Postgres native array of JSON elements. You can then use other Postgres array functions to aggregate it.

For the query above where we want to find distinct product IDs in the orders table, we could write something like this:
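For instance, using the json_array_map function from above:

    SELECT DISTINCT product_id::text::int AS product_id
    FROM orders,
         unnest(json_array_map(json_field -> 'products', '{id}')) AS t(product_id);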

That would give you the result:
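With the two sample rows above, the three distinct product IDs come back:

     product_id
    ------------
              1
              2
              3
    (3 rows)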

Pretty cool!