Building a Big Data platform with Node.js, ElasticSearch and HBase

I recently resigned from my job at Pearson, after almost two years as a Senior Javascript Developer building their Big Data platform, Palomino. I couldn't have asked for a better team to work with. We started small, with just 4 of us, but ended up growing into a 10-strong team. We used agile methodologies, always trying to put people before processes, in what we colloquially describe as a "non-bullshit" approach. Here is an informal account of my experiences and the lessons I've learnt building the platform.

A look at the Palomino Big Data Platform.

Starting out small

Initially, my job was to create the front end for the system, so that users could build queries and visualise the results. I started looking into MV* libraries. My first choice was ember.js, mainly because I had heard good things about it on the IRC channels I frequent. I decided to go ahead and start a quick pet project with it to see if I liked it. The first thing we tried to build was a user story manager for our team. While the results were good, we felt that ember would be a bit cumbersome for building what we wanted, so we went for something less 'all-or-nothing' in nature.

I decided to go for Backbone.js because it made the most sense to me: Models, Views and Routers that can be used independently. There's no need to fully commit to the framework: you can pick and choose the bits you like, or use the whole thing. The source code is tiny and easily readable, and the documentation is excellent. The building blocks are easily customisable, allowing you to override, add or remove methods as you see fit, and they gave me a more hands-on feel than other solutions such as Angular and Ember.
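
To give a flavour of how little ceremony Backbone asks for, here's a minimal sketch of a model and a view. The names and markup are purely illustrative, not our actual code:

var Query = Backbone.Model.extend({
  defaults: { interval: 'day' }
});

var QueryView = Backbone.View.extend({
  events: { 'change select.interval': 'updateInterval' },

  // keep the model in sync with the dropdown
  updateInterval: function (e) {
    this.model.set('interval', e.target.value);
  },

  render: function () {
    this.$el.html('<select class="interval">...</select>');
    return this;
  }
});

Nothing forces you to use the Router, or Collections, or anything else: each piece stands on its own, which is exactly what made it a good fit for us.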

The front end was designed to allow users to build queries and visualise them. In order to create in-browser graphs, we initially chose NVD3, as it provided an easy way to create elegant, nice-looking visualisations. This would prove a problematic decision: once we had a product manager on board, customisation requirements would prompt us to create our own charting library, Bridle. More on that later.

We need an API

From the outset, we realised we would need to create some sort of API mid-layer, because dealing with the raw Elasticsearch results was a bit difficult on the client side. I was given free rein over which technologies to use, and we decided to build the API with Node.js and Express.

The API would take standardised queries from the front end and translate them into Elasticsearch's query DSL. At first, ES's query language seemed very awkward to me, especially because of its deeply nested structure. However, once I became familiar with it, it became easier to use, and quite enjoyable.

We make heavy use of date histogram facets, as our system mainly counts events and reports on them. A simple date histogram query in Elasticsearch looks a bit like this:

{
  "query" : {
    "match_all" : {}
  },
  "facets" : {
    "myHistogram" : {
      "date_histogram" : {
        "field" : "apples",
        "interval" : "day"
      }
    }
  }
}

Date histogram faceting lets you create time series. However, building these query objects is quite tricky, so we decided to do that work on the backend.
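
To give a rough idea of that backend work, here's a simplified sketch of a query builder in the style of the old facets API. The field names and filter shapes are illustrative, not our production code:

// builds an ES query object with an optional set of filters
// and a date histogram facet over the given field and interval
function buildHistogramQuery(field, interval, filters) {
  var query = (filters && filters.length)
    ? { filtered: { query: { match_all: {} }, filter: { and: filters } } }
    : { match_all: {} };

  return {
    query: query,
    facets: {
      myHistogram: {
        date_histogram: { field: field, interval: interval }
      }
    }
  };
}

// e.g. buildHistogramQuery('apples', 'day', [{ term: { user: 'alice' } }])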

The Node.js app we built was initially pretty simple: an Express router catches the data requests, parses the request body for filters, builds an Elasticsearch query object and then queries our ES cluster. We kept the results in memcache and stored the queries sent in a MongoDB database.
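
A stripped-down version of that first route might look something like this. The URL, index name and cache object are all stand-ins for illustration (in reality the cache was memcache, and later Redis):

var express = require('express');
var request = require('request');

var app = express();
app.use(express.json());

// naive in-memory cache standing in for memcache/redis in this sketch
var cache = {
  store: {},
  get: function (key, cb) { cb(null, this.store[key]); },
  set: function (key, val) { this.store[key] = val; }
};

app.post('/api/data', function (req, res) {
  var esQuery = buildHistogramQuery(req.body.field, req.body.interval, req.body.filters);
  var cacheKey = JSON.stringify(esQuery);

  cache.get(cacheKey, function (err, cached) {
    if (cached) return res.json(JSON.parse(cached));

    // hit the ES cluster directly and cache whatever comes back
    request.post({
      url: 'http://our-es-cluster:9200/events/_search',
      json: esQuery
    }, function (err, response, body) {
      if (err) return res.send(500);
      cache.set(cacheKey, JSON.stringify(body));
      res.json(body);
    });
  });
});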

After a few iterations, we realised the API was quite slow - especially when many users were querying the system at the same time. Some queries return a lot of data that needs processing into a time series, which can be time-consuming. To overcome this, we moved to a worker-farm architecture.

A main node process receives a data request and checks whether it is in the memory store or not. If it is, it returns the results right away. If it isn't, it spawns a worker process that queries the Elasticsearch servers in the background. While the worker is busy querying, we return a 201 / 202 response to the user, indicating that their request is in progress and will be available shortly. Query consumers can just poll the same URL or follow the Location header of the response. Once the request completes, the worker exits and the data is stored in the memory store, so that next time the user polls, it'll be returned right away.
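
A minimal sketch of that flow with child_process.fork, assuming the same app and cache objects as the earlier snippet (worker.js and runElasticsearchQuery are hypothetical names, not our actual modules):

var fork = require('child_process').fork;
var inProgress = {};

app.post('/api/data', function (req, res) {
  var cacheKey = JSON.stringify(req.body);

  cache.get(cacheKey, function (err, cached) {
    if (cached) return res.json(JSON.parse(cached));

    // kick off a worker unless one is already running for this query
    if (!inProgress[cacheKey]) {
      inProgress[cacheKey] = true;
      var worker = fork(__dirname + '/worker.js');
      worker.send(req.body);
      worker.on('message', function (results) {
        cache.set(cacheKey, JSON.stringify(results));
        delete inProgress[cacheKey];
        worker.kill();
      });
    }

    // tell the client to come back for the results
    res.set('Location', req.originalUrl);
    res.send(202);
  });
});

// worker.js (sketch): query ES, send the processed results back, exit
process.on('message', function (queryBody) {
  runElasticsearchQuery(queryBody, function (err, results) {
    process.send(results);
    process.exit(0);
  });
});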

This new architecture improved response times and let our servers handle much heavier loads. The only downside is that it moved the bottleneck up into the cluster. But at least it wasn't my fault anymore :)

The other main change we made to the architecture was to move from memcache to Redis. We constantly hit size constraints on memcache when storing data and results larger than 2MB, so we switched to Redis, which let us store items of any size. The errors went away, and the system works perfectly.
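
With the node_redis client, the cache layer boils down to a couple of small helpers along these lines (the one-hour TTL is illustrative, not necessarily what we used):

var redis = require('redis');
var client = redis.createClient();

function cacheResults(key, results, cb) {
  // SETEX takes a TTL in seconds
  client.setex(key, 3600, JSON.stringify(results), cb);
}

function getCached(key, cb) {
  client.get(key, function (err, cached) {
    cb(err, cached && JSON.parse(cached));
  });
}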

Timezones, bloody timezones

One of the biggest problems we faced was handling timezones. There is now a big poster at the office to remind us of this fact: 

This poster is hung on the wall to remind us of the blood, sweat and tears shed to get Timezones right.

Unfortunately, Elasticsearch likes to bucket data according to the interval you specify in your date histogram query. Our initial, naive approach consisted of using UTC everywhere and simply relying on the front end sending an ISOString for the date ranges.

However, this meant that for users in the USA or India, the data would be skewed. This is because when you convert a JavaScript Date object to an ISOString, you lose the timezone information - it gets turned into a UTC string, with the `Z` at the end. When the API received this string, it had no idea which timezone the query originated from, as it was just calling Date.parse() on it.
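
A quick illustration of that loss, for a browser whose local timezone happens to be America/New_York (UTC-5):

var d = new Date(2013, 0, 1);   // local midnight, 1 Jan 2013
d.toString();                   // something like "Tue Jan 01 2013 00:00:00 GMT-0500 (EST)"
d.toISOString();                // "2013-01-01T05:00:00.000Z" - the offset is folded into UTC
Date.parse(d.toISOString());    // the API only ever sees the UTC instant, not the offset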

With the timezone information getting lost in translation, we tried two approaches. First, we tried to use moment.js to include proper ISO timezone offsets in the date strings we sent to the API. Unfortunately, when slicing on a day/week/month interval, Elasticsearch ignores the timezone information and still returns results bucketed for UTC. This was confusing our American users, and wasn't acceptable.

It turns out that Elasticsearch allows bucketing to be adjusted by passing timezone information to the date histogram facet. So our second approach was to perform timezone detection on the client side and send the resulting IANA / Olson timezone string (e.g. "Europe/London") as part of the query. Client-side timezone detection isn't 100% accurate (it won't differentiate between, say, Europe/Berlin and Europe/Stockholm), but the tests we wrote passed on the browsers we support.
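
Roughly, that looks like the sketch below. The detection here uses a library along the lines of jsTimezoneDetect (not necessarily what we shipped), and the exact parameter name on the facet (time_zone vs pre_zone) depends on your ES version:

// client side: detect the user's timezone and send it with the query
var tz = jstz.determine().name();   // e.g. "America/New_York"

// server side: the API passes the timezone through to the facet
var facet = {
  date_histogram: {
    field: 'apples',
    interval: 'day',
    time_zone: tz
  }
};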

Timezones added an extra layer of complexity that no one in the team had predicted. We learnt a lot about timezones in the process, and now we're all self-professed subject matter experts.

Build your own reusable charts library

I mentioned before that we were using NVD3 for drawing charts. While I consider myself to be pretty good at D3.js, I was attracted by the philosophy of the library: it's built on D3.js, following Mike Bostock's reusable chart paradigm, it has very nice transitions, and it provided all the visualisations we needed. This would free up my time to work on the API and the front end of the app.
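
The reusable chart paradigm itself fits in a few lines: a closure that draws into a selection, with chainable getter/setters for configuration. This is the general shape rather than NVD3's (or Bridle's) actual code:

function barChart() {
  var width = 600,
      height = 200;

  function chart(selection) {
    selection.each(function (data) {
      // bind the data and create the svg element on first call
      var svg = d3.select(this).selectAll('svg').data([data]);
      svg.enter().append('svg');
      svg.attr('width', width).attr('height', height);
      // ...draw bars, axes and transitions here
    });
  }

  chart.width = function (value) {
    if (!arguments.length) return width;
    width = value;
    return chart;
  };

  chart.height = function (value) {
    if (!arguments.length) return height;
    height = value;
    return chart;
  };

  return chart;
}

// usage: d3.select('#chart').datum(myData).call(barChart().width(800));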

Before we settled on NVD3 we also tried Rickshaw.js, but it didn't stick with us for long - it was hard to customise and most of the code was pretty obscure. NVD3 seemed like a better candidate.

But shortly after we adopted NVD3, Novus Partners, the investment company the creator of the library worked for, took down the entire GitHub repo.

This was allegedly done as a response to the increased popularity of the library, and well, investment bankers being investment bankers, they must have seen profit in the library and decided to take it down in an effort to preserve their IP. 

We decided to write our own JavaScript charting library based on D3.js: Bridle.

The thing is, by that point many people had forked the library, many had contributed bugfixes and new features, and, well, the library was already out in the open.

In the end, Novus Partners did a bit of a U-turn and a month later the repo was back up and running. But by that time, we had realised the problems that can arise when you depend on libraries built by third parties, and decided to create our own charting library, Bridle.

That library is now at the core of the next version of the system's frontend, and it's being adopted by other areas of the business. It is completely open source, so why don't you go ahead and fork it?

All in all, I've had a great stint working here - I think it was probably the best in my career - and I'm now looking at the future with optimism. I owe a lot to my team and I'll surely miss them.

Here's to the future.