A Programmer's Development and Testing Experience at Sweden's Largest Dating Site


Programmer Rinat Abdullin has been live-blogging for eight months about how he and a three-person team are rebuilding HappyPancake, Sweden's largest dating site. The posts cover everything from design and development to testing, and are well worth a read.


I’m publishing this article in a series of regular updates. The last one was on August 16.

Introduction

At the end of 2013 I was invited by Tomas Roos to join the team of HappyPancake, the largest free dating site in Sweden. The site was originally written in ASP.NET with an MS SQL database server and had grown into a rather complex solution that was expensive to scale.

Together with Pieter Joost, our small distributed team of three started redesigning the site towards something simpler that would be easier to evolve in the future.

This is the story of that project. The process is ongoing - I usually add one new chapter each week to the R&D Blog of HappyPancake, appending the same text to this post.

December 17 - My first week

My starting days at HappyPancake were quite intense and interesting, despite the fact that I could spend only 20 hours per week on the project. I learned a lot of things that would be completely out of reach for a .NET developer staying within the Microsoft ecosystem. Here are some bullet points:

  • Google Hangouts work nicely for team collaboration and screen sharing.
  • SyncSpace drawing app is probably the best collaborative white-board for a distributed team.
  • Mindmups are great for collaborative and personal brain-storming.
  • Erlang is a great language for building low-latency and CPU-efficient apps. It has some learning overhead for a .NET guy like me.
  • Golang is a great language with good ecosystem and really good performance. If compared to erlang, golang has a lower learning overhead for a .NET guy.

Within these days we invested time in establishing high-level design guidelines for the system to be implemented. High-level goals were:

  • Iterative development with emergent design (we don’t know all the answers)
  • Micro-services with support for experimentation and A/B testing (to get these answers and discard wrong assumptions)
  • Base design evolution on reality and measuring it to validate assumptions
  • Ubiquitous language for communication between services : HTTP with JSON/HTML
  • Any language within the service (as long as it runs on Linux)
  • Designing for a distributed team that wants to spend years learning things, experimenting and playing with cool tech.

Making this work requires a lot of responsibility and ownership, which have to be factored into the design, as well. We currently believe that Micro-Services approach with a bit of Programmer Anarchy might work well for our case, as a foundation for building things up.


For the upcoming week I plan to continue catching up with Golang (currently reading through The Way To Go) and then start drafting a prototype of a low-latency Message Bus with Event Store capabilities and a FoundationDB backend (codename BroStore).

December 23 - Language is an implementation detail

Although everything about working at HPC is interesting, last week was quite peculiar on its own.

There was an interesting discussion about the use of async pub-sub messaging for communication between micro-services. That's what Fred George does, for example, with event messages. However, command messages have their own value as well, due to the behaviour that we associate with them (only one message handler may deal with a command message, unlike event messages, where there can be 0 or more subscribers).

Yet, after a bit of discussion with Tomas, we discovered that introducing command messaging breaks our nice decoupling with regard to ease of upgrades, experimentation and continuous delivery. Besides, if needed, you can always implement command messaging within the boundaries of a service. This is possible because we place a clear separation between high-level design decisions (the ones which describe how mServices should behave and communicate) and implementation details (which govern how mServices are actually implemented).
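
To make that semantic difference concrete, here is a minimal in-process Go sketch. It is an illustration only (the names are made up and our real bus speaks HTTP/JSON): an event may be observed by zero or more subscribers, while a command must be accepted by exactly one handler.

    package main

    import "fmt"

    // Bus is a toy in-process bus: events fan out to any number of
    // subscribers, while a command must have exactly one registered handler.
    type Bus struct {
        subscribers map[string][]func(string) // event name -> zero or more handlers
        handlers    map[string]func(string)   // command name -> single handler
    }

    func NewBus() *Bus {
        return &Bus{
            subscribers: map[string][]func(string){},
            handlers:    map[string]func(string){},
        }
    }

    // Publish delivers an event to every subscriber; nobody listening is fine.
    func (b *Bus) Publish(event, payload string) {
        for _, h := range b.subscribers[event] {
            h(payload)
        }
    }

    // Send routes a command to its single owner and fails loudly otherwise.
    func (b *Bus) Send(command, payload string) error {
        h, ok := b.handlers[command]
        if !ok {
            return fmt.Errorf("no handler registered for command %q", command)
        }
        h(payload)
        return nil
    }

    func main() {
        bus := NewBus()
        bus.subscribers["ProfileUpdated"] = append(bus.subscribers["ProfileUpdated"],
            func(p string) { fmt.Println("newsfeed saw:", p) },
            func(p string) { fmt.Println("search index saw:", p) },
        )
        bus.handlers["UpdateProfile"] = func(p string) { fmt.Println("profile service handles:", p) }

        _ = bus.Send("UpdateProfile", "user-42") // exactly one handler
        bus.Publish("ProfileUpdated", "user-42") // zero or more subscribers
    }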

For example, here are some high-level design decisions:

  • Protocol for communications – JSON/HTML over HTTP in our case
  • Messaging semantics – async pub/sub in our case
  • Approaches for deployments, versioning them and experimenting with them – rapid iterations, A/B testing, using business metrics as the driver
  • Set of recommended languages and technologies – still evaluating (see below)
  • Design and development priorities – creating fun environment to work in, keeping things real, small and simple
  • Execution and hosting constraints – Linux, clustered in our own DC with geo-replication
  • Additional constraints – low latency and failure-tolerant

Curiously enough, we are still iterating through suitable languages for implementing the new version of HPC (while also addressing the design and domain questions). So for this week I'm going to spend more time learning Haskell (in addition to the dives into Erlang and Golang during the previous weeks). At the same time, our rewrite will probably start in .NET with a micro-services design. The reason: .NET, despite its shortcomings and costs, is the platform where we would all be most productive and could release initial versions fast. This is crucial for gaining the real-world feedback needed to evolve the system. Then, as the need arises, micro-services will be rewritten in one or more Linux-friendly functional languages to:

  • Save on licensing costs.
  • Improve performance and reduce latency.
  • Make code more simple and concise.
  • Stand on the shoulders of giants, reusing ecosystem and communities of languages we choose.

By the way, if you listen to Fred George, he mentions that at one point 150000 lines of Java code were rewritten in roughly 4000 lines of Clojure. Based on my exposure to Haskell so far, I'd say that C# is almost as verbose as Java in this sense.

In other words, languages are treated just like implementation details of the system. Even though there are some recommendations and guidelines, developers should be able to choose the tool they want in order to get the job done in the most efficient way.

January 18 - Moving forward with golang

After a couple of iterations we settled on the Go language as the primary language for the rewrite of HappyPancake from C#. Ideally we'll converge on Haskell later (that's something I would really like, due to its powerful type system and high suitability for capturing domain models). However, for the time being the primary language will be Go. The reasons:

  • Simplicity of the language and similarity to C#
  • Excellent ecosystem for the development of backend servers
  • Availability of go drivers for FoundationDB
  • Linux development stack (Ubuntu + Sublime/Vim) without large license fees
  • Language is expressive enough for our needs
  • Excellent resources and help tools

Why FoundationDB is so important to us deserves another blog post (long story short: it is like a fast Redis with proper clustering support and only one data structure – sorted key-value ranges).

There are a few downsides of golang that we are going to live with:

  • The concept of workspaces is somewhat messed up (imagine that you have to work with two versions of the same library). However, this is not nearly as bad as the dll and nuget hell in the .NET world
  • The absence of generics, or of type inference that could stand in for them

Getting started with golang was rather simple. Tomas and I worked through a handful of introductory resources together.

All of these resources are an easy read (mostly thanks to the simplicity of the language itself). While doing that I set up an Ubuntu (LTS) VM with Sublime Text 2 and the GoSublime package. Given all that, it was relatively easy to start porting layer code for FoundationDB from Python to golang.


I'm still running my dev environment as a VM on my MacBook Air, although Ubuntu can live fine with 1GB of RAM, unlike a Windows VM that had to ask for 2GB. And since Parallels does not work well with Linux VMs, I use VMware Fusion.

While working on the layer code, I also had to pick up Python along with its REPL. The syntax was a bit odd in the beginning, but quite simple in the long run. No tutorials were even needed.


For the next week I plan to finish porting queue and pub/sub layers for FoundationDB from python to golang. We’ll see how it goes from there.

February 02 - Getting started with FoundationDB

During the last week at HPC my focus has been on FoundationDB. FDB is a nice NoSQL database with a bunch of great properties:

  • It stores key-value pairs of bytes, where keys are always sorted. You have the usual GET/SET/DELETE operations, along with range operations that follow from the sorted nature of the keys
  • Multiple key operations can happen in a transaction
  • Many advanced operations can be implemented as Layers on top of that storage abstraction. There is even a SQL layer for that.
  • FDB scales nicely as you add new nodes to the cluster
  • Cluster of up to 6 nodes can be used for free in production
  • We (or Tomas, to be more precise :]) managed to get 75k write transactions out of a small cluster we set up at DigitalOcean
  • Setting up a cluster is a no-brainer even for a Linux noob like me
  • FDB handles load distribution automatically, it moves data as necessary, too
  • FDB has client libraries for python, golang, erlang, Node.js and even .NET
  • Their team is extremely helpful and humble
  • You can configure level of replication (e.g.: single, double, triple) before write is ACKed
  • FDB can be configured to store views in memory or on disk, transaction store is always durable

I personally really like that FDB is extremely opinionated about what it does (just a replicated transactional key-value storage), but it does this extremely well so far.

We are planning to use FDB as our event store and for persisting view models (which will be replicated across the cluster). I'm the one having fun with that implementation. Event storage itself is a simple abstraction; however, making the implementation work properly on top of FDB's key-value storage requires deeper insight into the inner workings of FDB. Plus, I get to do it in Go.
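
As an illustration of what an event-store layer over a sorted key-value database involves, here is a minimal sketch of one possible key layout (hypothetical, not our actual layer): each stream's events live under a common prefix and are ordered by a zero-padded version number, so reading a stream becomes a single range scan inside an FDB transaction.

    package main

    import "fmt"

    // eventKey builds keys like "es/chat-42/00000000000000000007". Because
    // keys are sorted lexicographically, zero-padding keeps numeric order.
    func eventKey(stream string, version uint64) []byte {
        return []byte(fmt.Sprintf("es/%s/%020d", stream, version))
    }

    // streamRange returns the [begin, end) pair covering every event of a
    // stream; in FDB this would be handed to a range read in a transaction.
    func streamRange(stream string) (begin, end []byte) {
        prefix := "es/" + stream + "/"
        return []byte(prefix), []byte(prefix + "\xff")
    }

    func main() {
        fmt.Printf("%s\n", eventKey("chat-42", 7))
        begin, end := streamRange("chat-42")
        fmt.Printf("range scan from %q to %q\n", begin, end)
    }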

My next week will focus on getting our full planned stack to play together (in an extremely hacky way), so that we can start developing components for HPC2.

PS: my current development environment looks like this (Ubuntu LTS + “awesome” tiling manager + sublime + GoSublime):

[screenshot of the development environment]

February 08 - Evolving the stack and learning nanomsg

Last week with HappyPancake was my first full-time week with the team. Time flew fast and left me wishing for more.

I explored nanomsg (glorified sockets) and how it is used in golang. nanomsg is going to be our communication layer between components within the app, hence understanding its usage patterns was important. It was extremely exciting to pair with Pieter on go programming exercises (and also picking up some Sublime/Linux tricks along the way).

While developing my first golang+nanomsg+FDB prototype, I was quite proud of the first code I wrote. It was relatively robust, simple and somewhat performant. Over the next few days I realised that it was actually an overcomplicated and underperforming piece of software. Pieter shared his process for structuring and expressing ideas in golang. Tomas explained how to make that code brutally simple and yet more performant. That was awesome! Previously it would have taken me months or years to arrive at that breath-taking understanding of how stupid I was. With this team everything happens so much faster. Love it.

During the week we set ourselves a goal of building a system which can return uncached reads from the front (reverse proxy) within 25ms under a load of 50000 HTTP requests per second, while degrading gracefully under increased load. This assumes that we run 3 relatively small app servers, 2 reverse proxies and an FDB cluster of 5 nodes. Obviously, throwing more hardware at the system should scale it out further. It is nice to be working on a system where latency is one of the design constraints and having fun is another.

When I had to leave on Friday evening, Tomas and Pieter were discussing the process of getting rid of state in services by pushing it all the way to the reverse proxy (using Lua on nginx to take care of HTTP connections, while preserving true stateless asynchrony over nanomsg on the inside). This approach has a synergy with building a resilient system of small systems (aka micro-services architecture) communicating over asynchronous events and continuously evolving (versioning, A/B testing and continuous delivery are among the goals).

Apparently, during the course of the evening, they refined the approach to make it even more simple and robust. I can’t wait to see what this idea has turned into.

By the way, FoundationDB just hit version 2.0. They published a golang client library with that release, making use of FDB a breeze. Also, as Pieter and Tomas reported, upgrading our test cluster to v2.0 took 4 minutes. ETCD was also bumped to 0.3.0.

February 17 - Designing for throughput and low latency

For the last week I spent most of the time pairing with Pieter, learning more about the performance and behaviour of our anticipated stack (for the second version of HappyPancake.com). It was a thoroughly interesting exercise in systems engineering.

Here is what our anticipated stack looks like right now:

With this design we want to have 25ms latency for HTTP reads (non-cached, 99th percentile) at a throughput of 50000 requests per second. A/B testing, feature toggling, continuous delivery and live upgrades (with ghost mode and ramp-up) included.

Here is a short summary of lessons learned within the last week:

  • Tomas is an absolute beast when it comes to crunching out small open source libraries
  • It is quite easy to publish statistics from an app and then gather them in a nice WebUI for crunching (using the client library fsd to publish to a local statsD daemon via UDP; statsD currently pushes stats to Librato Metrics with a delay of 10 seconds). A minimal sketch of the UDP publish appears after this list.
  • HTTP servers in Go are quite good, but can be a pain to extend or augment
  • Nanomsg is really nice and performant, however the documentation is … lacking.
  • Profiling capabilities of Golang are absolutely stunning.
  • Spending a week developing and debugging golang apps, while benchmarking them on a Digital Ocean cluster – teaches you a thing or two about Linux environment. It is awesome.
  • Software engineering is about making theories about how your code will behave in production, then running experiments to validate these theories. You iterate and learn.
  • Pairing up with somebody is an amazing opportunity to transfer knowledge and produce better quality code (I lost track of the number of times I was stunned and humbled by the experience and insight of other team members – so much to learn). We currently use TeamViewer (best image and keyboard sharing experience) and Skype for the voice. Campfire is for group chats (and chat ops).
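
For reference, publishing to a local statsd daemon really is tiny. The sketch below uses the generic statsd line protocol ("name:value|type") over UDP; it is not the fsd client library mentioned above, and the metric names and port are illustrative.

    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        conn, err := net.Dial("udp", "127.0.0.1:8125") // default statsd port
        if err != nil {
            fmt.Println("statsd unreachable, dropping metric:", err)
            return
        }
        defer conn.Close()

        // Increment a counter and record a timing sample; statsd aggregates
        // them and forwards the result to the configured backend (Librato in
        // our setup).
        fmt.Fprintf(conn, "web.requests:1|c\n")
        fmt.Fprintf(conn, "web.latency_ms:12|ms\n")
    }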

For the upcoming week I'll be working on pushing our stack closer to the desired performance numbers (we don't meet the goal yet). It is an interesting exercise which forces you to learn a lot and go deep (to the point of tuning the OS).

February 24 - Containers, virtualization and clusters

Last week was a bit hectic. We are waiting for a bunch of datacenters to provide us with test access to virtualised and bare-metal servers. These have to be benchmarked in order to see how they perform.

Some time during the week we realised two things:

  • We aren’t going to get decent numbers out of the machines on DigitalOcean
  • Apparently DigitalOcean is using some cheap virtualisation environment which massively underperforms compared to VMWare

This realisation led us to the point where we started evaluating the dedicated hardware option instead of VMs. We are going to run Docker containers on them anyway, so there is not going to be any vendor or hardware lock-in. Here are a few notes on that:

  • Dedicated hardware is fast
  • Good virtualisation software adds little overhead on top of HW; bad virtualisation – kills any performance
  • Docker containers add very little overhead (should be way below 10% in our scenario) but help a lot with software compartmentalisation

Within the last week I was following in the footsteps of Pieter, learning from him and writing my first docker containers. There are a few gotchas, but the entire concept is amazing! It is probably the best IT thing that happened to me since beer. It solves the "works on my machine" syndrome in most cases, making it extremely easy to work with software both locally and in remote environments. The experience is a lot better than the Lokad.CQRS abstractions for file and Azure backends that I came up with earlier.

Eventually, while setting up the containers over and over again, we came to the conclusion that we want to automate the entire thing. Running a script to deploy new versions requires context switching, which Tomas and Pieter don't like (I've never been as productive in Linux as these guys, but I'm starting to feel it too). Hence, we are thinking about using either Drone or fleet to deploy and manage containers across the cluster.

We will probably be using Ubuntu 12.04 LTS for the containers (long-term support and stable code). Trying something like CoreOS for the host OS seems compelling because it is native to etcd (awesome) and fleet.

We’ll see how it goes. This week is going to be about strengthening our continuous delivery story and getting more numbers from the stack in different configurations.

A few more highlights from the previous week:

  • We switched from Campfire to Slack (it is used for persistent chats within the team). The native client is awesome and works so much better than Campfire or Skype group chats
  • wrk is an awesome tool for doing load testing while measuring throughput and latencies

March 19 - Benchmarking and tuning the stack

I focused on testing our current stack, understanding how it behaves under the load and trying to improve it. We are currently running everything in a cloud environment with VMWare virtualization, setting everything up from scratch at the beginning of the day and tearing everything down at the end of the day. This helps to focus on automation from the very start.

Our testing setup is quite simple at the moment:

  • Benchmark Box (2 cores with 4GB RAM) – we run weighttp and wrk load tests from this one.
  • Proxy and Application Boxes (8 cores with 4 GB RAM) – proxy box hosts terminator and web aggregator services, while app box hosts specialized services.
  • FoundationDB Box (2 cores with 5GB RAM) – a single FoundationDB node

Each of the boxes is by default configured with:

  • Ubuntu 12 LTS and upgraded to the latest kernel (docker and FDB need that);
  • Docker is installed (with override to let us manage lifecycle of images);
  • ETCD container is pulled from our repository and installed, using new discovery token for the cluster;
  • Logsd and statsd containers (our logging/statistics daemons) are downloaded and installed on proxy and app.
  • Appropriate services are downloaded and installed on the proxy and app boxes as containers (our build script creates containers of our code, pushing them to a private docker repository)
  • FoundationDB is installed on fdb box.

All services and containers are wired into Ubuntu upstart (the equivalent of Windows services management). Whenever a service starts, it interacts with the ETCD cluster to publish its own endpoints or to fetch the endpoints it needs.
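
As a rough illustration (not our actual scripts), a service could announce its endpoint through the etcd v2 HTTP keys API with a TTL, re-PUTting the key periodically so the entry disappears if the process dies; the key path, port and TTL below are assumptions.

    package main

    import (
        "fmt"
        "net/http"
        "net/url"
        "strings"
    )

    // register publishes service -> endpoint under /v2/keys/services with a
    // TTL; the caller is expected to refresh it well before ttlSeconds expire.
    func register(etcdBase, service, endpoint string, ttlSeconds int) error {
        form := url.Values{
            "value": {endpoint},
            "ttl":   {fmt.Sprintf("%d", ttlSeconds)},
        }
        req, err := http.NewRequest("PUT",
            etcdBase+"/v2/keys/services/"+service,
            strings.NewReader(form.Encode()))
        if err != nil {
            return err
        }
        req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode >= 300 {
            return fmt.Errorf("etcd returned %s", resp.Status)
        }
        return nil
    }

    func main() {
        err := register("http://127.0.0.1:4001", "terminator", "10.0.0.5:8080", 30)
        fmt.Println("registered:", err)
    }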

So for the last week I was polishing these install scripts (refactoring BASH is a fun exercise, actually) and also performing some tuning and optimization of that code.

Currently we are using plain bash scripts to set up our environment. However, bash scripts are imperative: they spell out exactly what you want to do, step by step. Trying out more declarative tools (ansible, puppet, chef or something like that) might be beneficial for us in the longer term.

We have the following baseline scenario right now:

  1. We run the weighttp load-testing tool on bench with keep-alive, 256 concurrent clients, 2 threads (1 per core) and enough requests to keep everything busy for 10 minutes;
  2. Each HTTP request goes to the terminator service on the proxy box. Terminator, running the basic golang HTTP server, handles each request in a new goroutine. It simply serializes the request into a message and pushes it to nanobus (our own thin wrapper library around nanomsg for golang). This creates an HTTP context, which consists of a single go channel. The goroutine then sleeps, waiting for the response to arrive on that channel (or for a timeout).
  3. Nanobus adds a correlationId to the message and publishes it to a TCP endpoint via the BUS protocol of nanomsg. Semantically this message is an event, announcing that an HTTP request has arrived.
  4. Any subscribed service can get this message and choose to handle it. In our case there is currently a web aggregator service running in a different container and showing interest in these messages. Nanobus in web grabs the message and dispatches it to the associated method (while stripping the correlationId).
  5. This method will normally deserialize the request and do something with it. Currently we simply call another downstream service through a nanobus using the same approach. That downstream service is located on another box (for a change) and actually calls the FoundationDB node to retrieve a stored value.
  6. When the web service is done with the processing, it publishes a response message back to the BUS socket of terminator. Nanobus makes sure that the proper correlationId is associated with that message.
  7. Nanobus in the terminator service grabs all incoming messages on the BUS socket and matches them against currently outstanding requests via correlationId. If a match is found, the response body is dispatched into the associated go channel.
  8. The HTTP handler method in terminator is brought back to life by the incoming message on its go channel. It writes the contents back to the HTTP connection and completes the request. In case of a timeout we simply write back Invalid Server Operation. (A minimal sketch of this correlation mechanism follows the list.)
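
Here is a minimal Go sketch of the correlation mechanism in steps 2, 7 and 8; the names are hypothetical and the real nanomsg transport is abstracted away behind a publish function. Each in-flight HTTP request parks on its own channel, and incoming bus responses are matched back to it by correlation ID.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    type pending struct {
        mu  sync.Mutex
        out map[string]chan []byte // correlationId -> waiting request handler
    }

    func (p *pending) add(id string) chan []byte {
        ch := make(chan []byte, 1)
        p.mu.Lock()
        p.out[id] = ch
        p.mu.Unlock()
        return ch
    }

    // resolve is what the BUS reader loop calls for every incoming response.
    func (p *pending) resolve(id string, body []byte) {
        p.mu.Lock()
        ch, ok := p.out[id]
        delete(p.out, id)
        p.mu.Unlock()
        if ok {
            ch <- body
        }
    }

    // handle mimics an HTTP handler goroutine in terminator: publish the
    // request on the bus, then block until the matching response or a timeout.
    func handle(p *pending, publish func(id string, req []byte), id string, req []byte) ([]byte, error) {
        ch := p.add(id)
        publish(id, req)
        select {
        case body := <-ch:
            return body, nil
        case <-time.After(2 * time.Second):
            return nil, fmt.Errorf("timed out waiting for response %s", id)
        }
    }

    func main() {
        p := &pending{out: make(map[string]chan []byte)}
        // Fake "bus": echoes the request back asynchronously, standing in for
        // the web aggregator service on the other side of the transport.
        publish := func(id string, req []byte) {
            go func() { p.resolve(id, append([]byte("echo: "), req...)) }()
        }
        body, err := handle(p, publish, "req-1", []byte("GET /profile/42"))
        fmt.Println(string(body), err)
    }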

When I started benchmarking and optimizing this stack we had the following numbers (as reported by our statsD daemon):

  • 12.5k http requests per second handled;
  • 99th percentile of latency: ~18ms (99% of requests take less than 18 ms, as measured from the terminator);
  • CPU load on the proxy box: 9 (1 min average as reported by htop).

Here are some improvements (resulting from a few successful experiments out of dozens of failed ones):

  • Replacing BSON serialization/deserialization in nanobus with simple byte manipulation: +1k requests per second, –1ms in latency (99th), CPU load is reduced by 1;
  • Switching to new libcontainer execution driver in docker: +0.5k requests per second, –0.5ms in latency (99th), CPU load reduced by 0.5;
  • Removing an extra byte buffer allocation in nanobus (halving the number of memory allocations per nanobus message sent): +1k requests per second, –1ms in latency (99th), CPU load reduced by 1;
  • Tweaking our statistics capturing library to avoid doing string concatenation in cases where sample is discarded afterwards: +1.5k requests per second, –1ms latency (99th).

Hence, the final results are:

  • 18k http requests per second;
  • ~12.5ms latency (99th percentile).

Our next steps would be to add more realistic load to this stack (like dealing with profiles, news feeds and messages), while watching the numbers go down and trying to bring them back up.

April 07 - Change of plans

Monday came with the change of plans in our team. Tomas and Pieter realized that although our planned architecture looks really awesome (with all that messaging and dynamic component switching) it is too futuristic for our current goals. We want to migrate out of .NET+SQL, for a start. We also want to learn more about our domain before investing a lot of time to optimize it to perfection.

We archived our existing prototype code and switched gears to:

  • Single git repository for the entire project
  • Single process application with in-memory components
  • Event-driven domain

Our short-term goal is to capture HappyPancake domain in golang in the simplest possible way. Then we’ll improve design from there.

I spent part of the week working on our EventStore (which is just a layer on top of FoundationDB). After benchmarking it we encountered a very foundational problem: it is hard to append sequentially to a global event stream (transaction log) which is replicated across a cluster. You either get horrible throughput or you need to sacrifice consistency, which affects reading speed. Another alternative is to find something else to sacrifice in a way that has the least possible effect on an event-sourced application.

This was an interesting tech challenge, neatly isolated and spinning in our minds. We agreed to take some time to think about it before moving forward with the implementation. Today is the day we share our ideas.

I also spent some time drafting a simple prototype of basic HappyPancake functionality decomposed into a bunch of event-driven components. It was an extremely rewarding experience to see concepts from C# being expressed in go language.

This weekend I went to Chelyabinsk to deliver a talk on software design (masked under the title of “micro-services in .NET”) at dotnetconf.

Tomas was mostly dealing with the UI and UX, while sharing some papers on algorithms and maintaining the first version of HappyPancake (something we have been spared from).

Pieter was reevaluating golang web frameworks while also trying to make them work with PJAX for fast updates in the web UI.

April 14 - Back to the basics

By the beginning of the last week I ported infrastructure for event-driven design (with aggregates, projections and in-memory views) from C# to golang.

However, later on it was agreed that going through event-driven modeling is not yet the fastest and simplest route to working code. So this code was completely discarded (we can get back to it later) and we started evaluating something even simpler – a CRUD approach with CouchDB and MySQL.

FoundationDB does not have any projection or querying capabilities at the moment. This means additional effort would be required to design and maintain those ourselves, which might be a premature optimization at this point.

While thinking about storage constraints in our design, I've been focusing on messaging functionality for HappyPancake. Currently we have 150000 messages passing through per day, with text sizes up to 2000 characters (served by a large MS SQL database). 20000 users are usually chatting at the same time.

Ideally, next version would make this experience more enjoyable and interactive. More messages sent == better for the business.

I focused on prototyping a simple chat, where messages and presence notifications are served to the client with long polling http requests. CouchDB and mySQL were evaluated as storage engines at this point.

Pieter, at the same time, focused on the storage problem from the perspective of profiles: storing and updating them, and serving them over http as documents and search feeds. We discovered that our favorite Go http library, Revel, can barely serve 4k requests per second due to all the magic it provides (including templates). The bare golang http server can serve up to 17k uncached requests per second (to resources with templates) on the same machine. So there are some trade-offs to be made.

I personally think we could stick to the basic http library just fine, since Tomas is pushing extra effort to make our UX and UI extremely simple.

CouchDB is a venerable document database that has nice master-master replication, support for map-reduce and a query engine. It is even used to support some experiments at the Large Hadron Collider. To make things even nicer, CouchDB exposes change streams per database, to which you can subscribe from client code. The API is served over HTTP, while the core is written in Erlang.

Unfortunately, CouchDB didn't fit our simple CRUD scenario well. The reason: CouchDB is IO bound, and all caching is delegated to the operating system.

mySQL was, surprisingly enough, another contender for our storage engine. It had previously felt to me like a legacy database from the early days of the Internet. However, after starting to read "High Performance MySQL", I quickly realized that this is exactly its strongest point. This database has been optimized and polished by the biggest internet companies in the world. It is rock-solid for both SQL and NoSQL use. Performance is predictable and the tooling is rich.

Yet, mySQL can’t do miracles if your IO operations are limited by the virtual environments. We can have no more than ~400 operations per second on Glesys machines.

So all through the weekend I've been searching for articles on clustered messaging architectures at large scale, trying to figure out the simplest approach that would fit these constraints:

  • Provide fast and responsive messaging UX implementation of which is capable of serving ~20000 new long polling requests per second;
  • Have clustering capabilities (multiple app servers handling the load);
  • Work with a relatively slow storage engine, using no more than 10-20 requests per second.

Fortunately for us, we can live with:

  • Occasional write failures are tolerable
  • Cached data is fine in a lot of cases
  • Systems are not mission critical

If you think about it (and sleep on it for a few nights, too), these design "relaxations" let us deal with our domain with quite some ease: we can keep messages and presence notifications simply in memory (replicated for some degree of reliability), going to disk only for batched writes and cache misses (usually when loading conversations that happened quite a while ago). The amount of memory dedicated to the message cache can be tuned to find the sweet spot.
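
A minimal sketch of that "memory first" idea (hypothetical names, not production code): each conversation keeps a bounded in-memory tail of recent messages, and only requests that reach past it fall through to the storage engine; durable persistence happens separately via batched writes.

    package main

    import "fmt"

    type Message struct{ From, Text string }

    const keepInMemory = 200 // tune this to find the memory/IO sweet spot

    type Conversation struct {
        recent []Message                 // bounded in-memory tail of the conversation
        older  func(count int) []Message // loads archived history from storage
    }

    func (c *Conversation) Append(m Message) {
        c.recent = append(c.recent, m)
        if len(c.recent) > keepInMemory {
            // Older messages are now only reachable via storage; writing them
            // to disk happens separately, in batches.
            c.recent = c.recent[len(c.recent)-keepInMemory:]
        }
    }

    // History returns the last n messages, touching storage only when the
    // in-memory tail is not enough (a cache miss on an old conversation).
    func (c *Conversation) History(n int) []Message {
        if n <= len(c.recent) {
            return c.recent[len(c.recent)-n:]
        }
        missing := n - len(c.recent)
        return append(c.older(missing), c.recent...)
    }

    func main() {
        conv := &Conversation{older: func(int) []Message { return nil }}
        conv.Append(Message{From: "alice", Text: "hej"})
        fmt.Println(conv.History(10))
    }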

So, at this point, we don’t really care about the choice of the storage engine for the purposes of messaging, presence and notifications: CouchDB, mySQL or FoundationDB. Each one would work just fine. However, I would personally prefer mySQL at this point, since it is easier to capture the domain.

Some reading

Besides that, I started reading “Programming Distributed Computing Systems” by Carlos A. Varela, which is a very foundational and intense book. Highly recommended.

April 21 - Head of a social site - messaging

At the beginning of the last week I had a simple responsive prototype of a chat server. It was a simple in-memory implementation in go, delivering messages and “user is typing” updates instantly over long polling http requests.

Obviously, a single server chat application wouldn’t be enough for HappyPancake.com. We want to serve live updates to 20000 online visitors (numbers from the current version of the web site), while also adding in some headroom for the scalability and fault tolerance.

So the last week was dedicated to search, trials and failures on the way to multi-node clustered chat server.

I started by reading a lot about existing chat designs and approaches outside of golang. Erlang and Akka were a recurring theme here. So I tried to move forward by implementing something like akka's actor model (the cluster-singleton actor pattern) in golang while using the Leader Election module of ETCD.

What is ETCD? ETCD is a highly available key-value store for shared configuration and service discovery. It is written in Go and uses the RAFT algorithm (a simpler relative of PAXOS) for maintaining consensus across the cluster.

That was a dead-end:

  • Re-implementing akka in golang is a huge effort (too many moving parts that are easy to get wrong)
  • Leader Election module in ETCD is theoretically nice. Yet, in practice it is considered as experimental by CoreOS team. Besides, go-etcd client library does not support it, yet.

At some point we even pondered whether switching to akka was a viable strategy. However, the NSQ messaging platform (along with the other projects from bitly) served as an inspiration for getting things done in golang. A few more days of consuming information on the design and evolution of social networks and I had an extremely simple working design for a multi-node chat.

There were two small breakthroughs along the way:

  • You can use the basic functionality of ETCD keys (with TTLs and CompareAndSwap) to implement entity ownership across the cluster
  • We don't really need the concept of actors to implement a scalable cluster. The dead-simple semantics of HTTP redirects do the job just fine.

All chat conversations are associated with one of N (where N is an arbitrary number) virtual chat groups through a consistent hash of the conversation ID. A chat group can be owned by a node (as indicated by a renewed lease on an ETCD key) and be available in that node's memory. All other nodes know that because of the ETCD registry and will redirect requests to that node.

Alternatively, a chat group can be owned by nobody (in the case of a cold cache, or if the owning node is down). Then a random node (a smarter algorithm could be plugged in later) takes ownership of that chat group.

Why bother with the concept of chat groups? Querying ownership of 100000 chat conversations can be pretty expensive; besides, we would need to send heartbeats for each of these conversations. It is easier to allocate N chat groups, where N is a fixed number (it can be changed later, though).
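
Here is a minimal sketch of that mapping and the ownership check; the names are hypothetical and the ownership lookup is abstracted behind an interface, where the real system relies on ETCD keys with TTLs and CompareAndSwap.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    const chatGroups = 64 // N is fixed up front; it can be changed later

    // groupOf maps a conversation ID onto one of the virtual chat groups.
    func groupOf(conversationID string) int {
        h := fnv.New32a()
        h.Write([]byte(conversationID))
        return int(h.Sum32() % chatGroups)
    }

    // Ownership abstracts the part backed by ETCD in the real system: a group
    // is owned while its lease key keeps being refreshed before the TTL expires.
    type Ownership interface {
        OwnerOf(group int) (node string, ok bool) // read the lease key
        TryClaim(group int, node string) bool     // e.g. CompareAndSwap + TTL
    }

    // route decides how a node should handle a request for a conversation:
    // serve it locally, redirect to the current owner, or claim the group.
    func route(own Ownership, self, conversationID string) string {
        g := groupOf(conversationID)
        if node, ok := own.OwnerOf(g); ok {
            if node == self {
                return "serve locally"
            }
            return "HTTP redirect to " + node
        }
        if own.TryClaim(g, self) {
            return "claim group, load history, serve locally"
        }
        return "retry: another node just claimed the group"
    }

    func main() {
        fmt.Println("conversation-12345 maps to group", groupOf("conversation-12345"))
    }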

The result of all that: a dead-simple chat prototype that runs on anywhere from one to K nodes, where the nodes are extremely simple, can discover each other dynamically and share the load. If a node dies, another one takes ownership of its chat conversations.

All chats are reactive. “user is typing” notifications and actual messages are immediately pushed to the UX over http long-polling connections. More types of events will be pushed over these channels later.

Part of that simplicity comes from the fact that golang simplifies working with concurrency and message passing. For example, the snippet below flushes all incoming messages to disk in batches of 100. If there have been no messages for a second, it also flushes whatever has been captured.

    var buffer []*Record
    for {
        select {
        case r := <-spill:
            buffer = append(buffer, r)
            if len(buffer) >= 100 {
                // a full batch of 100 messages: flush it to disk
                persistMessages(buffer)
                buffer = make([]*Record, 0)
            }
        case <-time.After(time.Second):
            if len(buffer) > 0 {
                // no new messages for a second: flush whatever we have
                persistMessages(buffer)
                buffer = make([]*Record, 0)
            }
        }
    }

For this week I plan to move forward:

  • Finish implementing a proper node failover (currently nodes don’t load chat history from FoundationDB)
  • Make nodes inter-connected with each other (we need to publish notifications to a user in real time if he gets a message, flirt or a visit from another user). NSQ (a real-time distributed messaging platform in golang by bitly) seems like a really nice fit here.

During the week I also did some benchmarking of ID generation algorithms in golang for our event store layer on FoundationDB (not a big difference, actually). Here is the speed of appends (1 event per transaction, 200 bytes per event, the ES running on 1 node at Glesys with 5GB RAM, 2 cores and VMWare virtualization; client – a 4-core VM with 8GB RAM):

    10 goroutines:  1k per second, 10ms latency (99th)
    50 goroutines:  3.5k per second, 12ms latency (99th)
    100 goroutines: 5k per second, 20ms latency (99th)
    250 goroutines: 7k per second, 35ms latency (99th)

Meanwhile, Pieter was working on profiles, news feeds and registration flows. He was stress-testing different database engines by uploading data from the existing user base of happypancake.com. There is a lot to learn about the behavior of different technologies in our scenarios. In the end we seem to be converging back on FoundationDB as the primary storage. Tomas was mostly busy with admin, UI design and maintenance of the first version (protecting Pieter and me from the boring stuff).

Here is some reading of the week:

April 28 - Event-driven week

The last week started as planned. First, I implemented persistence for the simple chat service, then moved forward with a multi-master design for the application nodes. In this design each application node can handle any request the same way. This approach:

  • simplifies the design;
  • does not prevent us from partitioning work between nodes later (e.g.: based on a consistent hashing, keep user-X on nodes 4,5 and 6);
  • forces to think about communication between the nodes.

The most interesting part was the UX flows. For example, on a newsfeed page we want to:

  1. Figure out the current version of the newsfeed for the user (say v256)
  2. Load X newsfeed records from the past, up to v256.
  3. Subscribe to the real-time updates feed for all new items starting from v256 (a minimal sketch of this flow follows the list)
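
A minimal sketch of that flow with hypothetical names (the store behind it is abstracted away): read the current version first, then load history up to it, then subscribe from that same version, so nothing is missed or duplicated between the two reads.

    package main

    import "fmt"

    type Entry struct {
        Version int
        HTML    string
    }

    // FeedStore abstracts whatever backs the feed (FoundationDB in our case).
    type FeedStore interface {
        CurrentVersion(user string) int
        LoadUpTo(user string, version, limit int) []Entry // history, newest first
        Subscribe(user string, fromVersion int) <-chan Entry
    }

    func renderNewsfeed(store FeedStore, user string) {
        v := store.CurrentVersion(user)        // 1. e.g. v256
        history := store.LoadUpTo(user, v, 50) // 2. last records up to v256
        for _, e := range history {
            fmt.Println("history:", e.HTML)
        }
        for e := range store.Subscribe(user, v) { // 3. live items after v256
            fmt.Println("live:", e.HTML)
        }
    }

    func main() {
        _ = renderNewsfeed // wiring a concrete FeedStore is out of scope here
    }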

There is an additional caveat. While loading history of activities, we merely display them on the screen (with the capability of going back). However, activities that come in real-time need more complicated dispatch:

  • Incoming messages need to pop-up as notifications and update unread message count in the UI.
  • Incoming flirts and profile visits have to go directly into the newsfeed.

Modeling these behaviors led to some deeper insights into the domain. At the beginning of the week I wasn't even able to articulate them properly :]

Caveats of Event Sourcing

Tracking version numbers in a reliable way was also a big challenge initially. The problem originates in the fact that our events are generated on multiple nodes. We don't have a single source of truth in our application, since achieving that would require either consensus in the cluster or a single master to establish a strict order of events (like Greg's EventStore does, for example). Both approaches are quite expensive at high throughput, since you can't beat the laws of physics (unless you cheat with atomic clocks, like Google Spanner).

Initially, I implemented a simple equivalent of vector clocks for tracking version numbers of a state (to reliably compare state versions in cases where different nodes receive events in a different order). However, after a discussion with Tomas we agreed to switch to simple timestamps, which sacrifice precision for simplicity. We are ok with losing 1 message out of 10000 in the newsfeed, as long as it always shows up in the chat window in the end.

NSQ

For communication tech I picked the NSQ messaging platform, since it already has a lot of tooling that boosts productivity. NSQ is used only as glorified BUS sockets with buffering and a nice UI. Hence, if Tomas later manages to push us towards nanomsg, we could switch with relative ease.

A nice benefit of using something like nanomsg with ETCD or NSQ is that this system does not have a single point of failure. All communications are peer-to-peer. This increases reliability of the overall system and eliminates some bottlenecks.

Micro-services

Our understanding of micro-services keeps evolving in a predictable direction. We have outgrown approaches like "event-sourcing in every component" and "CRUD CQRS everywhere" in favour of a more fine-grained and balanced point of view. A component can do whatever it wants with its storage, as long as it publishes events out and keeps its privates hidden.

Even in a real-time domain (where everything screams "reactive" and "event-driven"), there are certain benefits to implementing some components in a simple CRUD fashion. This is especially true in cases where you can use a scalable multi-master database as your storage backend.

Pieter was working on exactly that CRUD/CQRS part of our design, modeling basic interactions (registration, login, profile editing and newsfeed) on top of FoundationDB. This also involved getting used to the existing HPC data, different approaches in FoundationDB and Go web frameworks.

Tomas was mostly busy with the admin work, supporting our R&D and gaining more insights about existing version of HPC (with the purpose of simplifying or removing features that are not helpful or aren’t used at all).

Plans

This week is going to be a bit shorter for me - we have May 1st and 2nd as holidays in Russia. Still, I will try to finish modeling event-driven interactions for the newsfeed and chat. This will involve the UX side (I still haven't fit transient events like the user-is-typing notification into the last prototype) plus implementing a decent event persistence strategy. The latter will probably involve further tweaking of our event storage layer for FoundationDB, since I haven't yet addressed the scenario where the same event can be appended to the event storage from multiple machines. We want to save events in batches, while avoiding any conflicts caused by appending the same event in different transactions.

May 05 - Reactive prototype

Last week, as planned, was quite short but very interesting.

Development of the reactive prototype at some point hit a complexity level where a dead-simple hacky approach could no longer work. Although the Go language (with its simple flavor of behavior composition) let me get pretty far down that route, in order to move forward I had to bite the bullet and refactor things from a big ball of mud into a collection of components.

That’s when I realized that I already enjoy coding in golang as much as I enjoyed working with C# in Visual Studio with ReSharper after 8 years of practice in it.

After that refactoring I was able to move forward with domain exploration (in the case of HappyPancake, the domain includes both the social networking itself and the technical peculiarities of developing a reactive application at social-network scale).

One of the interesting aspects of the development was the interplay between:

  • reactive nature of this prototype (new notifications are rendered on the server and pushed to the client through http polling feed);
  • different ways of handling the same event from different contexts and screens (e.g.: a chat message would be appended to the conversation in a chat screen but it will show up as a notification box in another screen);
  • different ways of persisting and delivering information to the users (e.g.: chat history is loaded via range read from FoundationDB, while all updates to this history are pushed to the client through the notification feed);
  • focus on reducing CPU and latency for the mobile devices (e.g. last 75 messages in a chat come pre-rendered in the page HTML on the first page request, while new messages are pushed incrementally by appending server-generated HTML to the DOM);
  • our desire to have graceful degradation of the user experience for some of the older mobile platforms (users could still get some chat experience even if javascript does not work at all).

At this point, I think, we have a pretty good understanding of the domain around messaging and notification feeds. We have:

  • a bunch of implementations and use cases captured in the tangible and working code;
  • strategy for scaling the performance in a variety of scenarios (with known price to pay for that in terms of complexity or consistency);
  • some understanding of how we would deal with devops later on.

Meanwhile, Pieter was working on the other half of HappyPancake – understanding and developing interactions around document based flows in the social network – registration, logins, profile editing and reviewing. All with PJAX and basic http handlers (we discarded Revel, since it does too much CPU-intensive magic).

Tomas, as usual, focused on backing up our development. He took care of v1 maintenance and campaigns, and also invested time in capturing use cases for us to move forward with.

It was extremely interesting to sync up with Tomas and Pieter occasionally, sharing concerns and discoveries along the road. It felt like getting an instantaneous deeper insight into the problem we are trying to tackle here.

Another really awesome part of the last week was the gradual transition from purely technical use cases (consistency, availability and latency issues) to practical use cases that matter to our users (flirts, messages, visits etc). Although technology is an important part of HappyPancake, the users are the domain that we are ultimately trying to understand and master.

The upcoming week will be a bit longer than the previous for me, but still only 4 days (May 9th is another holiday).

We plan to start merging my prototype into Pieter's prototype, while moving forward and adding more use cases. I hope to also move forward with newsfeeds. They require a balance between consistency and availability that is different from notifications and chat messages (more like the Instagram news feeds).

May 12 - Tactical DDD

I started merging bits of my reactive prototype into the document-driven prototype of HappyPancake that Pieter was working on. While at that, we spent a lot of time discussing the design and iterating over it.

It was really cool to see how the structure of the solution shifted focus from technical model to functional model. Previously our golang packages (which roughly map to lightweight .NET projects) contained files grouped by their technical intent (e.g.: controllers, models, documents). This added friction to development:

  • a lot of context switching was required in order to work on a single use case, touching multiple packages;
  • the solution structure enforced a certain architectural style upon the codebase (when you have folders like models, controllers, views and documents, you naturally try to fit your implementation into them);
  • merge conflicts were unavoidable, since too much code was shared.

Over the course of the week, we switched to a different design, aligning packages with use cases. You might consider this to be tactical domain-driven design (we didn't touch any of the strategic parts like Bounded Contexts or Ubiquitous Language, since our core domain is extremely simple).

Golang packages are now tightly aligned with our use cases. They either implement cases directly (e.g. by exposing http handlers to render the UI and process POST requests from the browser) or they help other packages fulfill their role by providing supporting functionality or structures (e.g. authentication utils, http helper methods, core value objects).

Of course, the road wasn't all roses and pretty ladies – you can't just split a codebase between a bunch of folders and hope that it will all work and make sense. It is never that easy.

We had a lot of discussions like:

  • How do we decompose functionality into multiple packages which will work together to implement these use cases?
  • This code does not make any sense, what are we doing wrong?
  • How do we name this thingy?
  • What is the simplest approach to implement these use cases?
  • How can we work together on this functionality?

I really enjoyed every minute of these discussions with Pieter; they were focused on the problem domain instead of fiddling with artificial architectural constraints imposed by the overall design. Besides, so far we have been able to resolve these questions and tread the thin line between an over-engineered monolith and a messy big ball of mud.

We are not sure if we'll be able to walk this path later, yet so far each step has led to a deeper insight into the domain of HappyPancake (just like domain-driven design promises). There are a few really cool things about our current design:

  • it is extremely easy to collaborate on the code: there are almost no merge conflicts;
  • we are free to experiment with implementation styles within packages without turning the solution into a mess;
  • golang is designed to support small and focused packages, and this shows up frequently as yet another tiny and deeply satisfying moment.

The most important part is: our code is a reflection of domain knowledge captured in a tangible form. The codebase is structured around that knowledge and not vice versa.

In the meanwhile, Tomas was busy with administrative work and HPC1. Towards the end of the week he also got a chance to start working on the HTML design of HPC2 in stealth mode. Pieter and I are both really anxious to see what comes out of this work.

Also on Friday we were interviewed by a couple of students on the topic of CQRS. I'd say that our joint statement was something like "CQRS is a new name for denormalization, with a little recollection of what happened before 2010".

May 24 - Emergent design faces reality

The last two weeks were packed. We are working hard to have a limited prototype of the application ready and available for a demo in June. So far things look really good for the schedule!

Collaborative design process

Pieter and I chat frequently, discussing things small and big: from component design to the naming of some variable or just a weird gut feeling about some code.

I found out that disagreements with Pieter are especially productive and exciting. I'm really glad that he has the patience to put up with my stupid questions.

Here is one example.

A few days ago Pieter started working on profile functionality and began introducing PhotoUrl fields there. That immediately gave me the shivers, since I considered it a misleading design. The profile service was responsible for managing and providing published user information like gender, birthday or name. Photo urls have got to be a different concern! Bleeding them into the component responsible for creating and providing profile info felt like an over-complication, compared to the other components (which are clean and focused).

I tried to explain these reasons to Pieter, but that didn't get us far. He replied that it was ok to denormalize and mock some data within the profile service, since it would help him get the profile viewing page done faster. In response I suggested creating mock stubs for photo urls in a dedicated photo component.

This went on for a while. Looking at the code together through ScreenHero didn't help much either.

Progress started only when we began talking about things in terms of "this gives me the shivers", while trying to understand why each of us saw things differently.

As it turned out, we had different perspectives on the decomposition of the components. I had in mind a purely vertical responsibility for the profile component, where it would have all layers of an N-layered app along with full responsibilities: creating data, persisting it locally, publishing events, providing an HTTP handler for the UX. All that while focusing on a small and coherent set of behaviors around public user profiles.

At the same time Pieter was working with the UX. He was interested in a design decomposition which would give him a component focused only on maintaining a cache of all user-related information, for the purpose of serving profile pages and providing that information to other components. That component would have a lot of data, but it would not contain any complex business rules – mostly event subscriptions and denormalized read models.

Seeing this difference was a huge step. I also needed that component (e.g. when you have a news feed and need to enrich its entries with beautiful profile photos along with the name, gender and age of each user). However, since I wasn't aware of such a distinction in our domain, I had actually been misusing a bunch of components for this purpose.

While fleshing out the boundaries and contracts of this new profile component, we also touched on its interactions with future components which are not even in the current code (e.g. review and draft). We talked about naming, responsibilities, contracts – everything except the implementation (which would be trivial at that point). We even made things explicit like:

Ok, so we don't have draft and review components in our codebase this week; however, we will pretend that they exist by manually publishing events from them in our 'prepopulateDB' script. Since the other components subscribe to these events, they will not even notice any change when we introduce the actual implementations. And since we model events from the perspective of the consumers, they will be useful.

A better and clearer design emerged through this process, things clicking into place like pieces of a puzzle.

I find this process truly astonishing: you use the codebase to drive exploration of the domain and also to capture the deeper insight obtained during that process. The emerging design is a beautiful side-effect of that process.

Design constraints

Such a process would not be possible without design constraints, which fuel and direct creativity. Here are a few that are important in our case:

  • Distributed development team of three people, working remotely on the same codebase in a single github repository;
  • the mentality of golang, which forces us to think in terms of tiny packages with short and clear names;
  • the requirement to have a demo version in June and a working Beta in September;
  • a shared belief in the power of simplicity;
  • high performance and scalability requirements, which we *must not* optimize for right now (since that would put us behind schedule for the June demo).

Optimize for future performance

I find it particularly interesting to optimize the design for future performance work, while consciously writing code that is designed for short-term evolvability (and hence is hacky and slow). This forces you to think about isolating this hacky code, preparing it for future replacement and possible optimization strategies.

It is almost as if that non-existent better code were written between the lines and continuously evolved every time you touch the component or think about it. It is impossible to forget about it, since the actual code is so inefficient, just like a caterpillar.

After a few iterations you end up with a component that is designed:

  • to have high evolvability in the short term;
  • to be optimized in the longer term, with a bunch of strategies available (from a denormalized read model up to an in-memory cache across all nodes in the cluster, invalidated by messages).

Making it all real

All of this process is not only fun, it is also tightly tied to the real world. Tomas makes sure of that. First of all, he acts as the domain expert and the stakeholder in the product, setting constraints and priorities and sharing insight. He also works on the vision of the product from the user perspective, capturing concepts in the tangible form of HTML templates, which we have started merging into the codebase.

These HTML templates started showing up a few days ago. They made Pieter and me feel as if New Year had come early this year:

  • it is awesome to see a real product instead of a hacky UI;
  • the UX easily communicates important requirements that could be missed otherwise (e.g. the "gender" symbol and "is online" highlight for every author in a newsfeed entry).

In the end

We keep saying: "let's see how long our approach will hold before it becomes a problem", however so far it holds up pretty well. Architecture, technology and other irrelevant implementation details have changed more than once during this period (e.g. during the last weeks we switched from FDB to CRUD with shared transactions, and then to event-driven CRUD, though without event-sourcing). The design still supports the growth of understanding and of the product through these minor perturbations.

June 01 - MVP Features

This week we were pushing forward the major features missing from the Minimum Viable Product for the demo in June. The progress was quite good, even ahead of schedule. I attribute that to the design we came up with for the project.

Pieter focused on introducing infinite scrolling to our feeds: alerts, news and diary. These feeds are provided by separate golang packages of the same names. They don't own any data, but rather project incoming events into local storage (mySQL tables used as key-value storage with some indexing) and expose an HTTP handler to render all the HTML for browsing these feeds (a minimal sketch of such a projection package follows).
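
As an illustration of the shape of such a package (the names and schema below are made up, not the actual code): it consumes events, writes them into a plain table, and renders the feed HTML through an HTTP handler.

    package main

    import (
        "database/sql"
        "fmt"
        "net/http"
    )

    type FeedEvent struct {
        User string
        HTML string // pre-rendered feed entry fragment
    }

    type NewsFeed struct{ db *sql.DB }

    // Project appends an incoming event into the local feed table; the table
    // is just (user, seq, html), i.e. key-value storage with an ordering index.
    func (f *NewsFeed) Project(e FeedEvent) error {
        _, err := f.db.Exec(
            "INSERT INTO news_feed (user, html) VALUES (?, ?)", e.User, e.HTML)
        return err
    }

    // ServeHTTP renders the latest entries for the ?user= query parameter.
    func (f *NewsFeed) ServeHTTP(w http.ResponseWriter, r *http.Request) {
        rows, err := f.db.Query(
            "SELECT html FROM news_feed WHERE user = ? ORDER BY seq DESC LIMIT 50",
            r.URL.Query().Get("user"))
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        defer rows.Close()
        for rows.Next() {
            var html string
            if err := rows.Scan(&html); err == nil {
                fmt.Fprintln(w, html)
            }
        }
    }

    func main() {
        // Wiring the *sql.DB (e.g. a mySQL driver) and the event subscription
        // is left out; the real packages subscribe to events published by others.
        var feed NewsFeed
        _ = feed
    }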

    When we ran out of the things to do for the MVP, Pieter switched to implementing draft, register and review packages. Previously we assigned future responsibilities to them and established their contracts in form of events that cross boundary of these packages. These events were mocked by populateDB script and consumed by the other packages. This allowed to refine the design multiple times even before we started coding in this version.

    Tomas continued acting as Santa this week, working hard on new HTML templates for the project, while also refining some of the old ones. These templates feature a responsive UI, making them ideal for devices with small screens (half of our visitors use them). Later on we can adjust the HTML to work nicely on the desktop as well.

    It felt really awesome to skin diary, alerts and news with these new templates, along with my favorite chat package. This process also granted additional design insights:

    • we can’t generate the HTML of feed items in advance, since we need to embed things like online status and the current profile photo;
    • while rendering the final HTML for the feeds, the profile service is queried for enrichment information dozens of times per render - I had to implement a simple in-memory cache with cluster-wide eviction of invalid items, driven by events (see the sketch after this list);
    • we could no longer use an application-wide long-polling feed for updating chat pages in real time, since this feed had to contain specific HTML templates and behaviors. The long-polling buffers had to be moved to chat, rewritten and enhanced with events like user-read-message.
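
    To make that cache point more concrete, here is a minimal sketch of what such a cache could look like. The package name, the Entry fields and the eviction trigger are assumptions for illustration, not our actual implementation:

    package profilecache

    import "sync"

    // Entry holds the enrichment data we need while rendering a feed item.
    type Entry struct {
        Nickname string
        PhotoUrl string
        IsOnline bool
    }

    // Cache is a simple in-memory map guarded by a mutex. Every node in the
    // cluster keeps its own copy and evicts entries when it sees an event
    // telling it that the underlying profile changed.
    type Cache struct {
        mu    sync.RWMutex
        items map[string]Entry
    }

    func New() *Cache {
        return &Cache{items: make(map[string]Entry)}
    }

    // Get returns the cached entry and whether it was present.
    func (c *Cache) Get(memberId string) (Entry, bool) {
        c.mu.RLock()
        defer c.mu.RUnlock()
        e, ok := c.items[memberId]
        return e, ok
    }

    // Put stores an entry after it has been loaded from the profile service.
    func (c *Cache) Put(memberId string, e Entry) {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.items[memberId] = e
    }

    // HandleProfileChanged is subscribed to the bus; any event that changes a
    // profile simply evicts the stale entry, so the next render reloads it.
    func (c *Cache) HandleProfileChanged(memberId string) {
        c.mu.Lock()
        defer c.mu.Unlock()
        delete(c.items, memberId)
    }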

    There are still a few missing bits and pieces related to the UI of profile views and chat conversations, but these are going to be easy to implement once we have the HTML templates to fill them in.

    For the upcoming week I will probably be busy implementing the nav package (it serves the navbar HTML, which is reactively highlighted whenever there is new content for the user to consume). Ideally, we’ll also tackle rendering ads into the feeds, since this is the most valuable feature for the business.

    At some point next week we might start enhancing our solution with package-level event-driven tests expressed in the form of specifications. We currently have such tests implicitly (in the form of event streams generated by the populateDB script), however there is value in making them explicit.

    June 09 - Almost demo

    The error of my ways

    We are getting closer to the demo date, which is scheduled for next Monday, and I made a big mistake last week. Instead of thinking for the whole team, I focused only on the design and back-end development.

    It would’ve been better if I had instead tried to take some of the burden off Tomas, who was swamped with other things this week. That way we could’ve avoided over-delivering on the back-end while falling behind schedule on the UI design (which is usually the most visible and tangible part of any product).

    I’ll try to pick up more HTML+CSS and UI design skills in the upcoming days, so that skills are spread more evenly across our small team.

    Features delivered

    During the last week I added a continuous integration server (using Drone on a VM) and introduced a shared staging server which can be used for the demo.

    Drone IO

    That server also has infrastructure for capturing various metrics from our application code, aggregating them and pushing to a nice dashboard UI.

    Librato

    I introduced the nav package, responsible for maintaining the navigation menu at the top of the page. Some items in that menu are highlighted in real time as new things arrive for the user to check out. Newly registered users now have a newsfeed pre-filled with interesting things to look at (as determined by our matching logic). Plus, it is now possible to see photos on the profile pages, like them and send flirts.

    The ability to register is something Pieter delivered this week, along with a draft implementation of the review service. He also came up with a really nice implementation of our online service, responsible for maintaining a list of currently active users across the cluster.

    Retrospective

    At this point, we have a working pre-alpha version with core functionality either implemented or envisioned in detail. We didn’t burden the code with any performance optimizations, keeping it simple and capable of fast evolution.

    Performance optimizations, if introduced to immature software design, could hinder or prevent its growth to maturity.

    Technically, the implementation is extremely simple: a single executable application written in golang with mySQL persistence and an in-memory bus. It exposes an HTTP endpoint serving web pages and can be switched to clustered mode (if put behind a load balancer and plugged into a proper messaging platform).

    From the design perspective, the implementation is more developed: it is decomposed into simple golang packages which are designed to work together to deliver the use cases of a dating web-site. These packages tend to be small - merely ~300-400 lines of code on average, including HTML templates. The majority of these packages are full vertical slices, encapsulating implementation details from the storage model all the way up to the HTML rendering inside the http handlers.

    Concepts within the code map to the domain model rather well. They are quite focused and simple, thanks to hours spent working over them with Pieter.

    However, the vocabulary could benefit from better alignment with the business model. As Tomas mentioned, we managed to drift from the original domain model during development. That is something we can fix after the demo.

    These design concepts are most prominent in the contracts of packages: published events and exposed services. There are quite a few DDD Value Objects as well.

    The design approach still seems to hold up pretty well, although we are getting close to the next strain point: some packages are getting too complicated and would benefit from better tests at the contract level. Something like event-driven given-when-then specifications could work really well here. Adding such tests is something I’m looking forward to after the demo as well.

    June 13 - Our First Demo

    We finally had our demo last week. As always happens in practice, nothing went according to the theory.

    Unexpected problems

    Two big problems surfaced right before the scheduled demo time.

    First of all, the RAID on one of the production databases (HPC1) suddenly died. This required the full attention of Tomas, taking him away from demo preparations.

    Second, I discovered that the JavaScript part of chat (which I implemented) got horribly messed up by subsequent PJAX page jumps. Fortunately, disabling PJAX on chat navigation links solved the problem in the short term. In the longer term, I’ll need to pick up more JavaScript skills. Tomas already recommended that I check out JavaScript: The Good Parts.

    Despite these issues, Pieter and I cleaned up HPC2 for the demo. Tomas did an awesome job presenting the product and the vision behind it, which earned us the trust of the stake-holders for moving forward. They loved it.

    We plan to have demos on a monthly basis from this point.

    NoSQL in SQL

    During the week we decided to give PostgreSQL a try, since it seems to be a slightly better fit for our needs than MySQL:

    • great replication story (e.g. “HotStandby and repmgr”);
    • mature drivers in golang (if compared to MySQL);
    • binary protocol that does not suffer from legacy issues like MySQL API does;
    • more polished usage experience (if compared to MySQL);
    • there is a book on PostgreSQL High Performance, which looks as good as the one I read on MySQL.

    PostgreSQL also benefits from being one of the most widely used databases (although it probably has fewer installs than mySQL).

    Replacing MySQL with PostgreSQL was a simple thing, since we use SQL storage mostly for NoSQL purposes anyway.

    Using SQL for NoSQL gives us the best of both worlds: the mature ecosystem, polished experience and transactions of SQL, along with the ease of schema-less development from NoSQL.
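
    As a rough illustration of the idea, a package-private key-value store on top of PostgreSQL might look like the sketch below. The table name, the columns and the lib/pq driver choice are assumptions, not our actual schema:

    package kv

    import (
        "database/sql"

        _ "github.com/lib/pq" // assumed PostgreSQL driver for the sketch
    )

    // Store treats a single PostgreSQL table as a schema-less key-value store:
    // the key is indexed, the value is an opaque blob owned by the package
    // (typically JSON serialized by the caller).
    type Store struct {
        db *sql.DB
    }

    // Open connects and makes sure the package-private table exists.
    func Open(dsn string) (*Store, error) {
        db, err := sql.Open("postgres", dsn)
        if err != nil {
            return nil, err
        }
        _, err = db.Exec(`CREATE TABLE IF NOT EXISTS feed_items (
            key   text PRIMARY KEY,
            value text NOT NULL
        )`)
        if err != nil {
            return nil, err
        }
        return &Store{db: db}, nil
    }

    // Put replaces the value for a key inside one transaction.
    func (s *Store) Put(key, value string) error {
        tx, err := s.db.Begin()
        if err != nil {
            return err
        }
        defer tx.Rollback()
        if _, err := tx.Exec(`DELETE FROM feed_items WHERE key = $1`, key); err != nil {
            return err
        }
        if _, err := tx.Exec(`INSERT INTO feed_items (key, value) VALUES ($1, $2)`, key, value); err != nil {
            return err
        }
        return tx.Commit()
    }

    // Get returns the value for a key, or "" if it is not present.
    func (s *Store) Get(key string) (string, error) {
        var value string
        err := s.db.QueryRow(`SELECT value FROM feed_items WHERE key = $1`, key).Scan(&value)
        if err == sql.ErrNoRows {
            return "", nil
        }
        return value, err
    }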

    By the end of the week I migrated almost the entire application to PostgreSQL. Design decomposition into small and focused packages (with logically isolated storage) really helped to move forward.

    Next week I plan to finish the migration and improve test coverage in scenarios that were proven to be tricky during this migration.

    So far, PostgreSQL feels more comfortable than MySQL. If this feeling proves to be wrong, we could always jump back or try something else.

    Being the worst on errors and panics

    Sometime during the week, Pieter brought up the question of using panic vs error in our code. In golang it is idiomatic for functions to return a tuple of result and error:

    func Sqrt(f float64) (float64, error) {
        if f < 0 {
            return 0, errors.New("math: square root of negative number")
        }
        // implementation
    }

    You can also issue a panic, which stops the ordinary flow of control and unwinds the call chain until a recover statement is encountered or the program crashes.

    Since I was burned pretty badly by Exceptions in .NET while working with cloud environments at Lokad (everything is subject to transient failure at some point, so you really have to design for failure), I tried to avoid panics in golang altogether. Instead, almost every function returned a tuple of result and error, and problems were explicitly bubbled up.

    This led to a lot of unnecessary error checking and some meaningless errors that were pretty hard to trace (since errors in golang do not carry a stack trace).

    Thankfully Tomas and Pieter patiently explained that it is OK to panic even in scenarios which would later require proper error handling and flow control. Initially this felt like a huge, meaningless hack, but eventually it all “clicked”.

    Refactoring with this new design insight already makes the code simpler and a better fit for future evolution (which is what the current stage in the life-cycle of the project requires).
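
    The gist of that insight, as a hedged sketch: return errors where the caller has a real decision to make, and panic for failures that should abort the whole request, letting a recover at the HTTP layer turn them into a 500. The withRecover helper and the handler below are made up for illustration, not our actual code:

    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    // mustLoadProfile panics on failures the handler cannot reasonably
    // recover from (broken DB connection, corrupted row, etc).
    func mustLoadProfile(id string) string {
        if id == "" {
            panic(fmt.Errorf("profile: empty id"))
        }
        return "profile-" + id // placeholder for a real lookup
    }

    // withRecover converts panics in a handler into a 500 response,
    // so individual handlers don't need error plumbing for fatal cases.
    func withRecover(h http.HandlerFunc) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            defer func() {
                if err := recover(); err != nil {
                    log.Printf("panic while serving %s: %v", r.URL.Path, err)
                    http.Error(w, "internal error", http.StatusInternalServerError)
                }
            }()
            h(w, r)
        }
    }

    func main() {
        http.HandleFunc("/profile", withRecover(func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprint(w, mustLoadProfile(r.URL.Query().Get("id")))
        }))
        log.Fatal(http.ListenAndServe(":8080", nil))
    }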

    Pieter also started cleaning up the language in our codebase, making it more aligned with the reality. This is a big effort involving a lot of merge conflicts, but the results are definitely worth it.

    Becoming a better developer through your IDE

    During the last weeks I invested bits of time in learning about Emacs and customizing it to my needs. One of the awesome discussions with Pieter on this topic helped me realize the importance of such IDE tailoring for personal growth as a developer.

    As you probably know, Emacs is almost unusable for development out-of-the-box (vim, even more so). You need to tweak configuration files, pick plugins and wire them together. Most importantly, you need to make dozens of decisions on how you are going to use this contraption for the development.

    That’s what I used to hate about Emacs before, thinking that Visual Studio with ReSharper gave me everything that a developer would ever need.

    I came to realize that setting up your integrated development environment from scratch forces you to become more aware of the actual process of development. You start thinking even about such simple things as the organization of files in a project and how you are going to navigate between them. Or how you are going to refactor your project in the absence of the solution-wide analysis and renaming provided by ReSharper.

    Such troubles affect your everyday coding process, pushing design towards greater decomposition and simplicity. Ultimately, this leads to better understanding.

    In the end, Pieter got so inspired by our insights that he also decided to ditch Sublime and give Vim a try. We are going to compare our setups and development experiences as we progress through the project. I believe this is going to lead to even deeper insight for us.

    June 30 - Scala, Modular Design and RabbitMQ

    Our system is event-driven in nature. Almost everything that happens is an observation of some fact: message-sent, photo-liked, profile-visited. These facts are generated in streams by users interacting with the system. Due to the nature of human interactions, there is little concurrency in these streams and it is OK for them to be eventually consistent. In other words:

    1. A user is likely to interact with the site through one device and a single browser page at a time.
    2. While communicating through the system, users don’t see each other and don’t know how fast the other party responds. If a system takes 1 second to process and deliver each interaction then probably nobody will notice.
    3. The system should feel responsive and immediately consistent (especially while viewing your writes on the profile page and chatting).

    These considerations align very well with reactive and event-driven designs. During the last 2 weeks we played with multiple implementation ideas for that:

    1. Use replayable event streams for replicating state between modules.
    2. Use either FDB-based event storage (which we already have) or the one based on apache Kafka.
    3. Use a messaging middleware capable of durable message delivery across the cluster with decent failover (read as “RabbitMQ”).
    4. Use a pub-sub system without a single point of failure and relaxed message delivery guarantees (read as NSQ or Nanomsg with ETCD).

    Each of these approaches has its own benefits and some related complexity:

    • mental - how easy or hard is it to reason about the system;
    • development - how much plumbing code we will have to write;
    • operational - how easy or hard will it be to run it in production.

    Obviously, we are trying to find approaches which reduce complexity and allow us to focus on delivering business features.

    Scala Theorem

    While talking about Apache Kafka, Tomas had the idea of switching the entire codebase to Scala and the JVM. Java has a lot of big supporters and a large set of great solutions that would fit us. A few days last week were dedicated to evaluating how easy or hard it would be to drop all the go code and switch to Scala / JVM. Here are the conclusions:

    • Scala is a nice language, although builds are insanely slow (compared to sub-second builds in golang).
    • Porting our core domain code to Scala is not going to be a problem; it could probably be done in a week (code is a by-product of our design process).
    • The devil is in the details: learning the rest of the JVM stack is going to take a lot more time than that (e.g.: how do we set up ZooKeeper for Apache Kafka, or what is the idiomatic approach to building a modular web front-end with Java?).

    In the end, switching to Scala was ruled out for now. Even though the switch has long-term benefits, it would delay our short-term schedule too much. Not worth it. Besides, the Java stack seems to introduce a lot of development friction, hurting rapid development and code evolution. These are essential for us right now.

    RabbitMQ

    We also switched to RabbitMQ for our messaging needs - Pieter single-handedly coded a RabbitMQ bus implementation which plugged into our bus interface and worked out-of-the-box. The previous implementation used in-memory channels.

    So far RabbitMQ is used merely to push events reliably between the modules (a rough sketch of this wiring follows the list):

    • all modules publish events to the same exchange;
    • each module on startup can set up its own binding and routing rules to handle interesting events.
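
    A minimal sketch of that wiring with the streadway/amqp client is shown below; the exchange name, queue name and routing keys are illustrative, not our actual ones:

    package main

    import (
        "log"

        "github.com/streadway/amqp"
    )

    func main() {
        conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        ch, err := conn.Channel()
        if err != nil {
            log.Fatal(err)
        }

        // All modules publish into the same exchange.
        if err := ch.ExchangeDeclare("events", "topic", true, false, false, false, nil); err != nil {
            log.Fatal(err)
        }

        // Each module declares its own queue and binds only to the
        // routing keys it is interested in.
        q, err := ch.QueueDeclare("news", true, false, false, false, nil)
        if err != nil {
            log.Fatal(err)
        }
        if err := ch.QueueBind(q.Name, "member.#", "events", false, nil); err != nil {
            log.Fatal(err)
        }

        // Publishing an event from any module looks the same.
        err = ch.Publish("events", "member.flirt-sent", false, false, amqp.Publishing{
            ContentType: "application/json",
            Body:        []byte(`{"from":"nancy","to":"bob"}`),
        })
        if err != nil {
            log.Fatal(err)
        }
    }

    Consumption works the same way: each module reads from its own queue, so adding a new module never requires touching the publishers.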

    Although we no longer consider using event streams for replaying events as part of the development process, we could still have a dedicated audit log. This can be done by setting up a dedicated module to persist all messages, partitioning them by user id.

    Modules

    We spent some time discussing our design with Pieter. One of the important discoveries was a deeper insight into Modules. Previously we talked about our system using components, services and packages interchangeably. This was partially influenced by the term micro-services, which was one of the ideas behind our current design. Some confusion came from that.

    Instead of “micro-services architecture” at HPC we started talking about “decomposing system into focused modules which expose services”

    These weeks we were able to refine our terminology, starting to clarify the codebase as well:

    • our application is composed of modules - a tangible way to structure code and a visual way to group design concepts;
    • we align modules at the design level and modules in the code - they share the same boundaries and names;
    • at the design level, modules are boxes with associated behavior; we need them to contain complexity and to decompose our design into small concepts that are easy to reason and talk about;
    • in the codebase, our modules are represented by folders, which are also treated as packages and namespaces in golang;
    • we like to keep our modules small, focused and decoupled; this requires some discipline but speeds up development;
    • each module has its own public contract by which it is known to the other modules; implementation details are private, cannot be coupled to and are treated as a black box;
    • a public contract can include published events (events are part of the domain language), public golang service interfaces and http endpoints; there are also behavioral contracts setting up expectations on how these work together;
    • in the code, each golang package is supposed to have an implementation of the following interface - that’s how it is wired into the system; all module dependencies are passed into the constructor without any magic.
    type Module interface {
        // Register this module in execution context
        Register(h Context)
    }

    type Context interface {
        // AddAuthHttp wires a handler for authenticated context which
        // will be called when request is dispatched to the specified path
        AddAuthHttp(path string, handler web.Handler)

        // AddHttpHandler wires raw http.Handler to handle unauthenticated
        // requests
        AddHttpHandler(path string, handler http.Handler)

        // RegisterEventHandler links directly to the bus
        RegisterEventHandler(h bus.NodeHandler)

        // ResetData wipes and resets all storage for testing purposes
        ResetData(reset func())
    }

    Getting the notion of modules right is extremely important for us, since it is one of the principles behind our design process. We think, structure our work and plan in terms of modules.

    For the upcoming week we plan to:

    • Clean up the codebase (one module at a time), finishing the alignment with RabbitMQ;
    • Capture and discuss the next HPC features to be implemented (summer vacations are coming and we want to prepare work so that we can continue moving forward even when the rest of the distributed team is offline, taking motorcycle classes or hiking to the top of Mount Elbrus); this will add more stand-alone modules;
    • Start writing two-phase data transformation tooling to export data from the current version of HappyPancake and transfer it into the event-driven version; this would allow us to validate the design of existing modules and stress-test the system.

    PS: Why Emacs is better than Vim

    Over my entire life I’ve been searching for a sensible way to handle tasks and activities, both everyday and job-related. Tooling ranged from MS Project Server (for large projects and distributed teams) to OmniFocus (personal todo lists).

    Earlier this year I discovered org-mode - a set of Emacs extensions for managing notes and tasks in text files. That was the reason for switching from Sublime to Emacs.

    Recently I caught myself managing some small tasks and notes of HPC project via org-mode as well. All hpc-related information is stored in a textual hpc.org file kept in the repository with the source code.

    Anybody could read or even edit this file.

    Emacs, of course, provides more benefits on top of that mode:

    • ability to view and manage entries from all org files on my machine;
    • capturing new tasks with a few key strokes;
    • powerful agenda and scheduling capabilities;
    • exports to a lot of formats;
    • auto-completion, tags, categories, outlining, refiling, filtering etc.

    For example, here is an overview of my agenda, filtered by hpc tasks:

    I think I got Pieter thinking about giving Emacs a try, since Vim does not have org-mode (or a decent port of it).

    July 6 - Distributing Work

    The season of vacations is starting. This week was the last one with our whole team online at the same time. Tomas goes on vacation starting next week. Pieter is probably going to take his as soon as he gets through his bike exams (wishing him the best of luck). I’ll travel to Georgia next week, while working remotely and taking longer weekends.

    Obviously, we want to stay productive during this period and move forward on our project. There are things that usually require full consensus: important decisions about design, specific feature requirements, everything that involves multiple packages at once. Last week was spent going through these things in advance to make sure we have plenty of non-blocking work queued up for the next month.

    More Features

    We have some basic features implemented in the system so far. Software design evolved a bit to support them all while keeping things simple.

    At this point, if HappyPancake2 were a brand-new product, I’d recommend going live (e.g. in stealth mode) as soon as possible in order to start getting real-world feedback from the users.

    No amount of testers and visionaries can replace the knowledge and insights coming from real-world feedback. The duty of software developers is to make this happen as soon as possible and then iterate, incorporating lessons learned.

    However, HappyPancake2 is special - it is already used by thousands of users, so there is already plenty of feedback. We know quite well which features are necessary, which could be discarded and which enhancements we could try next.

    Hence we can keep on working on this project without releasing it. Tomas has all the domain knowledge we need right now.

    We are planning to introduce these features next:

    • Interests - tags that members can add to their profile, allowing other people to find them by interests (and potentially allowing us to provide better matching);
    • blocking - allowing a member to ignore another one (removing him or her from all search results and blocking communications);
    • online list;
    • abuse reports on content with admin review queues;
    • favorite profiles.

    During the week Pieter focused his efforts on developing review functionality, which is one of the most important features in our system.

    Node.js

    We are planning to make a slight change to our stack by implementing the front-end in node.js (which is something Tomas explored last week). This is a relatively small change to the existing system - http endpoints will need to return JSON instead of rendered HTML - so the cost is relatively low. The benefits are:

    • better separation of concerns in our design;
    • ability to use Rendr (render backbone.js apps on the client and the server).

    This would turn our existing code into a back-end with an API, serving JSON requests and streams to the node.js front-end. Such separation gives us more flexibility in the UI while allowing much better testing of the back-end.
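
    As a rough sketch, a back-end endpoint that used to render HTML would now just marshal a view model to JSON for the node.js layer to render. The alertsModel fields below are illustrative, not our actual contract:

    package main

    import (
        "encoding/json"
        "log"
        "net/http"
    )

    // alertsModel is the view model the front-end renders; field names
    // here are made up for the sketch.
    type alertsModel struct {
        Title   string      `json:"title"`
        HasMore bool        `json:"hasMore"`
        Items   []alertItem `json:"items"`
    }

    type alertItem struct {
        Nickname string `json:"nickname"`
        Unread   bool   `json:"unread"`
    }

    func alertsHandler(w http.ResponseWriter, r *http.Request) {
        m := alertsModel{
            Title: "Alerts",
            Items: []alertItem{{Nickname: "nancy", Unread: true}},
        }
        w.Header().Set("Content-Type", "application/json")
        if err := json.NewEncoder(w).Encode(m); err != nil {
            log.Printf("encoding alerts: %v", err)
        }
    }

    func main() {
        http.HandleFunc("/alerts", alertsHandler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }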

    Behavior Testing

    Thanks to the switch from HTML endpoints to JSON, I started introducing package behavior tests to our system last week. These tests set and verify expectations about public contracts exposed by packages. This is quite simple to do:

    1. Given a set of events and dependencies
    2. When we execute an action (usually calling a JSON endpoint)
    3. Expect certain assertions to be true.

    In the longer term I hope to convert these tests into self-documenting expectations (like I did in my previous .NET projects). Having up-to-date documentation of the code, expressed in human-readable language, can be a powerful way to keep stake-holders involved in the project. This means better feedback and faster iterations.

    Code looks like this in golang:

    func (x *context) Test_given_nancy_flirts_bob_when_GET_bobs_alerts(c *C) {
        s := run_nancy_flirts_bob(x)

        r := x.GetJson(s.bobId, "/alerts")
        c.Assert(r.Code, Equals, http.StatusOK)

        var m model
        r.Unmarshal(&m)
        c.Check(m.Title, Equals, "Alerts")
        c.Check(m.HasMore, Equals, false)

        c.Assert(m.Items, HasLen, 1)
        i1 := m.Items[0]

        c.Check(i1.Member.Nickname, Equals, "nancy")
        c.Check(i1.Unread, Equals, true)
        c.Check(i1.Member.IsOnline, Equals, true) // since we have allOnline

        c.Check(x.Service.AnyUnread(s.bobId), Equals, false)
    }

    where the nancy flirts bob scenario is simple code setting up preconditions on the system:

    func run_nancy_flirts_bob(x *context) (info *nancy_flirts_bob) {
        info = &nancy_flirts_bob{hpc.NewId(), hpc.NewId()}
        x.Dispatch(hpc.NewRegistrationApproved(
            hpc.NewId(),
            info.bobId,
            "bob",
            hpc.Male,
            hpc.NewBirthday(time.Now().AddDate(-23, 0, 0)),
            "email",
            hpc.NoPortraitMale))

        x.Dispatch(hpc.NewRegistrationApproved(
            hpc.NewId(),
            info.nancyId,
            "nancy",
            hpc.Female,
            hpc.NewBirthday(time.Now().AddDate(-22, 0, 0)),
            "email",
            hpc.NoPortraitFemale))

        x.Dispatch(&hpc.FlirtSent{hpc.NewId(), info.nancyId, info.bobId})
        return
    }

    The Truth is Born in Argument

    I can’t be grateful enough to Pieter, who has the patience to go through design discussions with me when we disagree about something. Talking things through with him is one of the reasons why our design stays simple, clear and capable of future evolution.

    Design Game

    A lot of our work resembles some sort of puzzle, where we have to do 3 things:

    • find names and words that let us communicate better (we are a distributed team from different countries);
    • discover ways to break a large problem down into small coherent parts (the team is too small to tackle huge problems head-on);
    • decide on the optimal order in which these parts should be handled (our time is limited and has to be applied to the areas where it will make the biggest impact on the project).

    The hardest part is deciding which things have to be done right now and which can be deferred until some point in the future. In some cases implementing a feature without all the necessary data at hand is a waste of time; in other cases it leads to the deeper insight required to move forward.

    We try to optimize the implementation chain a lot - prioritizing the most rewarding and easy features (“low-hanging fruit”) and deprioritizing the ones that are less beneficial for the project. That is an ongoing process, required to apply our limited time most efficiently.

    For example, previously we only pretended to store photos in our system: we simply passed around URLs pointing to photos from the original version of HappyPancake. That was a good decision (defer functionality as long as possible), but the time came to implement it for real.

    During the last weeks Pieter pushed a new media module and spent some time integrating it with our other services. This brought new insights into how we are going to pass this information around through events. We also now know how we could host and scale such a module in production (deploy to multiple nodes and rsync between them).

    Anything related to performance is another example of things we deferred.

    “Big Data”

    So far our development intentionally focused on software design while deferring any potential performance optimizations. Now it is time to start learning about the actual numbers and real-world usage.

    At the end of the week I went back to Visual Studio to start writing an extractor utility. This tool merely connects to the original database and saves some data in a compact binary representation (a compressed stream of protobuf entities). Then I started working on the golang code which scans through that data, producing a stream of events which can be fed into our development project.

    It is good practice to use such two-step data processing (dump the data store to an intermediary format, then iterate on the data dumps) whenever you are working with large datasets coming from a live system. This decouples the process from production systems, reducing the impact and allowing for faster iterations.
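
    A simplified sketch of the reading side is shown below. It assumes a 4-byte little-endian length prefix before each record; the actual framing and the protobuf decoding step are omitted, and the file name is made up:

    package main

    import (
        "compress/gzip"
        "encoding/binary"
        "fmt"
        "io"
        "log"
        "os"
    )

    // scanDump reads a gzipped stream of length-prefixed records and hands
    // each raw record to the callback (which would decode it and turn it
    // into domain events).
    func scanDump(path string, handle func(record []byte) error) error {
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()

        gz, err := gzip.NewReader(f)
        if err != nil {
            return err
        }
        defer gz.Close()

        var length uint32
        for {
            if err := binary.Read(gz, binary.LittleEndian, &length); err == io.EOF {
                return nil // clean end of the dump
            } else if err != nil {
                return err
            }
            record := make([]byte, length)
            if _, err := io.ReadFull(gz, record); err != nil {
                return err
            }
            if err := handle(record); err != nil {
                return err
            }
        }
    }

    func main() {
        count := 0
        err := scanDump("messages.dump.gz", func(record []byte) error {
            count++ // here we would decode the record and dispatch events
            return nil
        })
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("records:", count)
    }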

    We started with the Finland database, which is one of our smaller installations, yet there already is a fair bit of data to process. For example, there are more than 1200000 messages, taking 230MB in compressed binary form, plus 11MB of member data and 2MB of user flirts. Sweden is 50-100 times larger than that.

    This might seem like a lot of data, however it is not. Our entire Sweden dataset, if encoded properly, could fit on a single smart-phone and be processed by it. It’s just large. However, since we haven’t introduced many performance considerations into our design yet (aside from keeping it scalable), some tuning will be necessary.

    I haven’t worked with large datasets for more than half a year, so I’m really looking forward to getting back into this field. The real-time, reactive nature of the data makes it even more interesting and exciting.

    July 21 - Smarter Development

    Shorter Feedback Loop

    We have a continuous integration server responsible for running tests on code pushed to the repository. It is more diligent than humans and always runs all tests. However, in order to see build failures one had to visit a build page (which didn’t happen frequently). Our builds were broken most of the time.

    I tried to fix that by plugging the build server directly into our main chat. All failures and successes are reported immediately. The build now stays green most of the time.

    Working with Finland

    I spent time trying different strategies to populate our system with the Finland dataset. This population happens by generating events from the raw data dump and dispatching them to our system. Currently we generate only a subset of events, but that is already more than 1000000 of them, sent at once. If we can handle that, then we stand a chance against the Sweden dataset.

    I focused on the news module, which has some of the most complicated and time-consuming denormalization logic:

    • each member has his own newsfeed;
    • each member has an interest in some people (e.g. in females with age between 25 and 30 and living in city X);
    • newsfeed is populated with events coming from other members which are interesting to this member;
    • new members by default will have an empty newsfeed, we need to back-fill it with some recent events from interesting members;
    • if member blocks another member, then events from the blocked member will no longer show up in a newsfeed, existing events have to be removed.

    My initial implementation of the news module was handling events at the astonishing speed of 2-10 events per second. I spent multiple days learning the inner workings of our stack and looking for simple ways to improve the performance.

    StatsD and EXPLAIN ANALYSE in PostgreSQL helped a lot to reach a speed of 200-400 events per second.

    The solution was (a simplified sketch follows the list):

    • push all event denormalization to PostgreSQL server (fewer roundtrips);
    • handle each event within an explicit transaction (no need to commit between steps within the event handling);
    • rewrite queries till they are fast.
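
    A simplified sketch of that shape of handler: one transaction per event, with the fan-out pushed into a single INSERT ... SELECT on the server. Table and column names are made up for illustration, not our actual schema:

    package news

    import "database/sql"

    // handleDiaryPosted fans one diary-posted event out into the newsfeeds of
    // every member who is interested in the author, entirely on the server and
    // inside one transaction (one round-trip instead of one INSERT per member).
    func handleDiaryPosted(db *sql.DB, authorId, entryId string) error {
        tx, err := db.Begin()
        if err != nil {
            return err
        }
        defer tx.Rollback()

        _, err = tx.Exec(`
            INSERT INTO newsfeed (member_id, author_id, entry_id)
            SELECT i.member_id, $1, $2
            FROM interests i
            WHERE i.interested_in = $1`, authorId, entryId)
        if err != nil {
            return err
        }
        return tx.Commit()
    }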

    So far the performance is good enough that we don’t need to bother about pushing it further. Adding more features is the most important thing now.

    Control is important

    It is really important for members of our dating web site to know that they are in control. They should be able to easily block any unwanted communications. That’s why we have the block feature - the ability to put another member on an ignore list, effectively filtering him or her out of all news feeds, conversations and any other lists.

    I started working on that interesting feature only to realize that it has a lot of implications. Every other module reacts differently to the fact that a user has been blocked.

    We need to somehow keep track of all these requirements. Preferably not in the form of a document, since documents get stale and outdated really fast (keeping them fresh requires time and concentration which could otherwise be spent developing new features). Ideally, these requirements could also be verified automatically.

    Improving Use Cases

    I invested some time to improve our module BDD tests, transforming them into proper use-cases. These use-cases:

    • are still written in golang;
    • are executed with unit tests (and provide detailed explanation in case of failure);
    • can be printed out as readable stories;
    • can be printed as a high-level overview of the system.

    Of course, these stories aren’t readable by absolutely everybody. That’s not the point. Their purpose is:

    • give a sense of accomplishment to developers, encouraging them to write tested code (me and Pieter);
    • align tests with expectations from the system (help to make sure that we are testing what is important);
    • provide a quick up-to-date documentation of the API and scenarios for other developers who would be working with the system (Tomas);
    • express behaviors of the system in a way that is not tied to any language (e.g.: Tomas will not need to dive into the golang code in order to consume API from node.js).

    The best part is that these nice stories are generated automatically from the code. No extra effort is required (apart from writing a bit of rendering logic in golang while riding a bus in Adjaria).

    With this approach it becomes simpler to get a high-level overview of what is already done in the system and what remains to be done. Simply list the names of all passing use cases per module and you have that overview. Other interesting transformations are also possible (e.g.: dependency graphs between modules, event causality patterns etc.). They all provide additional insight into the domain model, helping us understand the code we write and maintain its integrity.

    It is quite possible that we will completely discard this code once we hit production. The need to maintain the integrity of the domain model will be replaced by different forces by then.

    My plans for the upcoming days are to keep covering our existing functionality with these use-cases and adding new functionality.

    July 29 - Delivering Features and Tests

    Last week was quite productive and exciting.

    Introduction of use cases into our development cycle worked out really well, helping to deliver tangible features in the absence of tangible UI to target (node.js front-end development is paused till Tomas gets back from the vacation).

    These use cases so far:

    • serve as high-level behavior tests aligned with the domain language (compact and non-fragile);
    • drive development towards a better event-driven design;
    • produce nice human-readable documentation, as a side-effect;
    • provide really fast feedback cycle.

    Actually, these use cases are the design. We could probably take them and rewrite the entire system in a completely different language in 2 weeks. And we could do that without losing any features or scalability options.

    However, these nice benefits are not as important as the fact that we spent last week developing new features and improving code coverage, while really enjoying the process.

    Pieter jumped right into the game, picking up use case development and extending the framework to support edge cases which I missed (e.g.: testing file uploads, or using a real HTTP server to allow inspecting raw HTTP requests with WireShark). He already covered the drafts module with use cases.

    Pieter also invested time last week cleaning things across the code-base.

    I spent the last week adding use cases (coverage of chat, alerts, news, poll), fixing bugs revealed by them, and adding proper handling of member-blocked and member-unblocked across the system.

    As of now, we have 33 use cases covering 15 API calls. We know this number exactly, because of a tiny little helper command summary which can print out information about all use cases.

    With that command (and the power of BASH), one can easily answer questions like:

    • How many use cases are in the system?
    • Which URIs are not covered by the tests?
    • Which events are published or consumed by module X?
    • What are the dependencies between the modules?
    • Which events are not covered by any use case?

    This self-building knowledge about the system is another reason which makes writing use cases so rewarding.

    I also took some time to tweak our build server to include commit summaries in the chat messages posted to Slack. This way it is easier to observe team progress without going to the git repository. It also encourages frequent pushes, since drone picks up only the latest commit in a push.

    This week I’m going to continue delivering features, covering them with more use cases and also working on the ETL code to extract data from HPCv1 into our new system.

    August 2 - Data, Use Cases And New Module

    Last week I was simply developing in a pleasing and steady way:

    • alerts - clean JSON API and more use cases to verify its behavior;
    • diary - clean JSON API, more use cases and support of member blocking;
    • chat - more use cases;
    • like - clean JSON API and use cases;
    • favorite - implemented full module, including JSON API, major use-cases, etl and seeding.

    Data Extraction from v1

    I spent some more quality time with .NET/C# last week, adding more data types to our script responsible for graceful data extraction from HPCv1 databases into compact binary representation (GZIPped stream of length-prefixed binary-encoded objects). This representation works very well for further data processing even at the scale of terabytes.

    So far I extracted data from all of the largest tables in Finland database, writing event generation code for all matching modules and smoke-testing them on glesys. HPCv2 handles that data pretty well, although RabbitMQ gets a little strained while handling 1500000 messages and copying them into a dozen queues. We’ll probably need to optimize our message routing strategy a little here.

    Fortunately, we can simply reuse data from our wonderful use case suite.

    I’ll be on vacation next week, so we made sure the process of data retrieval (from binary dumps) and event seeding could be reproduced on Pieter’s machine. It worked without issues.

    Use Cases

    We are slowly falling in love with the use case approach in the codebase of HPCv2. Writing them is a pleasure, and they actually speed up and simplify development. At the moment of writing we have 50 of them, verifying different behaviors of the JSON API for the frontend that Tomas will be working on when he gets back.

    I added the ability to render use cases into a dependency graph, helping to see development results from a different perspective. A visual representation lets your brain process the code differently, making it easier to spot new dependencies or gaps. It is easier to communicate, too.

    For example, while developing favorite module from scratch, its graph looked like this:

    Later that day, when the module was complete and covered with 7 use cases, it looked like this:

    This graph is auto-generated from the code via the following process (a rough sketch follows the list):

    1. Load a specific module (or all of them), inspecting registrations in the process.
    2. Inspect all use cases for input events, HTTP requests and output events. We can do that because each use case is simply a data structure, describing: GIVEN previous events, WHEN we call API endpoint, THEN expect API result and 0 or more events.
    3. Print resulting data model in a dot file format for graphviz program to render.
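
    Roughly, the rendering step boils down to something like the sketch below; the UseCase shape here is a simplification of our actual structure, and the sample data is invented:

    package main

    import (
        "fmt"
        "io"
        "os"
    )

    // UseCase is a simplified stand-in for our actual use case structure:
    // GIVEN events, WHEN an HTTP request, THEN expected events.
    type UseCase struct {
        Module      string
        GivenEvents []string
        Request     string
        ThenEvents  []string
    }

    func writeDot(w io.Writer, cases []UseCase) {
        fmt.Fprintln(w, "digraph hpc {")
        for _, c := range cases {
            for _, e := range c.GivenEvents {
                fmt.Fprintf(w, "  %q -> %q;\n", e, c.Module) // consumed events
            }
            fmt.Fprintf(w, "  %q -> %q;\n", c.Request, c.Module) // HTTP entry point
            for _, e := range c.ThenEvents {
                fmt.Fprintf(w, "  %q -> %q;\n", c.Module, e) // published events
            }
        }
        fmt.Fprintln(w, "}")
    }

    func main() {
        cases := []UseCase{{
            Module:      "favorite",
            GivenEvents: []string{"registration-approved"},
            Request:     "POST /favorites",
            ThenEvents:  []string{"member-favorited"},
        }}
        writeDot(os.Stdout, cases)
    }

    The output can then be piped straight into graphviz, e.g. go run graph.go | dot -Tpng > graph.png.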

    Of course, if some dependency is not covered by a use case, then it is not visible on such graph. I consider this to be a good feature, since it encourages me to cover all important cases.

    Of course, it is possible to graph all modules and their dependencies. That map would be useful for spotting some loose ends or old dependencies.

    Pieter was busy cleaning up the overall codebase, working on the implementation of draft and review and getting rid of some obsolete logic.

    Next week I will be completely on a vacation, spending time around the Elbrus mountain. If there are any free periods of time (e.g. long bus rides), I’d love to clean up the profile module, adding a clean JSON API to it.

    Living Documentation

    Tomas is coming back from his vacation that week. He’ll probably get back to front-end development on top of our new JSON API. When he does, he can use the living documentation for our current system.

    First, run ./r story to see all use cases.

    Second, run ./r summary to see grep-able metadata about the system (mostly derived from the use cases).

    Third, run ./r graph | dot -Tpng > graph.png to generate a dot file for the system and render it with graphviz.

    Of course, each output can be altered with other programs like grep to filter, group and aggregate information in interesting ways.

    This kind of documentation always stays up-to-date and does not need any effort to maintain.

    August 16 - Back from the Vacation

    It is good to be back from vacation. Not only do you feel rested, you also get to see all the cool things done by the team.

    Tomas and Pieter focused on pushing forward the seeding utility that takes data dumps from the production systems and transforms them into events for the second version of HappyPancake. They moved beyond the Finland dataset (the smallest one) and started seeding Sweden, which yields more than 500.000.000 events. This allowed the code to be polished heavily against real-world data (e.g. memory and connection leaks are now detected early).

    I focused on cleanups and cross-cutting concerns this week. Removing the member module helped to make modules more decoupled (now almost all data is denormalized by modules into their own storages).

    Then, to push this even further, I physically separated modules from each other, giving each module a separate database for production and tests. This is a big win for us, since it allows us to replace one big and expensive SQL server (currently running in production) with a bunch of smaller servers that cost less and are easier to scale.

    This improvement required the introduction of a module specification - a data structure describing the static qualities of a module, which are known even before it is registered in the infrastructure (a minimal sketch follows the list). Such a specification currently includes:

    • module name (also acting as DB name, if it has one);
    • module DB schema (SQL scripts to create and reset DB contents);
    • use cases that describe behaviors of the module.
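
    A minimal sketch of what such a specification could look like as a data structure; the field names and types here are assumptions for illustration, not our actual code:

    package spec

    // UseCase is a placeholder for the structure describing one
    // given/when/then scenario of a module.
    type UseCase struct {
        Name string
        // ... given events, request, expected response and events
    }

    // ModuleSpec captures the static qualities of a module that are known
    // before it is registered in the infrastructure.
    type ModuleSpec struct {
        // Name doubles as the database name, if the module owns one.
        Name string
        // Schema holds the SQL scripts to create and reset the module's storage.
        Schema []string
        // UseCases describe the behaviors of the module.
        UseCases []UseCase
    }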

    With such information, we can create databases while initializing the environment and pass connections to these databases to modules on registration. This also allows us to run use case verification in separate temporary databases, which are destroyed at the end of the test run.

    With improvements from Pieter, our auto-generated module diagrams became even more helpful. They give an additional perspective on the code, allowing us to see missing or odd pieces. For example, here is the diagram of the chat module:

    As you can see, POST /chat/poll is marked as red, since it is implemented but not covered by a single use case. This endpoint serves data to the legacy UI served directly by the back-end; it is to be removed, hence there was no point in testing it. The red marker serves as a concise reminder of that.

    The same goes for the member-typing and member-read-thread domain events, which are subscribed to but never really used (in a way that is verified by use cases). This is also something we will need to clean up once focus shifts back to the UI.

    Next week we plan to decide on the road map for implementing our UI. Currently it is served as HTML directly by our back-end, which is not a good thing (modules get complicated). Possible options include:

    • move html generation with all the resources into a separate front-end module (or a collection of them);
    • kill existing UI code and reimplement everything with node.js and rendr;
    • find some other alternatives.

    In the longer term we want a rich single-page application that offers a smooth experience to our users (feeling almost like a native application). However, implementing it right now might be a long and risky shot. It seems safer to capture the existing requirements in a simple user interface (building upon the code we already have, but with a better application design) and deliver that to production. We can always improve it later.

    Besides user interface, there also is some more back end work to do, like:

    • implementing albums module (public and private photo albums);
    • implementing tags and categories for the profiles;
    • improving performance of the system to let it process Sweden dataset faster;
    • figuring out profile properties.

    I personally look forward to getting back to work on the front end, since it is part of the critical path (in project management terms) to our delivery. Earlier this week I started reading the book on Single Page Web Applications, only to discover that web app development these days strongly resembles the development of desktop applications. Aside from the specifics of HTML, CSS, JS and various browsers, the design principles and patterns are quite similar. It should be fun.

    To be continued…

    This story of HappyPancake is still a work in progress. I try to post weekly updates to the R&D blog of HappyPancake, later incorporating them into this post.