Data Infrastructure at Uniplaces

Miguel Loureiro
Oct 5, 2017 · 8 min read

Finding your single source of truth on AWS by applying the Lambda Architecture

At Uniplaces we’re focused on letting data drive company decisions, which means we need… data. In this article, we’ll share what we believe to be a common issue when growing a startup with data in mind, and how we’ve solved it in order to scale and provide a single source of truth to the company.

A typical scenario

After speaking with a lot of people in my network who work in the tech startup industry, I found a common scenario that haunts most of them:

  1. You have a business and you start building your product
  2. You find it’s easier to start with a monolith backend that’ll cover your business requirements picking whatever-driven-development
  3. You want to persist your operational data in a NoSQL database that will represent your application state
  4. Great, we’re live
  5. Other teams start asking questions. “Give me metrics for X and Z”, “I need to visualize data in whatever-bi-analytics-tool and it only supports SQL”, bla bla…
  6. Shit… let’s do an ETL to a reporting database
  7. Ok cool, people can get their own answers from the database
  8. Shit… the reporting database is always breaking due to the constant changes in your domain model and how you project it to your operational database.

Guess what? It was also haunting us!

A weak reporting database makes you lose too much time on maintenance, gives you no confidence in the data you’re analyzing, and, worst of all, pushes other teams to find different ways to get the answers they need, leading to the anti-pattern of having multiple sources of truth when it’s imperative to have one.

Understanding the issues and finding a solution

Before jumping into a solution we wanted to pinpoint the main issues we needed to solve. Here’s a list:

  • Engineers were taking too long fixing small issues with the reporting database
  • Our domain model was tightly coupled to our data infrastructure due to the ETLs from the operational database (DynamoDB) to our reporting database (MySQL)
  • ETLs took too long
  • Any human error in the code would give us wrong data, and there was no easy way to get the correct data back
  • Schema changes weren’t very nice to handle, because different stakeholders have different business questions

So, the four main goals were to:

  • Reduce engineering effort
  • Have a single source of truth
  • Enable Data Analysts from different teams to have ownership of the reporting database, providing the right tool set to create the views they need to answer different business questions
  • Scale, not only for Data Analysts, but also for Data Scientists

Deeply inspired by the Lambda Architecture described in Big Data: Principles and best practices of scalable realtime data systems by Nathan Marz and James Warren, we decided to aim towards our goals with these principles in mind.

How we’ve applied the Lambda Architecture to build our new Data Infrastructure

We will not go deep into what the Lambda Architecture is all about, because there are a ton of resources on the Internet and excellent books available, as I’ve mentioned previously. What we struggled to find, when figuring out how to implement the architecture, were resources with actual implementations. That’s why we are sharing our approach. Note that we didn’t apply everything by the book, since we don’t feel we need all the layers it provides (more specifically the speed layer) for now.

The goal here is to give you a walkthrough of how we’ve built the Batch and Service layers, providing insights into the benefits of the implementation we’re making. Below is an overview of our data pipeline.

Overview of the architecture and the services we’re using

1. Sending events

Let’s consider we have an entity called Accommodation Provider and we want to have data related to it. I’ll give the examples in PHP so that I can easily copy/paste some parts of the code instead of writing them from scratch, just for the sake of this article.

An Accommodation Provider domain object is described below:
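A simplified sketch of the entity — the exact fields and method names are illustrative, not our production code:

```php
<?php
// Minimal sketch of the Accommodation Provider aggregate (illustrative names).

final class BasicInfo
{
    private $name;
    private $email;

    public function __construct(string $name, string $email)
    {
        $this->name = $name;
        $this->email = $email;
    }

    public function name(): string { return $this->name; }
    public function email(): string { return $this->email; }
}

final class AccommodationProvider
{
    private $id;
    private $basicInfo;

    public function __construct(string $id, BasicInfo $basicInfo)
    {
        $this->id = $id;
        $this->basicInfo = $basicInfo;
    }

    public function id(): string { return $this->id; }
    public function basicInfo(): BasicInfo { return $this->basicInfo; }

    public function updateBasicInfo(BasicInfo $basicInfo): void
    {
        $this->basicInfo = $basicInfo;
        // In practice this would also record a domain event
        // (e.g. a hypothetical AccommodationProviderBasicInfoWasUpdated)
        // to be dispatched to listeners later.
    }
}
```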

And the corresponding object serialization to DynamoDB
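In DynamoDB’s attribute-value notation, the persisted item looks roughly like this (attribute names and values are illustrative):

```json
{
  "Id": { "S": "ap-123" },
  "BasicInfo": {
    "M": {
      "Name":  { "S": "Miguel" },
      "Email": { "S": "miguel@example.com" }
    }
  }
}
```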

We can see that our entity has one Value Object called BasicInfo that holds the name and email of an Accommodation Provider. We want to send information to our Master dataset, and for that we need to understand how to model it with scalability and flexibility to change in mind.

Immutable data and the Fact-Based-Model

When looking at how to model our master dataset, it was easy to see that an immutable/append-only data model was the best choice for us, with human-fault tolerance being a particularly attractive ‘feature’.

Consider our Accommodation Provider above: let’s imagine Miguel and Pedro updated their emails, which would trigger an ETL into the usual reporting database. The scenario would be something like the one described in the picture.

With an immutable data model the approach is different: instead of having a row for each accommodation provider, we store each single atomic fact.

In an immutable data model each piece of data is true in perpetuity. Once true, it must always be true.

In the end you’re keeping the entire history of your data, storing one fact per record, which makes it easier to handle partial information about an entity without introducing NULL values in your master dataset. It also supports the evolution of your data, because company data types change considerably, even more in a startup! It’s human-fault tolerant and people will make mistakes! With an immutable data model, no data can be lost. If bad data is written, earlier (good) data units still exist. Fixing things is just a matter of recomputing the views built from the master dataset.
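To make this concrete, the two email updates above end up as separate fact records in the master dataset, roughly like this — one fact per record, with the field names and values here being purely illustrative:

```json
{"apId": "ap-1", "field": "email", "value": "miguel@old-domain.com", "occurredAt": "2017-06-01T10:00:00Z"}
{"apId": "ap-1", "field": "email", "value": "miguel@new-domain.com", "occurredAt": "2017-09-12T16:30:00Z"}
{"apId": "ap-2", "field": "email", "value": "pedro@old-domain.com",  "occurredAt": "2017-05-20T09:15:00Z"}
{"apId": "ap-2", "field": "email", "value": "pedro@new-domain.com",  "occurredAt": "2017-09-14T11:45:00Z"}
```

Nothing is overwritten: the old emails are still there, and the “current” email is simply the latest fact for that provider.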

Let’s now see how we’re doing it!

Every time there is an update of an Accommodation Provider’s BasicInfo, we have an event listener responsible for handling that domain event and sending the information to AWS Kinesis Firehose.
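Something along these lines, using the Firehose client from the AWS SDK for PHP — the listener and event class names, as well as the payload shape, are illustrative rather than our exact code:

```php
<?php
// Sketch of an event listener pushing BasicInfo updates to Kinesis Firehose.
// Class, event and stream names are illustrative.

use Aws\Firehose\FirehoseClient;

final class SendBasicInfoUpdatedToFirehose
{
    private $firehose;
    private $deliveryStreamName;

    public function __construct(FirehoseClient $firehose, string $deliveryStreamName)
    {
        $this->firehose = $firehose;
        $this->deliveryStreamName = $deliveryStreamName;
    }

    public function handle(AccommodationProviderBasicInfoWasUpdated $event): void
    {
        // We send the whole updated payload; the Lambda transformation
        // downstream is what breaks it into atomic facts.
        $payload = [
            'apId'       => $event->accommodationProviderId(),
            'basicInfo'  => [
                'name'  => $event->basicInfo()->name(),
                'email' => $event->basicInfo()->email(),
            ],
            'occurredAt' => $event->occurredAt()->format(DATE_ATOM),
        ];

        $this->firehose->putRecord([
            'DeliveryStreamName' => $this->deliveryStreamName,
            // Firehose concatenates records, so we add a newline delimiter.
            'Record' => ['Data' => json_encode($payload) . "\n"],
        ]);
    }
}
```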

Please read the comments carefully in order to understand what’s going on. Note that, since we’re not sending atomic facts to the delivery stream, we need to transform the payload before it lands in the master dataset — and the toolset around Kinesis for that is great!

2. Handling events

We’re using AWS Lambda functions (the name has nothing to do with the Lambda Architecture) with NodeJS, which integrate seamlessly with Kinesis, to do those transformations. To handle Lambda function deploys we’re using Apex, so now we just need to care about the code that’ll produce the atomic facts from the payload.

Let’s first look at the test (using Jest) to easily understand what’s going on.
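A test along these lines captures the intent, assuming a small toFacts module that turns one BasicInfo payload into a list of atomic facts (the module and field names are illustrative):

```js
// __tests__/toFacts.test.js — illustrative Jest test for the payload-to-facts transformation
const toFacts = require('../toFacts');

describe('toFacts', () => {
  it('splits a BasicInfo payload into one atomic fact per field', () => {
    const payload = {
      apId: 'ap-1',
      basicInfo: { name: 'Miguel', email: 'miguel@example.com' },
      occurredAt: '2017-09-12T16:30:00Z',
    };

    expect(toFacts(payload)).toEqual([
      { apId: 'ap-1', field: 'name', value: 'Miguel', occurredAt: '2017-09-12T16:30:00Z' },
      { apId: 'ap-1', field: 'email', value: 'miguel@example.com', occurredAt: '2017-09-12T16:30:00Z' },
    ]);
  });
});
```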

With the tests passing, we just need to code the actual implementation and later hook it up with the AWS Firehose delivery stream.
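A sketch of both pieces is below: the pure transformation and the Firehose-facing handler. Keep in mind that Firehose’s transformation contract expects exactly one output record per input record, so here the facts of each record are packed as newline-delimited JSON; module and function names are assumptions, not our exact code.

```js
// toFacts.js — turns one BasicInfo payload into atomic facts (illustrative)
function toFacts(payload) {
  return Object.keys(payload.basicInfo).map((field) => ({
    apId: payload.apId,
    field: field,
    value: payload.basicInfo[field],
    occurredAt: payload.occurredAt,
  }));
}

module.exports = toFacts;
```

```js
// handler.js — Kinesis Firehose data-transformation Lambda (Node.js sketch)
// Apex's Node.js shim calls `handle`; use `handler` if deploying without Apex.
const toFacts = require('./toFacts');

exports.handle = (event, context, callback) => {
  const records = event.records.map((record) => {
    try {
      // Incoming data is base64-encoded JSON sent by the PHP listener.
      const payload = JSON.parse(Buffer.from(record.data, 'base64').toString('utf8'));
      // Firehose expects one output record per input record, so we pack the
      // facts for this record as newline-delimited JSON.
      const facts = toFacts(payload)
        .map((fact) => JSON.stringify(fact))
        .join('\n') + '\n';
      return {
        recordId: record.recordId,
        result: 'Ok',
        data: Buffer.from(facts, 'utf8').toString('base64'),
      };
    } catch (err) {
      // Failed records can be replayed once the code is fixed.
      return { recordId: record.recordId, result: 'ProcessingFailed', data: record.data };
    }
  });

  callback(null, { records: records });
};
```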

After deploying the Lambda function we’re all set to hook it up with Kinesis Firehose. The nice thing about Kinesis Firehose is that it can buffer as much information as you want in order to do batch writes, instead of writing on every single event, which scales write-intensive operations. But where are we writing our master dataset? The two candidates were HDFS and S3, the latter being our choice, since we wanted to avoid the whole bottleneck of managing HDFS ourselves, and S3 is also a distributed file system covering most of what we need from HDFS. Even better, Firehose and Kinesis already handle vertical partitioning out of the box, so for us the choice was obvious.

After hooking the Lambda function to the delivery stream, we just need to output the data to our destinations. We chose to have our master dataset on S3 and then figure out how to put an SQL interface in front of it for our analysts.

PrestoDB could have been a choice, but Kinesis Firehose allows us to very easily integrate with Redshift, which also avoids the bottleneck of managing a cluster for yet another service.

We decided to have Firehose writing to both S3 and Redshift, so analysts can easily start computing their batch views while everything is kept on S3, in case our data scientists want to pick other tools.

And now, how do we connect all the dots?

3. Storing data in the master dataset

Looking at the Kinesis Firehose configuration, the integration with Redshift is nothing more than a COPY command from S3 to Redshift. The two important things we need to configure are the COPY command itself and the mapping of the unstructured data we have on S3 to a Redshift table.

Let’s first create a Redshift table
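A minimal version, matching the fact shape used throughout this article — column names and types are assumptions:

```sql
CREATE TABLE ap_domain_profile_events (
    ap_id       VARCHAR(64)   NOT NULL,
    field       VARCHAR(64)   NOT NULL,
    value       VARCHAR(1024),
    occurred_at TIMESTAMP     NOT NULL
)
DISTKEY (ap_id)
SORTKEY (ap_id, occurred_at);
```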

Now we need to configure our COPY command
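In the Firehose configuration we only fill in the target table, the columns and the COPY options, but the command Firehose ends up issuing is equivalent to something like this — the bucket and paths are placeholders, and Firehose injects the location and credentials clauses itself:

```sql
COPY ap_domain_profile_events
FROM 's3://<bucket>/<prefix>'
JSON 's3://<bucket>/jsonpaths/ap_domain_profile_events.json'
TIMEFORMAT 'auto'
TRUNCATECOLUMNS;
```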

And finally, we need to create a JSON Path file to map our JSON data to the ap_domain_profile_events table, then send it to S3 and later include it in the COPY command options.
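The JSONPaths file simply lists one JSON expression per table column, in column order; with the fact shape assumed above it looks like this:

```json
{
  "jsonpaths": [
    "$.apId",
    "$.field",
    "$.value",
    "$.occurredAt"
  ]
}
```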

The final configuration is below (an image is worth a thousand words :) )

And this is how we’re implementing our batch layer: every time the buffer fills up with data to write, Kinesis Firehose writes to S3 and runs a COPY command to Redshift.

4. Preparing the data to be used by generating computed views

Now that we have data being stored in our master dataset, we need to make it usable. Looking at tables full of facts, it’s obvious that they’ll be hard to work with directly and queries won’t be performant.

We need to compute a View that aggregates the facts and represents a snapshot of the current state of our Accommodation Providers. Example below:
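A sketch of such a view, assuming the ap_domain_profile_events table defined earlier: it keeps only the latest value of each field per Accommodation Provider and pivots those values into columns.

```sql
CREATE VIEW accommodation_provider_current AS
WITH latest_facts AS (
    SELECT
        ap_id,
        field,
        value,
        ROW_NUMBER() OVER (
            PARTITION BY ap_id, field
            ORDER BY occurred_at DESC
        ) AS rn
    FROM ap_domain_profile_events
)
SELECT
    ap_id,
    MAX(CASE WHEN field = 'name'  THEN value END) AS name,
    MAX(CASE WHEN field = 'email' THEN value END) AS email
FROM latest_facts
WHERE rn = 1
GROUP BY ap_id;
```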

By aggregating the information available in the facts, running the query gives us a View that represents the current snapshot of our Accommodation Providers.

The result of this View can be used wherever you want. You can export it to a different MySQL database, index whatever you want, and fine-tune it for performance in your Service Layer.

This was the way we found to improve our processes and data reliability.

  • We now have a clear separation of responsibilities, where engineers are accountable for sending events to the Data pipeline through Kinesis, whereas the Data team can work on the transformations they need to create Facts and populate the master dataset
  • Analysts can build their own views from the master dataset and have them being served on the Service Layer
  • By using immutable data and having an append-only system we never lose data and we’re ready for human errors
  • If the transformation fails or an engineer sends incompatible data to the transformation process, Firehose allows us to re-run the failed events and add them back to our Data pipeline once the code is fixed
  • Our master dataset is ready to work with either Redshift or Hadoop/Spark, if our Data Scientists prefer to work with a different set of tools.

I would really recommend reading about the Lambda Architecture to better understand what this is all about. Our goal was mainly to show how we’ve implemented it; otherwise this would be a super extensive article.

Hope this article gives you some hints and a different perspective on how you can do it!
