# Terraform for the AWS SnowPlow Pipeline

This configuration uses the following AWS services to host SnowPlow. There
may be more in use, but these are the primary services.
1. EC2 (Auto Scaling Groups, Launch Configurations, ELB, Target Groups,
  Security Groups)
1. Kinesis (Streams and Firehose)
1. Lambda
1. DynamoDB
1. IAM (Policies and Roles)
1. S3
1. VPC (Subnets, VPC, Internet Gateways, Routes, Routing Tables)

## Design Document
If you want to know more about the SnowPlow infrastructure, please consult the
[design document](https://about.gitlab.com/handbook/engineering/infrastructure/design/snowplow/).

## SnowPlow Installs and Configs
There are two types of SnowPlow nodes (Collectors and Enrichers) and they are
all configured and installed via user-data in the launch configurations.

## Kinesis Streams
Kinesis is how SnowPlow hands off data from collector to enricher to s3loader.
The streams are:
* snowplow-raw-good
* snowplow-raw-bad
* snowplow-enriched-good
* snowplow-enriched-bad

## Kinesis Firehose and Lambda
Kinesis Firehose takes events from a stream, applies a Lambda function to
each event, and then writes it to the S3 bucket.

## Lambda Function
Firehose uses a Lambda function to format events written to S3. The Lambda
function code is in the file ```lambda/lambda_function.py```. As of this
writing, all this function does is append a newline to the end of each event
before it is written to S3.
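
As a rough illustration of the behavior described above, a Firehose
transformation function that appends a newline could look like the sketch
below. This is not the contents of `lambda/lambda_function.py`; it only assumes
the standard Firehose transformation contract, where each record arrives with a
`recordId` and base64-encoded `data`, and is returned with a `result` status.

```python
import base64


def lambda_handler(event, context):
    """Append a newline to every Firehose record (illustrative sketch only)."""
    output = []
    for record in event["records"]:
        # Firehose hands each payload to the function base64-encoded.
        payload = base64.b64decode(record["data"])
        payload += b"\n"
        output.append({
            # The recordId must be echoed back unchanged.
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(payload).decode("utf-8"),
        })
    return {"records": output}
```

Without the trailing newline, consecutive events would presumably run together
in the S3 objects, which makes them harder to split back into individual events.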

The AWS provider for Terraform requires a zip file of this code to update or
create the Lambda function. There is a data archive object in the config with
the name ```snowplow_lambda_event_formatter_archive``` that builds the zip file
from the function's Python script. For now, the zip contains a single file (the
lambda_function.py file) with no directory structure.

If the hash of that file changes, Terraform will try to update the function.
It's possible that the zip file hash changes even though no code changes were
made. In that case it is OK for the function to be replaced on the fly in the
Lambda config.
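
One general way a zip hash can change without a code change is that zip
archives record per-file metadata such as modification times. The sketch below
demonstrates this with Python's standard library. It is only an illustration of
the zip format; whether the Terraform archive behaves this way in this
configuration is an assumption, and `zip_sha256` is a hypothetical helper name.

```python
import hashlib
import io
import zipfile


def zip_sha256(source: bytes, mtime: tuple) -> str:
    """Zip one file in memory and return the SHA-256 of the archive bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The timestamp is stored in the archive next to the file contents.
        info = zipfile.ZipInfo("lambda_function.py", date_time=mtime)
        archive.writestr(info, source)
    return hashlib.sha256(buffer.getvalue()).hexdigest()


# Same source bytes, two different modification times.
CODE = b"def lambda_handler(event, context):\n    return event\n"
FIRST = zip_sha256(CODE, (2020, 1, 1, 0, 0, 0))
SECOND = zip_sha256(CODE, (2020, 1, 2, 0, 0, 0))
```

Here `FIRST` and `SECOND` differ even though `CODE` is identical, so a changed
archive hash on its own does not prove that the function code changed.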

## DynamoDB
The enricher and s3loader nodes use DynamoDB to track Kinesis state. Normally
these tables would be allocated by Terraform, but the setup did not seem to
work properly unless the nodes created the tables themselves. Therefore, access
to the tables is controlled by roles and policies, but the tables are managed by
the SnowPlow nodes that need them. If a table needs to be created, the nodes
will do that on their own.
* SnowplowEnrich-gitlab-us-east-1
* SnowplowS3Loader-gitlab-us-east-1

## Launch Config Changes and Production Instances
Updates to the launch config apply only to new instances coming up in the
auto-scaling group; existing EC2 instances are not changed. You will
have to rotate them manually to have them replaced.
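
To decide which instances still need rotating, one option is to compare each
instance's launch configuration with the group's current one. The sketch below
is a minimal, hypothetical helper (`stale_instance_ids` is not part of this
repository). It assumes `group` is one entry of `AutoScalingGroups` as returned
by the AWS `DescribeAutoScalingGroups` API, and it treats an instance with no
`LaunchConfigurationName` as stale.

```python
def stale_instance_ids(group: dict) -> list:
    """Return IDs of instances not launched from the group's current launch config."""
    current = group.get("LaunchConfigurationName")
    return [
        instance["InstanceId"]
        for instance in group.get("Instances", [])
        # A missing name is treated as stale (assumption: its config is gone).
        if instance.get("LaunchConfigurationName") != current
    ]
```

The function is kept free of AWS calls so it can be checked against a saved API
response before anything is terminated.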

### SSL Certificate for Load Balancer
The certificate is referenced by its ARN in AWS. We're not going to put the
private key in Terraform, so this will have to remain an ARN reference.