---
title: GitLab Environment Deploys
description: How to use GitLab to deploy your sandbox IdP environments
layout: article
category: Platform
cSpell: ignore Vogon
---
Login.gov needs automation. We have finally acquired GitLab and are able to use it for deploying IdP environments! This document will tell you how the environment deploy process works, how to set up your environment so that it is deployed by gitlab (skip ahead to "Adding new application sandbox environments" if you are impatient and just want to get it going), and give some operational tips on how to debug issues.
As code gets merged into `main`, a pipeline will build `env_deploy` and `env_test` docker images, save them to ECR, and tag them with `latest` and the git SHA of the code that was used to create the image. This image will be available for us to hardcode into our environments for deploys/tests.
As you check in code to your `stages/<environment>` branch, GitLab will notice and kick off a pipeline. It will run a deploy job that is tagged for `<environment>-env-runner`. The `env_deploy` image contains all of the terraform binaries/providers/etc. that it needs to do a self-contained deployment, so we no longer need to rely on a terraform bundle like we did with auto-tf, and we do not require internet access to run our tooling at runtime, which adds to security and reliability.
`<environment>-env-runner` is a special runner that runs inside of your environment and is empowered with all the terraform powers that we use for creating infrastructure. It can only run a hardcoded `env_deploy` image, which runs terraform on the code that gitlab supplies. The runner is also constrained from accessing outside resources through security groups and a dedicated `env_runner` proxy.
For forensic purposes, the terraform plan and plan output are stored as an artifact in S3, and can be retrieved through the gitlab UI or directly in S3 if you know the build number.
After that, tests are run against your environment using the `env_test` image on the env_runner. The test results are translated to JUnit output and uploaded to gitlab as an artifact so that test results are available in the pipeline.
We hope to tie these pipelines together so that when new code is introduced on certain branches, it will be deployed to a staging or CI environment, tested, and, if successful, deployed on to the final environment.
Ideally, if we merge code to main, it should automatically roll out after being properly tested. Right now, our tests are a bit too minimal for that to happen, but as we add to them, our confidence will become high enough to start doing this more and more.
This is a tremendously powerful tool. Once fully deployed, this system will be able to change/delete pretty much everything in the whole Login.gov world. As such, we need to be SUPER careful about the tools used here, how they are used, etc. This system should be as simple and self-contained as possible, with as few permissions granted as we can make it.
Once this tool is working well and people trust it, we should be able to start removing IAM permissions from most people, which will improve our security posture.
The IAM permissions and the role that is assumed when running terraform can be found in `terraform/modules/gitlab_runner_pool/iam-role.tf`, though truthfully, the most important ones attached to the role are the AutoTerraform policies, which can be found in `terraform/all/module/iam-role-terraform.tf`.
These powers are pretty extensive already, but try not to expand them unless you really need to.
Since Terraform requires a lot of AWS endpoints, we set up as many of them as possible as VPC endpoints that we explicitly grant access to with security groups. But not all AWS services can be put behind VPC endpoints, so we had to plug in an outbound proxy which allows a limited set of services that the deploy process can access. This list can be expanded if we need to, but try not to do this. Other than these explicit allows, the runner is denied access to everything with security groups.
To limit runtime external connectivity requirements, we build the image that runs terraform to have all required tooling and plugins on it already. This should insulate us from supply chain attacks and from external services being down.
We are hardcoding what exact container image the runner is able to run, as well as disabling the ability for gitlab to override the entrypoint of the container, so that somebody who compromises gitlab will be unable to run arbitrary jobs on the env runners. They will only be able to run terraform in our constrained way.
We are also disabling services and root access for docker, as well as following the standard docker hardening procedures. There are also checks to make sure that you aren't trying to be tricky and do a deploy to the wrong environment.
You can find the pipelines if you go to https://gitlab.login.gov/lg/identity-devops/-/pipelines.
It's pretty easy to look at these things and click on cancel/retry/etc., but there are lots of docs on the gitlab site if you want more detail.
The easiest way to see what is going on is to click into the pipeline you are interested in and then click into the particular stage. You should be able to see the logs of the particular step in a web UI.
The terraform binary is configured in `identity-devops/dockerfiles/env_deploy.Dockerfile`. Edit `TF_VERSION` and `TF_SHA256` to change what version it is. You can find the versions and SHA256es of these binaries by browsing around under https://releases.hashicorp.com/terraform/.
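As a rough sketch of what you are editing (whether these are declared with `ARG` or `ENV`, and the actual values, may differ, so check the real Dockerfile before changing anything), the relevant lines look something like this:

```dockerfile
# Hypothetical excerpt from dockerfiles/env_deploy.Dockerfile -- the version and
# checksum below are placeholders, not real values.
ARG TF_VERSION=1.3.9
ARG TF_SHA256=0000000000000000000000000000000000000000000000000000000000000000
```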
You will probably want to test your stuff out first, so make a branch and make your changes to the Dockerfile(s) and `env_deploy.sh` scripts and so on in that branch, then add your branch to the `only:` section of the `build_env_deploy` job in the `.gitlab-ci.yml` on your branch. Once you check it in, your branch will build the image for your branch.
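For example, the branch-only change might look roughly like this (the job's existing `only:` entries here are illustrative, not copied from the real file):

```yaml
# Hypothetical .gitlab-ci.yml excerpt: also build the env_deploy image for your branch.
build_env_deploy:
  only:
    - main
    - my-image-test-branch   # <- add your branch here
```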
There are two ways to test this out in your environment:
- The quick-n-dirty way. Log into your env_runner instance in your env and edit
  `/etc/gitlab-runner/config.toml`. Comment out the `allowed_images` line. Restart
  gitlab-runner with `systemctl restart gitlab-runner` (see the shell sketch just
  after this list). Your runner will now allow all images to run on it, which is
  not secure, but it'll increase your iteration speed because you don't need to
  recycle your env_runner for every new image you create.

  After that, change the branch that your env deploys from so that it uses the new
  image in your `.gitlab-ci.yml` to get it to run a deploy and test the new
  functionality out.

  When your new image works and a build completes, your env_runner will be
  recycled, so it will return to its locked-down state. You can keep it around by
  commenting out the `post_deploy_<env>` job in `.gitlab-ci.yml` and keep cycling
  if you need to, but BE SURE TO REVERT THIS WHEN DONE.

- The end-to-end way (slow). This is the normal way that systems get set up, so
  you should probably do this before you do your final PR that has your image in
  it, to make sure that it works end-to-end. Copy
  `s3://login-gov.secrets.<account>-us-west-2/common/gitlab_env_runner_allowed_images`
  down and either add or replace the image in that file with your new one. You can
  get the proper digest from looking at the end of the image build log, or from
  the ECR console for the image that just got built. Once you have the new image
  in there, copy that file up to
  `s3://login-gov.secrets.<account>-us-west-2/<env>/gitlab_env_runner_allowed_images`,
  where `<env>` is your environment. Then, go to your environment and recycle the
  env_runner. It should come up ready to let your new image run.

  After that, change the branch that your env deploys from so that it uses the new
  image in your `.gitlab-ci.yml` to get it to run a deploy and test the new
  functionality out.
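Here is a sketch of the quick-n-dirty way as shell commands, run on the env_runner instance itself (the sed pattern assumes the `allowed_images` line in `/etc/gitlab-runner/config.toml` is not already commented out):

```sh
# Comment out the allowed_images restriction and restart the runner.
sudo sed -i 's/^\(\s*allowed_images\)/#\1/' /etc/gitlab-runner/config.toml
sudo systemctl restart gitlab-runner
```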
Once you've verified it works, you should add the new image to the `s3://login-gov.secrets.<account>-us-west-2/common/gitlab_env_runner_allowed_images` file and maybe remove some old ones, but not if they are being used by any of the environments. This will let other environments know that it is an approved image once they relaunch their env_runners. You might also want to remove your `s3://login-gov.secrets.<account>-us-west-2/<env>/gitlab_env_runner_allowed_images` file, so your env_runner gets its stuff from the common secret again.
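Roughly, the S3 shuffling for the end-to-end way looks like this from a workstation (account and env names are placeholders, as elsewhere in this doc):

```sh
# Pull down the common allowed-images file, edit it, and push it up to your env.
aws s3 cp s3://login-gov.secrets.<account>-us-west-2/common/gitlab_env_runner_allowed_images .
# ...add or replace the image digest in the file...
aws s3 cp gitlab_env_runner_allowed_images \
  s3://login-gov.secrets.<account>-us-west-2/<env>/gitlab_env_runner_allowed_images
```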
I know, this is a PITA, but this would be SO much easier if we didn't have to lock down our gitlab runners. They have a crazy amount of power, so we have to make sure that arbitrary jobs/code cannot be run on them.
If you want to add more images to be allowed to run by the env_runner, you can edit the file in:

- `s3://login-gov.secrets.<account>-us-west-2/<env>/gitlab_env_runner_allowed_images` if you want to override what images are allowed in your own environment.
- `s3://login-gov.secrets.<account>-us-west-2/common/gitlab_env_runner_allowed_images` for what everybody who does not have an env-specific file is allowed to run.
Be sure to push this updated file to both the tooling-prod and tooling-sandbox common secrets buckets once it is ready for general usage.
There is a particular format that the file should be in. It looks something like this:

```
000000000000.dkr.ecr.us-west-2.amazonaws.com/cd/XXXXXdate_separatorXXXXX@sha256:ffffffffffffffffffffffffffffffffffffffffffffffffffffffffff061023
<tooling-prod-account>.dkr.ecr.us-west-2.amazonaws.com/cd/env_deploy@sha256:feedface1234...
<tooling-prod-account>.dkr.ecr.us-west-2.amazonaws.com/cd/env_test@sha256:feedface1234...
<tooling-prod-account>.dkr.ecr.us-west-2.amazonaws.com/cd/etc@sha256:feedface1234...
000000000000.dkr.ecr.us-west-2.amazonaws.com/cd/XXXXXdate_separatorXXXXX@sha256:ffffffffffffffffffffffffffffffffffffffffffffffffffffffffff071023
<tooling-prod-account>.dkr.ecr.us-west-2.amazonaws.com/cd/env_deploy@sha256:feedface1234...
<tooling-prod-account>.dkr.ecr.us-west-2.amazonaws.com/cd/env_test@sha256:feedface1234...
<tooling-prod-account>.dkr.ecr.us-west-2.amazonaws.com/cd/etc@sha256:feedface1234...
```
Add your new block of images at the end, after a `date_separator` line with the date at the end of the line. Remove the topmost block. Unfortunately, we can't just have one set of images, because some clusters may not be updated with the latest and greatest, so we need to keep some of the older ones around.
Also: Be aware that the ugly `date_separator` line will go away in the future, replaced with a comment line that will probably have the date in it, as well as possibly other info. We just need the code to get pushed out to all the environments to support this before we can switch over.
You can add plain `repo/name:tag` image names, but we strongly recommend against that, since these might be subverted. It would be much better to use a specific image, like `repo/name@sha256:feedface42...`. You can get the sha256 digest from the build process or from looking at ECR.
Then, recycle the env_runners that you want to pick up the change. They should now only allow the images in the file to run on them, and error out for anything else. You will probably also want to update the various hashes in the `.gitlab-ci.yml` too so that they get used.
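As a hedged example, pinning a job (or the shared template it uses) to an approved digest in `.gitlab-ci.yml` looks roughly like this; the exact job name and where the `image:` key lives may differ in the real file:

```yaml
# Hypothetical excerpt: run the deploy with an approved image digest.
deploy_tspencer:
  image: <tooling-prod-account>.dkr.ecr.us-west-2.amazonaws.com/cd/env_deploy@sha256:feedface1234...
```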
Note: You probably don't want to do this, as env_runners should basically only be doing deploys and terratests. They shouldn't require postgres or other services like other CI jobs do. But just in case, here is how.
New services can be added to the env_runner by editing https://github.com/18F/identity-devops/blob/main/kitchen/cookbooks/identity-gitlab/recipes/runner.rb.
Look for where `node.run_state['allowed_services']` is set. An empty list (`[]`) means allow all services. A list with an empty string (`[""]`) means allow nothing, which is what it ought to be set to. You can add more images to that list.
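A rough sketch of what that looks like in the recipe (the surrounding code and the example service image are illustrative, not copied from the real file):

```ruby
# Hypothetical excerpt from kitchen/cookbooks/identity-gitlab/recipes/runner.rb.
# [""] allows nothing (the default), [] allows everything; list images to allow them.
node.run_state['allowed_services'] = ['postgres:13']
```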
To enable your sandbox:
- Create a new branch from main. The convention is `stages/<env>`, even though it
  technically could be anything. (A branch-creation sketch follows this list.)
- Make sure that `identity-devops-private/vars/<env>.tfvars` has these variables:

  ```
  gitlab_enabled = true
  gitlab_runner_enabled = true
  ```

  Make sure they are set to `true`. You should have them by default, but if not,
  copy them from `vars/tspencer.tfvars`.
- Terraform your environment. It should set all the `privatelink` stuff up and
  launch the env_runner.
- Edit `identity-devops/.gitlab-ci.yml` and copy one of the sets of deploy jobs
  that are out there. You will need at least the deploy and the test jobs
  (`deploy_tspencer` and `test_tspencer`, for example), but there are others too.
  You should copy them all. Please be sure to rename them and edit all the fields
  that have env-specific stuff in them to your env, like changing all `tspencer`s
  to `<yourenv>`.
- Push changes to your branch to trigger gitlab to run your deploy pipeline to your env!
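The branch creation in the first step is just ordinary git; a sketch, assuming your env is named `myenv`:

```sh
git fetch origin
git checkout -b stages/myenv origin/main
git push -u origin stages/myenv
```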
Follow the steps in the "Adding new sandbox environments" section above, but also create a PR off of main that removes your sandbox from auto-tf, probably in `identity-devops/terraform/tooling/tooling/app_sandboxes.yml`.

Once that PR is merged to main, auto-tf should auto-tf itself and remove your auto-tf pipeline.
Tests are awesome! Terratest is cool! Let's use Terratest to test stuff!
If there is a `tests` dir in the identity-devops directory that the pipeline executes, the pipeline will cd into it and run `./test.sh`. So tests will automatically be run after every deployment.
Technically, you can run any commands in `test.sh`, but that script should probably mostly just be used to set up environment variables and give different arguments to the main testing software, which is Terratest.
You can run tests locally with `go test -v` while you are in `identity-devops/tests`.
If your tests rely on AWS credentials, you can use aws-vault to run the tests.
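For example (a sketch; the profile name follows the `sandbox-admin` profile used elsewhere in this doc and may differ for you):

```sh
cd identity-devops/tests
aws-vault exec sandbox-admin -- go test -v
```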
We want our tests to be fully contained and not require external resources to build and run. To that end, we are vendoring the golang dependencies in the `tests` dir.
To update that, just say `go mod vendor`, and it should update the stuff in vendor and let you check it in. All of the normal go module stuff should work as well, like `go get -u` and so on.
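A typical dependency update, sketched out (exactly which files end up changing may vary):

```sh
cd identity-devops/tests
go get -u ./...   # optionally bump dependency versions
go mod vendor     # refresh the vendored copies
git add go.mod go.sum vendor
```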
You can turn off all pipelines in a repo with this: https://docs.gitlab.com/ee/ci/enable_or_disable_ci.html We probably don't want to do that.
You can turn off the pipeline for a branch by checking in a `.gitlab-ci.yml` file in that branch which has `when: manual` in it for the jobs that would normally run, or by making some other change so that nothing matches when gitlab looks for stuff that should run, like commenting the jobs out or whatever.
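A hedged sketch of such a branch-level override, modeled on the deploy job examples later in this doc (job and tag names are illustrative):

```yaml
# On the branch whose pipeline you want to quiet down, make the job manual so it
# only runs when triggered by hand.
deploy_tspencer:
  <<: *deploy_template
  when: manual
  tags:
    - tspencer-env-runner
  only:
    - stages/tspencer
```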
If you are working on terraform-only changes, you might want to only deploy the terraform half of the changes, and not recycle the environment. Or you might want to only recycle one type of host instead of everything. We can do that!
There are two variables you can set to change the behavior of gitlab deploys in the deploy job for your env in the `.gitlab-ci.yml` file:
- `DEPLOY_TERRAFORM_ONLY`: If you set this to anything, it will only terraform your env, and not recycle anything. Tests and so on will still run. Here is an example of how you would use it:

  ```yaml
  deploy_tspencer:
    <<: *deploy_template
    variables:
      GIT_SUBMODULE_STRATEGY: normal
      DEPLOY_TERRAFORM_ONLY: "true"
    environment:
      name: tspencer
      on_stop: stop_tspencer
      url: https://idp.tspencer.identitysandbox.gov/
    resource_group: tspencer
    tags:
      - tspencer-env-runner
    only:
      - stages/tspencer
  ```
- `RECYCLE_ONLY_THESE_ASGS`: This should be set to a space-separated list of the ASGs that you want to recycle after doing a terraform run. For example:

  ```yaml
  deploy_tspencer:
    <<: *deploy_template
    variables:
      GIT_SUBMODULE_STRATEGY: normal
      RECYCLE_ONLY_THESE_ASGS: "tspencer-idp tspencer-worker"
    environment:
      name: tspencer
      on_stop: stop_tspencer
      url: https://idp.tspencer.identitysandbox.gov/
    resource_group: tspencer
    tags:
      - tspencer-env-runner
    only:
      - stages/tspencer
  ```
When you are done doing your targeted testing, it is recommended that you delete the entire variables section so that if the variables in the template are updated, you get the benefits of the templated variables.
Environment runners are funny in that they are sort of not a part of gitlab, but kinda are too, because they need to register and talk with gitlab. They are also a bit of a pain to use (as documented above) because they are locked down with their own proxy and only allow certain images to run. Thus, they deserve some special discussion.
- Do a deploy to your environment using gitlab. If the `deploy_<env>` or the
  `test_<env>` job hangs and eventually times out, your change did not allow the
  env_runner to come up.
- Take a look at the list of runners under the admin menu. If there is no
  `<env>-env-runner` runner alive, then it did not come up after recycling.
- Look at the ASG in the AWS console. If it keeps failing to launch, it's sad. Be
  aware that the mere existence of runners launching doesn't mean that it's
  working. They have to fully launch without errors.
The env runner has to be able to complete these tasks to come up properly:
- Get the runner registration token from the `gitlab-config` s3 bucket. This
  requires the `gitlab_config_s3_bucket` tag on the instance to be set to the
  proper gitlab config bucket, network access to the s3 endpoint, and permissions
  to read objects from it.
- Be able to talk to the proxy. It needs network access to the proxy on the
  `http://obproxy-env-runner.login.gov.internal:3128` url.
- Be able to query the gitlab endpoint through the proxy. The proxy must allow it
  in the `/etc/squid/domain-allowlist.conf` file, the endpoint needs to be set up
  and connected, and it needs network access to that endpoint. The gitlab endpoint
  is `login-gitlab-<env>`, which is usually accessed over the
  https://gitlab.login.gov/ URL, if you haven't configured your cluster to go to
  bravo or one of the other envs.
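A quick way to sanity-check the last two items from an ssm session on the env_runner (a sketch; the proxy and gitlab URLs are the ones named above):

```sh
# Can we reach gitlab through the outbound proxy? Expect an HTTP status line back.
curl -sSI -x http://obproxy-env-runner.login.gov.internal:3128 https://gitlab.login.gov/ | head -1
```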
So here's what you can do to debug those things:
- Look at the logs as it is booting. The fastest way to do this is to ssm into the
  host and tail the `/var/log/cloud-init-output.log` file. I use a command like
  this to get in so that I'm there early in the boot process before it fails:

  ```sh
  until aws-vault exec sandbox-admin -- bin/ssm-instance --newest asg-tspencer-gitlab-env-runner ; do sleep 2 ; done ; stty sane
  ```

  You can also probably get these logs through cloudwatch, but it takes a long
  time for them to show up there, so I never do that.
- If it hangs before chef gets going, then it's having problems getting to the
  proxy, or maybe general network issues. Look at the network connectivity
  (security groups, routing, etc.) between the env_runner and the proxy.
- If it hangs near the start of the identity-devops chef run, then it's probably
  trying to get the runner token from s3, and it is probably timing out trying to
  get to the s3 endpoint. Look at the network config or the s3 endpoint config.
  You can use `aws s3 ls` and other commands to diagnose this a bit.
- If it hangs when trying to register the runner, then it's probably having
  problems with network connectivity, either between the env_runner and the proxy
  or between the proxy and the gitlab endpoint.
- If it fails while registering, `^C` quickly and run
  `cat /etc/gitlab-runner/runner-register.sh ; /etc/gitlab-runner/runner-register.sh`.

  Look at the runner token in the registration command and make sure it's actually
  a real string, not empty. If it is empty, then there's something wrong with the
  s3 bucket, like the env_runner doesn't have permissions to read the runner
  token, or maybe the file somehow got zeroed out.

  Also look at the gitlab url that it's trying to register to and make sure it's
  proper.

  Be aware that the host will register its failure shortly and thus the ASG will
  nuke the instance, leaving your ssm session hanging. That's why you do the `^C`
  fast. :-)

- If the token is not empty, then look at what is going on when you run the
  registration script. If it's hanging or timing out, then look at the network
  config between the proxy and the gitlab VPC endpoint. If it's giving back a 503,
  then look at the proxy and make sure it has the gitlab url allowed, and that the
  proxy can actually connect to it.
- If everything is set up properly, make sure that the VPC endpoint is working and
  wired up to the proper gitlab. The easiest way to do that is through the AWS VPC
  console. It should plug a magic hostname that redirects to the VPC endpoint into
  the private dns zone for the cluster. It should auto-register everything, but
  maybe there can be some problems?
- Or maybe the gitlab server is messed up! Check it out and make sure it's happy.
  You can look at the logs in /var/log/gitlab and look for the register API call;
  they might give you some indication of the problem, like using the wrong token,
  or "I am out of disk space", or "help, I am reading Vogon poetry", or whatever.
  The admin UI might be useful? I don't know because I've not had problems like
  this, but it could certainly happen.
Have fun!