Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: Refactor Dockerfile & entrypoint #8923

Open
wants to merge 28 commits into
base: main
Choose a base branch
from
Open

ci: Refactor Dockerfile & entrypoint #8923

wants to merge 28 commits into from

Conversation

upbqdn
Copy link
Member

@upbqdn upbqdn commented Oct 10, 2024

Motivation & Solution

The current Dockerfile and entrypoint.sh files contain a bunch of bugs. This PR contains the following changes:

  • Create a non-privileged system user in the runtime Docker stage and switch to it.
    • Don't specify UID & GID.
    • Don't specify the home dir for the new system user.
  • Don't use gosu.
  • Remove all packages from the runtime stage.
  • Remove some malfunctioning CI tests.
  • Don't use the EXPOSE instruction in Docker.
  • Bump the Rust version in Dockerfile.
  • Change the location of the entrypoint in Docker images.
  • Remove some redundant env vars.
  • Refactor the structure of the entrypoint.
  • Add docs and TODOs to the entrypoint.

Tests

  • Manually test that zebrad rund under the new zebra user:

    Running

    docker build --file docker/Dockerfile --target runtime --tag zebra:local .
    docker run --detach --name zebra_local zebra:local
    docker exec -it -u root zebra_local bash
    apt-get update && apt-get install -y procps
    ps aux

    displays

    USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    zebra          1 86.3  2.7 6605720 2691512 ?     Ssl  09:49  31:03 zebrad -c /etc/zebrad/zebrad.toml
    root         150  0.0  0.0   4188  3368 pts/0    Ss   10:23   0:00 bash
    root         438  0.0  0.0   8088  4044 pts/0    R+   10:25   0:00 ps aux
    

PR Checklist

  • The PR name is suitable for the change log.
  • The solution is tested.
  • The PR has a priority label.

@upbqdn upbqdn added C-bug Category: This is a bug A-devops Area: Pipelines, CI/CD and Dockerfiles C-trivial Category: A trivial change that is not worth mentioning in the CHANGELOG P-Medium ⚡ labels Oct 10, 2024
@upbqdn upbqdn self-assigned this Oct 10, 2024
@upbqdn upbqdn requested a review from a team as a code owner October 10, 2024 10:44
@upbqdn upbqdn requested review from arya2 and removed request for a team October 10, 2024 10:44
@upbqdn upbqdn marked this pull request as draft October 10, 2024 10:45
@upbqdn upbqdn removed the request for review from arya2 October 10, 2024 10:45
@oxarbitrage oxarbitrage added the do-not-merge Tells Mergify not to merge this PR label Oct 10, 2024
@upbqdn upbqdn force-pushed the docker-refactor branch 3 times, most recently from eda21a9 to 32ef2ef Compare October 11, 2024 14:29
The test was using a custom config file set in `test_variables`.
However, the file was not included in the Docker image, and the
entrypoint script created a new, default one under the original file's
path. Zebra then loaded this new file, and the test passed because the
pattern in `grep_patterns` matched Zebra's output containing the
original path, even though the config file was different.
@upbqdn upbqdn removed the do-not-merge Tells Mergify not to merge this PR label Oct 14, 2024
@upbqdn upbqdn marked this pull request as ready for review October 14, 2024 11:34
@gustavovalverde
Copy link
Member

@upbqdn can the motivation be expanded/updated here? For future reference, it might be confusing if someone looks at the PR and understand all these changes were required to fix the use of the root user.

Copy link
Member

@gustavovalverde gustavovalverde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't fully review the entrypoint.sh as some of these requested changes might have a slight impact there. Changes to CI files—unless they're related to the Docker changes—should be in a different PR that depends on this one.

Comment on lines -192 to -205
ARG UID=10001
ENV UID=${UID}
ARG GID=10001
ENV GID=${GID}

RUN addgroup --system --gid ${GID} ${USER} \
&& adduser \
--system \
--disabled-login \
--shell /bin/bash \
--home ${APP_HOME} \
--uid "${UID}" \
--gid "${GID}" \
${USER}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be instances where a user would like to (re-)build the image with a custom UID:GUID, as they might require to mount files from their host, which will be incompatible with the UID:GUID of the container user, and it's also a good Docker practice to specify the UID for the default user for other edge-cases.

Unless there's a good reason to remove this, I'd suggest to keep it.

The PR that added this has some references to the reasoning behind it: #8803 (comment)

Copy link
Member Author

@upbqdn upbqdn Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that we're creating a system user. Those users should have UIDs in [1, 999].

In Debian, the docs for adduser say that the default dynamic range for system UIDs is defined by [FIRST_SYSTEM_UID, LAST_SYSTEM_UID], which is [100, 999] in etc/adduser.conf: https://manpages.debian.org/bullseye/adduser/adduser.8.en.html

Moreover, the Linux spec says that system UIDs in [100, 499] should be reserved for dynamic allocation: https://refspecs.linuxfoundation.org/LSB_5.0.0/LSB-Core-generic/LSB-Core-generic/uidrange.html

I'm surprised the Docker docs don't mention this.

adduser automatically checks if a UID is available before assigning one from the right range. Since I wasn't sure what UID to pick, and no user brought it up, I removed it to avoid scope creep. I'd prefer to address this issue according to the spec once a user brings it up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, this is not clear in Docker's documentation. But some of their tools, docker init for example, output the recommended approach, which is the following (with alpine linux):

ARG UID=10001
RUN adduser \
    --disabled-password \
    --gecos "" \
    --home "/nonexistent" \
    --shell "/sbin/nologin" \
    --no-create-home \
    --uid "${UID}" \
    appuser
USER appuser

In any case, to match their recommendation in debian, this would change to:

RUN addgroup --system --gid ${GID} ${USER} \
    && useradd \
    --uid "${UID}" \
    --no-create-home \
    --home-dir /nonexistent \
    --shell /usr/sbin/nologin \
    --comment "" \
    appuser

They main reasons to keep this instruction are:

  • Host and Container Alignment: When you mount host directories into the container, file permissions are based on UID and GID. If the container's user ID match those on the host, you avoid permission issues.

    • I've experienced this myself when I was testing the docker-compose and mounting a cached state from my host into the container, which had different UID:GID, and the container was not able to write to it.
  • Predictable IDs: By specifying the UID and GID, you ensure that the user inside the container has the same IDs across different builds, deployments, and host systems.

Additionally, this is very common in Docker, and plenty of projects use this approach to deal with these and other use-cases: https://github.com/search?q=%22useradd%22+%22--uid%22++language%3ADockerfile&type=code

Comment on lines -174 to -176
ARG APP_HOME
ENV APP_HOME=${APP_HOME}
WORKDIR ${APP_HOME}
Copy link
Member

@gustavovalverde gustavovalverde Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should set a WORKDIR otherwise the user will end up in the / directory with no permissions, which they might require for testing/personal purposes.

The ARG + ENV combination also allows the user to set a custom directory, in case their host permissions does not allow the one we've chosen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

Copy link
Member Author

@upbqdn upbqdn Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

System users should have no home dir and should not even be able to log in to the machine. Our users should use -u root when they're logging to the machine. The logic I had in mind is very simple and minimal:

  • The whole runtime target has its entrypoint executed under the non-privileged zebra system user which has no home dir and no login (just as in a regular Linux env).
  • The Dockerfile sets up the minimal requirements for the zebra user to execute the entrypoint.
  • When our user wants to do some tweaks, they explicitly use -u root, which they can always do, and which gives them the clarity that they have full privileges.

I wanted to describe this in our docs in a subsequent PR. Is it OK if we do it like that?

Comment on lines -208 to -209
ARG FEATURES
ENV FEATURES=${FEATURES}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the runtime stage is built with custom FEATURES this will be propagated by default as an ENV variable to the entrypoint.sh, if we remove this instructions, then we need to always specify the FEATURES environment variable when running the image, otherwise it will be empty.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but that var is not used in any meaningful way in entrypoint.sh, so I removed it. I would like to refactor the way we configure Zebra inside Docker in follow-up PRs.

Copy link
Member

@gustavovalverde gustavovalverde Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's used here:
https://github.com/ZcashFoundation/zebra/pull/8923/files#diff-4f5cabe26761257a4d685a6edc7a43e0fe0f78762f50eeb48530f2bd3b3ee7caR81
and here:
https://github.com/ZcashFoundation/zebra/pull/8923/files#diff-4f5cabe26761257a4d685a6edc7a43e0fe0f78762f50eeb48530f2bd3b3ee7caR100

From our documentation, if we suggest:

docker build -f ./docker/Dockerfile --target runtime --build-arg FEATURES='default-release-binaries prometheus' --tag local/zebra.mining:latest .

The entrypoint will evaluate the $FEATURES var and it will be empty, as it was never defined as a variable. This would be confusing for the user as they built it with --build-arg FEATURES='default-release-binaries prometheus' but the configuration file is not adding the corresponding section, as that argument was not passed as an Environment variable (at build time) to the container.

Comment on lines -216 to -217
RUN mkdir -p ${ZEBRA_CONF_DIR} && chown ${UID}:${UID} ${ZEBRA_CONF_DIR} \
&& chown ${UID}:${UID} ${APP_HOME}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a HOME directory for the application is a good practice, it also starts the container in an empty directory the users can use as they see fit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

System users should have no home dir. Our users should go with -u root. Another approach would be to execute the entrypoint under root, and then run zebrad under the zebra system user. Our users could then login implicitly as root without -u root, but that adds a bit of extra complexity to the entrypoint, which I wanted to keep simple for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're not allowing the user to log in with the user running Zebra, then we should document the approach they can use to get a bash/sh terminal in the container to troubleshoot in it.

docker/Dockerfile Show resolved Hide resolved
prepare_env_vars

if [[ ! -f "${ZEBRA_CONF_PATH}" ]] && [[ -d "${ZEBRA_CONF_DIR}" ]]; then
ZEBRA_CONF_PATH="${ZEBRA_CONF_DIR}/zebrad.toml"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This default should be set in the Dockerfile, to keep it as the single source of truth for default variables.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which var do you have in mind?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use the var you defined: ZEBRA_CONF_PATH. But have the default value defined in the Dockerfile, as we do here:

ENV ZEBRA_CONF_DIR=${ZEBRA_CONF_DIR}

Comment on lines -154 to -166
# Test reconfiguring the the docker image for tesnet.
test-configuration-file-testnet:
name: Test CI testnet Docker config file
# Make sure Zebra can sync the genesis block on testnet
uses: ./.github/workflows/sub-test-zebra-config.yml
with:
test_id: 'testnet-conf'
docker_image: ${{ vars.GAR_BASE }}/${{ vars.CI_IMAGE_NAME }}@${{ inputs.image_digest }}
grep_patterns: '-e "net.*=.*Test.*estimated progress to chain tip.*Genesis" -e "net.*=.*Test.*estimated progress to chain tip.*BeforeOverwinter"'
# TODO: improve the entrypoint to avoid using `ENTRYPOINT_FEATURES=""`
test_variables: '-e NETWORK -e ZEBRA_CONF_PATH="/etc/zebrad/zebrad.toml" -e ENTRYPOINT_FEATURES=""'
network: 'Testnet'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was created to confirm that any change we do in CI or in Docker doesn't affect the ability to read the proper $NETWORK environment variable. As it had happened before that some changes breaks this behavior, and then the tests are running in Mainnet instead of Testnet, but we realized too late or had to wait for some tests to run to confirm it.

Copy link
Member Author

@upbqdn upbqdn Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is failing with this PR, similarly to the other one. I have a better approach in mind, which I didn't do in this PR. Let's add it back in a subsequent PR?

Moreover, there seem to be some parts that are bugs or hard to understand. For example, I couldn't figure out what -e NETWORK in test_variables is supposed to do. Also, setting ENTRYPOINT_FEATURES to the empty string to enable the test in the entrypoint makes it very hard to follow the execution path in the whole pipeline.

Copy link
Member

@gustavovalverde gustavovalverde Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-e NETWORK tells docker to use whatever value the $NETWORK env variable is set to, to override any default that was set at build time, or in the Dockerfile. This happens here:

env:
NETWORK: '${{ inputs.network }}'
and makes Zebra run with a Testnet configuration, and the test validates that Zebra is correctly running with it.

This has saved me (and others) from making mistakes multiple times while making CI refactors, so this is very important under those circumstances.

I do agree that setting the ENTRYPOINT_FEATURES to an empty string is a dirty hack to make this work, but that's a tech debt that wouldn't justify removing the whole test. In any case, we can remove the use of the ENTRYPOINT_FEATURES variables, while keeping this test behavior.

I'd suggest commenting this and adding a TODO in top of it, or creating an issue, instead of removing the test. Just so we don't forget later on, as this is an important validation.

Comment on lines 166 to 199
# Test that Zebra works using $ZEBRA_CONF_PATH config
test-zebra-conf-path:
name: Test CD custom Docker config file
needs: build
uses: ./.github/workflows/sub-test-zebra-config.yml
with:
test_id: 'custom-conf'
docker_image: ${{ vars.GAR_BASE }}/zebrad@${{ needs.build.outputs.image_digest }}
grep_patterns: '-e "loaded zebrad config.*config_path.*=.*v1.0.0-rc.2.toml"'
test_variables: '-e NETWORK -e ZEBRA_CONF_PATH="zebrad/tests/common/configs/v1.0.0-rc.2.toml"'
network: ${{ inputs.network || vars.ZCASH_NETWORK }}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although very simple, the objective of this test is to confirm that the entrypoint is able to handle a custom configuration path ($ZEBRA_CONF_PATH), run with it and confirm that path is being used.

This could be extended to validate is running with a mounted file using --mount type=bind,source="$(pwd)"/target,target=/app as part of the test_variables, but that's out of scope.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test was using a custom config file set in test_variables.
However, the file was not included in the Docker image, and the
entrypoint script created a new, default one under the original file's
path. Zebra then loaded this new file, and the test passed because the
pattern in grep_patterns matched Zebra's output containing the
original path, even though the config file was different.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test fails in this PR due to the fixes in the entrypoint.

Copy link
Member

@gustavovalverde gustavovalverde Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest commenting this and adding a TODO in top of it, or creating an issue. Just so we don't forget later on.

Either of them I'd suggest indicating something like:

We need to create a test that validates we can mount a configuration file to a different path that the default used by ZEBRA_CONF_PATH, and that Zebra runs using this new file and path.

@upbqdn
Copy link
Member Author

upbqdn commented Oct 17, 2024

@upbqdn can the motivation be expanded/updated here? For future reference, it might be confusing if someone looks at the PR and understand all these changes were required to fix the use of the root user.

I updated the PR description.

@upbqdn
Copy link
Member Author

upbqdn commented Oct 18, 2024

My yml linter updated the formatting of the cd-deploy-nodes-gcp.yml file when merging main. I'm happy to revert those changes if they don't fit.

@arya2 arya2 added the do-not-merge Tells Mergify not to merge this PR label Oct 26, 2024
@mpguerra
Copy link
Contributor

Do we want to do anything further here? Otherwise we should either merge or close without merging

@gustavovalverde
Copy link
Member

gustavovalverde commented Nov 14, 2024

Do we want to do anything further here? Otherwise we should either merge or close without merging

I added some review comments which are pending for a reply, and we should also consider the latest interaction we had with some users in Discord, as some changes related to permission handling, mounting a configuration file, and the use of curl, etc, should be considered so we don't break those use-cases.

@upbqdn
Copy link
Member Author

upbqdn commented Nov 14, 2024

I'm planning to address the comments and get the PR approved.

@mpguerra mpguerra removed the do-not-merge Tells Mergify not to merge this PR label Nov 21, 2024
@arya2 arya2 added do-not-merge Tells Mergify not to merge this PR and removed do-not-merge Tells Mergify not to merge this PR labels Dec 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-devops Area: Pipelines, CI/CD and Dockerfiles C-bug Category: This is a bug C-trivial Category: A trivial change that is not worth mentioning in the CHANGELOG P-Medium ⚡
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants