I use Continuous Integration (CI) extensively across almost all of my remote Git repositories. These are the typical jobs which it’s used for:
- running tests
- building documentation and
- acquiring data.
This post addresses the last item, acquiring data.
The workflow is typically something like this:
- build project
- run a script (or scripts) to gather data
- do some cleaning or other data preparation
- commit the new data and push it back to the remote repository.
This approach only works for data of modest size, which can sensibly be stored in a Git repository.
In the sections below I dig into the details of the final step in the workflow, adding the data to the repository and pushing it back to the remote.
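One thing the workflow leaves implicit is what happens when a run gathers nothing new: git commit exits with a non-zero status when nothing is staged, which would fail the job. The sketch below is one way to guard against that. The function name commit_if_changed is hypothetical, and it assumes git is available on the runner and that the user name and email have already been configured.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original workflow):
# stage the given files and commit only if that produced a change.
commit_if_changed() {
  message="$1"
  shift
  # -f mirrors the configurations in this post, which add files even if ignored.
  git add -f -- "$@"
  # "git diff --cached --quiet" exits 0 when the index matches HEAD.
  if git diff --cached --quiet; then
    echo "no changes to commit"
  else
    git commit -q -m "$message"
  fi
}

# Example call inside a CI job (commented out so the sketch has no side effects):
# commit_if_changed "Record date & time" main-times.txt
```

With a guard like this the push step can still run unconditionally, since pushing when there are no new commits is harmless.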
GitLab
Most of my projects are hosted on GitLab, where I make extensive use of the ability to create project hierarchies using groups and sub-groups. These are generally private repositories.
Create a Personal Access Token
To be able to push back to the remote repository you’ll need to create a Personal Access Token (PAT) with write_repository scope. I will generally create one PAT per project because this gives me more granular control over access (as opposed to having a single PAT which is used across multiple projects).
Under Settings ➤ CI/CD ➤ Variables in the repository add an environment variable, GITLAB_PAT, which contains the PAT.
🚨 A token is also created automatically for each job and is available via the CI_JOB_TOKEN environment variable. To push with that token, check Allow Git push requests to the repository under Settings ➤ CI/CD ➤ Job token permissions.
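Either token ends up embedded in the HTTPS URL used for the push. The sketch below shows how that URL might be assembled from the PAT when it is set, falling back to the job token otherwise. The helper name push_url and the placeholder host and project path are hypothetical. The user name oauth2 is also an assumption: my understanding is that GitLab accepts any non-empty user name alongside a PAT.

```shell
#!/bin/sh
# Hypothetical helper: build an HTTPS remote URL with embedded credentials.
push_url() {
  # $1 = user name, $2 = token, $3 = host, $4 = project path (group/project)
  printf 'https://%s:%s@%s/%s.git\n' "$1" "$2" "$3" "$4"
}

# In a job the real values come from CI variables; these defaults are placeholders.
host="${CI_SERVER_HOST:-gitlab.example.com}"
project="${CI_PROJECT_PATH:-group/project}"

if [ -n "${GITLAB_PAT:-}" ]; then
  # Assumption: any non-empty user name works with a PAT.
  remote="$(push_url oauth2 "$GITLAB_PAT" "$host" "$project")"
else
  remote="$(push_url gitlab-ci-token "${CI_JOB_TOKEN:-}" "$host" "$project")"
fi

# The URL contains a secret, so avoid echoing it in job logs.
# git push -o ci.skip "$remote" HEAD:main
```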
Set Up CI
Now you need to add or amend a .gitlab-ci.yml file.
If you want to add content directly to the main branch, use something like this:
gather:
  interruptible: false
  only:
    - main
  before_script:
    - git config --global user.email "${GITLAB_USER_EMAIL}"
    - git config --global user.name "${GITLAB_USER_NAME}"
  script:
    - date +"%Y%m%d-%H%M%S" >>main-times.txt
  after_script:
    - git add -f main-times.txt
    - git commit -m "Record date & time [$(date +'%Y-%m-%d %H:%M')]"
    # The job token has to be expanded as a variable, hence ${CI_JOB_TOKEN}.
    - git push -o ci.skip https://gitlab-ci-token:${CI_JOB_TOKEN}@$CI_SERVER_HOST/$CI_PROJECT_PATH.git HEAD:main
💡 If you have a master branch, just substitute master for all occurrences of main.
A couple of notes about what’s going on there:
- Instead of generating a timestamp using the date command you could use the CI_JOB_STARTED_AT environment variable.
- The -o ci.skip option is important because it prevents GitLab from immediately trying to run the CI workflow on the resulting commit.
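As a sketch of the first note: CI_JOB_STARTED_AT is, as far as I know, an ISO 8601 string such as 2024-01-02T03:04:05Z, so it needs reshaping to match the %Y%m%d-%H%M%S format used above. The helper name stamp_from_iso is hypothetical, and the fallback timestamp is only there so that the snippet also runs outside a pipeline.

```shell
#!/bin/sh
# Hypothetical helper: turn 2024-01-02T03:04:05Z into 20240102-030405.
stamp_from_iso() {
  # Split at the "T": the date part loses its "-", the time part its ":" and "Z".
  d=$(printf '%s' "${1%%T*}" | tr -d '-')
  t=$(printf '%s' "${1#*T}" | tr -d ':Z')
  printf '%s-%s\n' "$d" "$t"
}

# Outside a pipeline the variable is unset, so fall back to a fixed example value.
stamp_from_iso "${CI_JOB_STARTED_AT:-2024-01-02T03:04:05Z}"
```

Note that the resulting timestamp is in UTC, whereas date reports the runner's local time.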
It might make more sense to commit the new data onto a separate branch, in which case try something like this:
gather-branch:
  interruptible: false
  variables:
    DATA_BRANCH: collect
  only:
    - main
  before_script:
    - git config --global user.email "${GITLAB_USER_EMAIL}"
    - git config --global user.name "${GITLAB_USER_NAME}"
    - git checkout -B $DATA_BRANCH || true
  script:
    - date +"%Y%m%d-%H%M%S" >>branch-times.txt
  after_script:
    - git add -f branch-times.txt
    - git commit -m "Record date & time [$(date +'%Y-%m-%d %H:%M')]"
    # The job token has to be expanded as a variable, hence ${CI_JOB_TOKEN}.
    - git push -o ci.skip https://gitlab-ci-token:${CI_JOB_TOKEN}@$CI_SERVER_HOST/$CI_PROJECT_PATH.git $DATA_BRANCH
🚨 If you get an authentication failure with the above approach then check that the URL in the git push command uses https rather than http.
GitHub
My public repositories are most often hosted on GitHub.
Set Up Action
Actions are configured via YAML files in the project’s .github/workflows directory. Create an appropriately named .yml file in that directory and copy the configuration below. This workflow has two jobs:
- gather: commits content to the main (default) branch; and
- gather-branch: commits content to the collect branch.
In practice you’d choose whichever of these approaches suits your needs and delete the configuration for the other.
📢 Update: Setting the job’s contents permission to write seems to be important.
on:
  workflow_dispatch:
  push:
    branches:
      - main
jobs:
  gather:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - uses: actions/checkout@v3
      - name: Setup Git
        run: |
          git config --local user.email "actions@github.com"
          git config --local user.name "GitHub Actions"
      - name: Update file
        run: date +"%Y%m%d-%H%M%S" >>main-times.txt
      - name: Commit & push data
        run: |
          git add -f main-times.txt
          git commit -m "Record date & time [$(date +'%Y-%m-%d %H:%M')]"
          git push
  gather-branch:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - uses: actions/checkout@v3
        with:
          ref: collect
      - name: Setup Git
        run: |
          git config --local user.email "actions@github.com"
          git config --local user.name "GitHub Actions"
      - name: Update file
        run: date +"%Y%m%d-%H%M%S" >>branch-times.txt
      - name: Commit & push data
        run: |
          git add -f branch-times.txt
          git commit -m "Record date & time [$(date +'%Y-%m-%d %H:%M')]"
          git push