I use Continuous Integration (CI) extensively across almost all of my remote Git repositories. These are the typical jobs which it’s used for:
- running tests
- building documentation and
- acquiring data.
This post addresses the last item, acquiring data.
The workflow is typically something like this:
- build project
- run a script (or scripts) to gather data
- do some cleaning or other data preparation
- commit the new data and push it back to the remote repository.
Obviously this approach only works for data of modest size, which can usefully be stored in a Git repository.
In the sections below I dig into the details of the final step in the workflow, adding the data to the repository and pushing it back to the remote.
GitLab
Most of my projects are hosted on GitLab, where I make extensive use of the ability to create project hierarchies using groups and sub-groups. These are generally private repositories.
Create a Personal Access Token
To be able to push back to the remote repository you’ll need to create a Personal Access Token (PAT) with write_repository scope. I will generally create one PAT per project because this gives me more granular control over access (as opposed to having a single PAT which is used across multiple projects).
Under Settings ➤ CI/CD ➤ Variables in the repository add an environment variable, GITLAB_PAT, which contains the PAT.
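Before relying on the token in a job, it can be worth failing early with a clear message if the variable did not reach the runner (for example, a variable marked as protected is only exposed to jobs running on protected branches or tags). A minimal sketch, assuming a POSIX shell; require_var is a hypothetical helper name of my own, not something GitLab provides:

```shell
# Hypothetical helper: succeed only if the named variable is set and non-empty.
require_var() {
  name="$1"
  # Indirect lookup via eval keeps this compatible with plain POSIX sh.
  eval "value=\${$name:-}"
  if [ -z "$value" ]; then
    echo "Required CI/CD variable $name is not set" >&2
    return 1
  fi
}

# Possible usage at the start of a job script:
#   require_var GITLAB_PAT || exit 1
```

The helper only reports the variable's name, so the token itself never appears in the job log.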
Setup CI
Now you need to add or amend a .gitlab-ci.yml file.
If you want to add content directly to the main branch, use something like this:
gather:
  interruptible: false
  only:
    - main
  before_script:
    - git config --global user.email "${GITLAB_USER_EMAIL}"
    - git config --global user.name "${GITLAB_USER_NAME}"
  script:
    - date +"%Y%m%d-%H%M%S" >>main-times.txt
  after_script:
    - git add -f main-times.txt
    - git commit -m "Record date & time [$(date +'%Y-%m-%d %H:%M')]"
    - git push -o ci.skip https://gitlab-ci-token:$GITLAB_PAT@$CI_SERVER_HOST/$CI_PROJECT_PATH.git HEAD:main
💡 If you have a master branch, just substitute master for all occurrences of main.
A couple of notes about what’s going on there:
- Instead of generating a timestamp using the date command you could use the CI_JOB_STARTED_AT environment variable.
- The -o ci.skip option is important because it prevents GitLab from immediately trying to run the CI workflow on the resulting commit.
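The first note could be sketched as follows. GitLab documents CI_JOB_STARTED_AT as a UTC timestamp in ISO 8601 format (for example 2022-01-31T16:47:55Z), so the snippet reshapes it into the same layout that the date command in the job produces. The function name job_timestamp is my own, and the fallback to date is an assumption for runs outside CI where the variable is not defined.

```shell
# Convert an ISO 8601 timestamp such as 2022-01-31T16:47:55Z
# into the YYYYmmdd-HHMMSS layout used in main-times.txt.
job_timestamp() {
  if [ -n "${CI_JOB_STARTED_AT:-}" ]; then
    # Delete ":", "Z" and "-", then turn the "T" separator into "-".
    printf '%s\n' "$CI_JOB_STARTED_AT" | tr -d ':Z-' | tr 'T' '-'
  else
    # Fallback when the variable is not defined (e.g. a local run).
    date -u +"%Y%m%d-%H%M%S"
  fi
}

# Possible usage in the script section:
#   job_timestamp >>main-times.txt
```

One difference to be aware of: this records when the job started rather than when the command ran, and always in UTC.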
It might make more sense to commit the new data onto a separate branch, in which case try something like this:
gather-branch:
  interruptible: false
  variables:
    DATA_BRANCH: collect
  only:
    - main
  before_script:
    - git config --global user.email "${GITLAB_USER_EMAIL}"
    - git config --global user.name "${GITLAB_USER_NAME}"
    - git checkout -B $DATA_BRANCH || true
  script:
    - date +"%Y%m%d-%H%M%S" >>branch-times.txt
  after_script:
    - git add -f branch-times.txt
    - git commit -m "Record date & time [$(date +'%Y-%m-%d %H:%M')]"
    - git push -o ci.skip https://gitlab-ci-token:$GITLAB_PAT@$CI_SERVER_HOST/$CI_PROJECT_PATH.git $DATA_BRANCH
🚨 If you get an authentication failure with the above approach, check that the URL in the git push command uses https rather than http.
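To keep the scheme, token and project path in one place, the push URL could be assembled by a small shell function. This is only a sketch: gitlab_push_url is a hypothetical name, while GITLAB_PAT, CI_SERVER_HOST and CI_PROJECT_PATH are the variables already used in the jobs above.

```shell
# Hypothetical helper: print the authenticated HTTPS push URL for this project.
# The scheme is fixed to https here, so it only has to be right in one place.
gitlab_push_url() {
  printf 'https://gitlab-ci-token:%s@%s/%s.git\n' \
    "$GITLAB_PAT" "$CI_SERVER_HOST" "$CI_PROJECT_PATH"
}

# Possible usage in after_script:
#   git push -o ci.skip "$(gitlab_push_url)" "$DATA_BRANCH"
```

Since the URL embeds the token, it is best passed straight to git push rather than echoed to the job log.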
GitHub
My public repositories are most often hosted on GitHub.
Setup Action
Actions are configured via YAML files in the project’s .github/workflows directory. Create an appropriately named .yml file in that directory and copy the configuration below. This workflow has two jobs:
- gather: commits content to the main (default) branch; and
- gather-branch: commits content to the collect branch.
In practice you’d choose whichever of these approaches suits your needs and delete the configuration for the other.
on:
  workflow_dispatch:
  push:
    branches:
      - main
jobs:
  gather:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Git
        run: |
          git config --local user.email "actions@github.com"
          git config --local user.name "GitHub Actions"
      - name: Update file
        run: date +"%Y%m%d-%H%M%S" >>main-times.txt
      - name: Commit & push data
        run: |
          git add -f main-times.txt
          git commit -m "Record date & time [$(date +'%Y-%m-%d %H:%M')]"
          git push
  gather-branch:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          ref: collect
      - name: Setup Git
        run: |
          git config --local user.email "actions@github.com"
          git config --local user.name "GitHub Actions"
      - name: Update file
        run: date +"%Y%m%d-%H%M%S" >>branch-times.txt
      - name: Commit & push data
        run: |
          git add -f branch-times.txt
          git commit -m "Record date & time [$(date +'%Y-%m-%d %H:%M')]"
          git push
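One thing to watch in real data-gathering jobs: git commit exits with a non-zero status when nothing is staged, which would fail the step on a run where the data has not changed. A minimal sketch of a guard, assuming a POSIX shell; commit_if_changed is a hypothetical helper name, not part of Git or of either CI system:

```shell
# Hypothetical helper: stage the given files and commit only if that
# actually changed the index. An unchanged index is not treated as an error.
commit_if_changed() {
  message="$1"
  shift
  git add -f -- "$@"
  # `git diff --cached --quiet` exits 0 when nothing is staged.
  if git diff --cached --quiet; then
    echo "No changes to commit"
  else
    git commit -q -m "$message"
  fi
}

# Possible usage in a workflow step:
#   commit_if_changed "Record date & time [$(date +'%Y-%m-%d %H:%M')]" main-times.txt
#   git push
```

The timestamp files used here change on every run, so the guard matters mainly once the job gathers data that may be identical between runs.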