Abusing the Git version control system as a distributed filesystem.
Posted by peeterjoot on December 16, 2009
I’ve switched from RCS to Git as the VCS for my personal math and physics play LaTeX source. I am currently using github to host all this stuff, and while this means that everybody has access to my sources (even drafts), I don’t really mind too much. Since not many people read even the final versions of this play math, the drafts are likely of even less interest.
Having done this and learned some Git basics, I have also made the somewhat curious choice of using this VCS to host my personal internal scripts at work. I have many of these scripts checked into our clearcase repository, which gives me access on all our development machines, so why would I use Git?
The why is because it is easy. It is a bit of a pain to make quick and hacky changes to things that I have checked into clearcase. I have to open and accept a defect, merge and checkin my branch, and wait for the cronjob for our tools snapshot view to kick in and reload any changes I’ve made.
This inconvenience is enough that I’ve gradually ended up with a hodgepodge of some things in clearcase, some things on local file systems, some on our “deprecated” lab AFS (a distributed filesystem), and some on NFS. I got a bit tired of this, and am now migrating all my private scripts and junk to git uniformly. This gives me the benefit of version control, while retaining the ease of modification that I have with purely local files. Because every copy of a git repository is just as good as any other, I also have no dependency on flaky NFS servers.
Unlike my personal stuff, I am obviously not using a public github hosted repository for work related stuff, but all I need is ssh keys available on my development machines to host things internally in any number of potential locations. From whichever machine I happen to be working on at the time, I can push and pull changes to and from a --bare repository on an arbitrarily elected “repo server”. If that machine goes down, any recently used copy of my repository on some other machine can function as the master (either temporarily or permanently).
Setup a local repository
Getting started is pretty easy. Something like this will do the trick to get yourself an initial repository
$ cd
$ mkdir myjunk
$ cd myjunk
$ git init
You’ve now got a git repo with nothing in it. Supposing you’ve already got a crapload of stuff that you want under version control in directory ~/stuff, do the following
$ cp -a ~/stuff .
$ find stuff | xargs git add
$ git commit -a
The add stages the files, directories or symlinks that you’ve copied into your repository, telling git that you want them in the next commit. Nothing is recorded in the repository history until the ‘git commit’. If your intention is to sync this with a master repository (perhaps a public one like github), then nobody else will see it yet. If you were to, say, lose your hard drive at this point without a backup, then even committed changes are toast because they haven’t been synced up (pushed) with anybody else.
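To make the staged versus committed distinction concrete, here is a small throwaway sketch. The temp directory, file name, and demo identity are all made up for illustration, and it uses ‘git add stuff’ (which adds a directory recursively) rather than the find pipeline.

```shell
# Sketch: staged vs. committed, in a throwaway repository.
# Directory, file name and identity below are hypothetical.
set -eu
demo=$(mktemp -d)
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/master      # use the 'master' branch name from this post
git config user.name  "Demo User"            # local identity so the commit works anywhere
git config user.email "demo@example.invalid"

mkdir stuff
echo 'echo hello' > stuff/hello.sh

git add stuff                                # staged: in the index, not yet in history
staged=$(git diff --cached --name-only)      # lists what the next commit would contain

git commit -q -m "import stuff"              # committed: now part of the local history
committed=$(git log --format=%s -1)
remotes=$(git remote)                        # empty: nothing is synced anywhere else yet
```

At the end, the commit exists only in this one directory, which is the "toast if the disk dies" state described above.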
One of the reasons I like using RCS is that it is really easy. You can get away with just a couple commands (‘co -l’, or ‘ci -l’ and rcsdiff). Git is actually easier. Once you’ve got a git repository directory created, checkout is implicit, so you just have to edit. ‘git commit filename’, or ‘git commit -a’ is the checkout and checkin equivalent, much like a ‘ci -l’ in RCS.
Setup a master repository
If your aim, like mine, is to share stuff across multiple machines, then you’ll want a separate master copy of the repository in addition to the working version you started with. This copy is different in that it houses only the VCS metadata and raw data, and has no visible directory structure. Such a master repository can be created empty with ‘git init --bare’, but we can also create it directly as a copy of the working repository. You’ll want a different directory name, and the ‘--bare’ flag when you create it. This would be something like:
$ cd
$ git clone --bare myjunk .myjunk.git
Initialized empty Git repository in /home/peeterj/.myjunk.git/
A directory listing will show you something like:
$ ls .myjunk.git
branches  config  description  HEAD  hooks  info  objects  packed-refs  refs
The config file and other stuff that lives in the .git directory of a non-bare repository is now in the topmost directory. Having created this, I can now go to my working repository and set this as the master copy to synchronize with:
$ cd ~/myjunk
$ git remote add origin peeterj@machine1:.myjunk.git
$ cat .git/config
[core]
        repositoryformatversion = 0
        filemode = true
        bare = false
        logallrefupdates = true
[remote "origin"]
        url = peeterj@machine1:.myjunk.git
        fetch = +refs/heads/*:refs/remotes/origin/*
$ git push origin master
Everything up-to-date
$ git pull origin master
From machine1:.myjunk
 * branch            master     -> FETCH_HEAD
Already up-to-date.
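The other route mentioned above, starting from an empty ‘git init --bare’ repository and pushing into it, might look something like the following sketch. Everything lives in a throwaway temp directory here, with local paths standing in for the peeterj@machine1: locations, and the demo identity and README file are invented.

```shell
# Sketch: create an empty bare master first, then push an existing
# working repository into it. All paths are throwaway temp directories.
set -eu
top=$(mktemp -d)

# A working repository with one commit, standing in for ~/myjunk.
git init -q "$top/myjunk"
cd "$top/myjunk"
git symbolic-ref HEAD refs/heads/master
git config user.name  "Demo User"
git config user.email "demo@example.invalid"
echo 'notes' > README
git add README
git commit -q -m "first commit"

# The empty bare master, standing in for ~/.myjunk.git on the repo server.
git init -q --bare "$top/.myjunk.git"
git --git-dir="$top/.myjunk.git" symbolic-ref HEAD refs/heads/master

# Point the working repository at it and push.
git remote add origin "$top/.myjunk.git"
git push -q origin master

local_head=$(git rev-parse master)
master_head=$(git --git-dir="$top/.myjunk.git" rev-parse master)
is_bare=$(git --git-dir="$top/.myjunk.git" rev-parse --is-bare-repository)
```

Either route ends in the same place: a bare repository whose master branch matches the working copy.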
Now getting your code on another machine is just another clone call, like
$ git clone peeterj@machine1:.myjunk.git anotherjunk
Initialized empty Git repository in /home/peeterj/anotherjunk/.git/
remote: Counting objects: 4430, done.
remote: Compressing objects: 100% (4124/4124), done.
remote: Total 4430 (delta 1216), reused 0 (delta 0)
Receiving objects: 100% (4430/4430), 8.12 MiB | 3.60 MiB/s, done.
Resolving deltas: 100% (1216/1216), done.
Here you specify the repository created with --bare as the location to copy from. With a user@host:path location like this one, git uses ssh, so have that set up for passwordless login (or be prepared to supply your password on each push and pull).
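If the elected repo server ever goes away, promoting one of the surviving copies could look roughly like this sketch. It is only an illustration under assumptions: local temp directories stand in for the machines, the old-master.git and new-master.git names are invented, and ‘rm -rf’ simulates losing the server.

```shell
# Sketch: recover from a lost master by promoting a surviving working copy.
set -eu
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.invalid"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"
top=$(mktemp -d)
cd "$top"

# Initial state: a working repo, a bare master, and a second working copy.
git init -q myjunk
git -C myjunk symbolic-ref HEAD refs/heads/master
echo 'v1' > myjunk/README
git -C myjunk add README
git -C myjunk commit -q -m "v1"
git clone -q --bare myjunk old-master.git
git -C myjunk remote add origin "$top/old-master.git"
git clone -q old-master.git anotherjunk

# Simulate losing the repo server.
rm -rf old-master.git

# Make a new bare master from a surviving working copy...
git clone -q --bare anotherjunk new-master.git
# ...and repoint each working copy's origin at the new location.
git -C anotherjunk remote set-url origin "$top/new-master.git"
git -C myjunk remote set-url origin "$top/new-master.git"

# Business as usual against the new master.
echo 'v2' > myjunk/README
git -C myjunk commit -q -a -m "v2"
git -C myjunk push -q origin master
git -C anotherjunk pull -q origin master
recovered=$(cat anotherjunk/README)
```

Anything that was pushed to the old master but never pulled to a surviving copy would of course be lost, so the most recently synced copy is the one to promote.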
Some basic commands
Now that a master repository is set up, and working copies are in place on two or more machines, we are set to use it. One of the real powers of any modern VCS is the ability to handle merges and concurrent updates, but if you are using this as a personal distributed file system you can probably avoid any knowledge of how to do this for quite a while. Before making updates on a machine that hasn’t been used for a while, a pull will get you anything you’ve pushed recently. This could look something like
$ git pull origin master
remote: Counting objects: 50, done.
remote: Compressing objects: 100% (43/43), done.
remote: Total 43 (delta 30), reused 0 (delta 0)
Unpacking objects: 100% (43/43), done.
From machine1:.myjunk
 * branch            master     -> FETCH_HEAD
Updating e3c0c9c..ff49140
Fast forward
 bin/README     |    4 +
 bin/cfKiller   |  220 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 bin/cfpool     |    4 +-
 bin/homeclean  |    2 +-
 bin/killca     |    1 +
 bin/updateLoop |    2 +-
 6 files changed, 230 insertions(+), 3 deletions(-)
 create mode 100755 bin/cfKiller
and when you are done working on this machine for the day, or when you want to sync something up for use on a different machine, commit anything outstanding (i.e. checkin), and then push it to the master for a pull from somewhere else.
$ git commit -a
...
".git/COMMIT_EDITMSG" 13L, 333C written
[master 13ec34c] add -noauto
 1 files changed, 65 insertions(+), 1 deletions(-)
 rewrite bin/fm (100%)
$ git push origin master
Counting objects: 7, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 1.21 KiB, done.
Total 4 (delta 2), reused 0 (delta 0)
To peeterj@machine1:.myjunk.git
   ff49140..13ec34c  master -> master
That’s all there is to using git as an ad-hoc distributed file system. It can be used this way like a version controlled rsync.
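The pull, commit, push cycle above is regular enough to wrap in a small helper. The sketch below uses a hypothetical function name, sync_repo, which is not a git command, and demonstrates it on throwaway local repositories. One assumption to flag: it uses ‘git add -A’ so that brand new files are picked up too, which ‘git commit -a’ on its own would not do.

```shell
# Sketch of a hypothetical helper, sync_repo: pull, commit anything
# outstanding, push. Usage: sync_repo <repo-dir> [commit message]
sync_repo() {
    repo=$1
    msg=${2:-"sync from $(uname -n)"}
    (
        cd "$repo" || exit 1
        git pull -q origin master || exit 1
        git add -A                               # stage new, changed and deleted files
        if ! git diff --cached --quiet; then     # commit only when something is staged
            git commit -q -m "$msg" || exit 1
        fi
        git push -q origin master
    )
}

# Demo on throwaway repositories: a bare master and two working copies.
set -eu
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.invalid"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"
top=$(mktemp -d)
git init -q "$top/seed"
git -C "$top/seed" symbolic-ref HEAD refs/heads/master
echo 'my junk' > "$top/seed/README"
git -C "$top/seed" add README
git -C "$top/seed" commit -q -m "seed"
git clone -q --bare "$top/seed" "$top/master.git"
git clone -q "$top/master.git" "$top/a"
git clone -q "$top/master.git" "$top/b"

# Work on "machine" a, sync, then sync on "machine" b.
echo 'echo cleaned' > "$top/a/homeclean"
sync_repo "$top/a" "add homeclean"
sync_repo "$top/b"
synced=$(cat "$top/b/homeclean")
```

If the pull cannot complete cleanly (say, because of a conflict), the helper stops there and leaves the repository for you to sort out by hand.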
Merging and branching … or not.
Since I only use this to synchronize things across my own machines, I don’t have any reason to use the branching or merging facilities. Merging is actually fairly intuitive, though, and I tried introducing a couple of conflicts since I was curious how it was done. If a conflicting change has been pushed, a pull will notify you that a merge is required, and the default merge method appears to leave ‘diff3 -m’ style output in the file to be merged. Edit that, run ‘git add ./path_to_conflicting_file’ to mark it merged, commit the file(s), push, and you are done.
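The conflict experiment is easy to reproduce on throwaway repositories. This is a sketch under assumptions: the repository and file names are invented, local paths stand in for the machines, and the pull is given ‘--no-rebase’ because recent git versions may refuse to pull diverged histories until told how to reconcile them.

```shell
# Sketch: provoke a conflict between two working copies, then resolve it.
set -eu
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.invalid"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"
top=$(mktemp -d)
cd "$top"
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
echo 'original line' > seed/notes.txt
git -C seed add notes.txt
git -C seed commit -q -m "seed"
git clone -q --bare seed master.git
git clone -q master.git a
git clone -q master.git b

# Conflicting edits to the same line on two "machines"; a pushes first.
echo 'edit from a' > a/notes.txt
git -C a commit -q -a -m "edit on a"
git -C a push -q origin master
echo 'edit from b' > b/notes.txt
git -C b commit -q -a -m "edit on b"

# The pull on b now stops with a conflict and a non-zero exit status.
if git -C b pull -q --no-rebase origin master; then
    pull_status=clean
else
    pull_status=conflict
fi
markers=$(grep -c '^<<<<<<<' b/notes.txt)    # conflict markers left in the file

# Resolve by editing the file (here simply overwritten), mark it merged,
# commit, and push.
echo 'edit from a and b' > b/notes.txt
git -C b add notes.txt
git -C b commit -q -m "merge edits from a and b"
git -C b push -q origin master

# The other copy picks up the merged result with an ordinary pull.
git -C a pull -q --no-rebase origin master
merged=$(cat a/notes.txt)
```

In real use the "edit" step means opening the file and choosing between the marked sections by hand, rather than overwriting it as the sketch does.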