JISC/UMF DataFlow Research Data Management Project Software Release

The DataFlow Project at the University of Oxford, sponsored by the JISC/University Modernisation Fund, has completed a beta software release and is looking for test users!

Please see the DataFlow project website at http://www.dataflow.ox.ac.uk/index.php/home/146-software-release

For more information about the JISC/UMF Shared Services and the Cloud Programme see: http://www.jisc.ac.uk/whatwedo/programmes/umf.aspx

For detailed information on the beta software release see the message below from Katherine Fletcher of the DataFlow Project:

Software release!
We've done it!

It is my great pleasure to announce that the DataFlow project has just released the beta versions of DataStage and DataBank. An official launch workshop, where you can learn to use the software, will be held on 2 March in Oxford (register here: http://www.eventbrite.co.uk/event/2804728017). The software is currently available as pre-packaged virtual machines, and DataStage is also available as Debian-packaged files you can install yourself.

A few notes before you dive in:

1)    This is a beta release.
We're confident it will do the job, but it's not perfect yet. Please bear with us. You are warmly invited to use our JIRA issue tracker http://dataflow-jira.bodleian.ox.ac.uk/jira/browse/DATAFLOW to report what you find and suggest improvements (you will need to create a JIRA account, but you do not need further permissions). Log in, then click "create issue" in the top-right corner. Leave blank anything you're not sure about. Or just email Katherine Fletcher (katherine.fletcher[at]zoo.ox.ac.uk).

2)    Back up your data.
You can use live data in this system: no matter what may happen to DataStage or DataBank, if you have "root" access to the machine holding the files, you can always get your data back. But it will be quicker and easier to load a backup. Don't forget to include your new DataStage or DataBank in your usual backup routine.

3)    When it's time to update...
We expect to release the 1.0 version of the code as a standalone installation: users will need to reinstall it as a clean copy. You will have to re-create all the accounts, but the data files themselves will not be lost. DataStage will be able to pick the data files up again and reassign them correctly, so long as the administrator sets up identical account names in the fresh installation. In the long run, we would like to use the Debian packaging to make this smoother, so you would "install" the whole thing but only the changed system files would be updated, leaving the rest untouched. We will keep you posted!

4)    The easiest mistake to make:
If you are using live data, make sure you understand the permission system. If you put data in the system, thinking you'll work out how to restrict access later, you may accidentally expose it. By default, metadata held in DataBank is visible to Google and every other web crawler (although the underlying data is automatically under embargo, and cannot be accessed by outsiders).

5)    Help!  I need a user's guide!
Documentation, like the rest of the system, is a work in progress. We will keep updating the DataStage https://github.com/dataflow/DataStage/wiki and DataBank https://github.com/dataflow/RDFDatabank/wiki documentation wikis. You can get face-to-face help from our development team at the 2 March launch workshop. We'd love to meet you!  In the meantime, please look at (and post to) our mailing list: dataflow-devel[at]googlegroups.com. 

You can download both virtual machines, along with installation readme files, here:

Debian-packaged installation files for DataStage are available here:

All software is targeted at the Ubuntu Linux 11.10 Oneiric Ocelot http://ubuntuguide.org/wiki/Ubuntu_Oneiric operating system, and the VMs work with VMware Fusion 4.x http://www.vmware.com/support/fusion4/doc/releasenotes_fusion_401.html.

P.S. A late-breaking note about creating a mapped drive for DataStage using Windows 7: You must map it to (not, as in the documentation). Please use 'datastage' to map it as a drive. Use 'data' only for HTTP access. We will update the documentation to reflect this.
