SCU_upload
==========

.. toctree::
   :maxdepth: 2

   resources.raw_scu_schema

A frequent pattern at Zip is the use of scientific software under outside
development. To keep development simple and modular, it is important to be
able to integrate outputs from a variety of data sources in a sane manner.
This documentation outlines the tools in zcloud that attempt to solve that
problem.

Tools and Resources
===================

:py:mod:`zcloud.generic_scu_transfer` and
:py:mod:`zcloud.console_scripts.scumover` are the primary tools developed to
upload data with a unified tracking format.

Supported SCU Upload Templates
==============================

The following SCU upload templates are supported by the current version of
zcloud:

- RFD_raw_default
- omegafold_raw_default
- MPNN_raw_default

If you wish to add a new template, you must register it in the
``resources/raw_scu_schema`` directory of your zcloud release. The schema must
conform to the meta schema in ``resources/meta_schema/scu_raw.json``.
Documentation on registering a new SCU upload template is available here:
:doc:`SCU Upload Template`

Generic Upload Example
======================

The :py:mod:`zcloud.console_scripts.scumover` module exposes a CLI script,
available in any environment (including a container), for uploading data to
GCP. The syntax is fairly straightforward:

.. code-block:: bash

   scumover SCU_CONFIG_NAME path/to/output \
       --project PROJECT \
       --subproject SUBPROJECT \
       --experiment EXPERIMENT \
       --authors author1,author2 \
       --bucket bucket-name \
       --gcp-project project-name

``SCU_CONFIG_NAME`` is the name of the SCU upload configuration JSON file in
``resources/raw_scu_schema``, and ``path/to/output`` is the path to the output
directory you want to upload. The remaining arguments are the metadata you
want to attach to the upload. Only ``SCU_CONFIG_NAME`` and ``path/to/output``
are required, though it is generally recommended to be explicit about the
upload target as well.

For example, for a default RFD run, the template file is at
``resources/raw_scu_schema/RFD_raw_default.json``. Supposing your project were
"protein-design-1" and your author name were "ascientist", the command to
upload the output of a run to GCP would be:

.. code-block:: bash

   scumover RFD_raw_default path/to/output \
       --project "protein-design-1" \
       --authors ascientist

Under the hood, the script is already configured to discover RFD output at the
target path, index the files by type, tag them with your metadata, and upload
them, together with a manifest, to a unique target location that records the
timestamp and the compute type (RFD) that generated the data. Even if you do
nothing else, a future developer or scientist can later discover this data
programmatically and know its source, author, age, and, roughly, its purpose,
all without any special effort on your part.
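
As a rough illustration of what that programmatic discovery can look like, the
sketch below lists manifests in the upload bucket with the
``google-cloud-storage`` client. It is a minimal sketch, not the actual zcloud
layout: it assumes each upload writes a ``manifest.json`` next to its data and
that the manifest carries ``project`` and ``authors`` fields, which in practice
are determined by the SCU template used for the upload.

.. code-block:: python

   # Minimal sketch: enumerate prior SCU uploads by scanning for manifests.
   # Bucket name, manifest filename, and metadata keys are assumptions made
   # for illustration; consult your SCU template for the real layout.
   import json

   from google.cloud import storage

   client = storage.Client(project="project-name")

   for blob in client.list_blobs("bucket-name"):
       if not blob.name.endswith("manifest.json"):
           continue
       manifest = json.loads(blob.download_as_text())
       # Enough metadata to identify the upload's source, author, and age.
       print(blob.name, manifest.get("project"), manifest.get("authors"), blob.time_created)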