Creating an SCU Upload Template
===============================

This documentation provides guidelines for creating an SCU upload template JSON file that adheres to the schema defined in `scu_raw.json`. The template is used by the `generic_scu_transfer` module to discover and upload files generated by various scientific computations.

SCU Upload Template Structure
=============================

An SCU upload template JSON file must include the following top-level properties:

- **name** (required): The name of the schema.
- **schema_version** (required): The version of the schema.
- **description**: A brief description of the schema.
- **compute_type** (required): The name of the application that represents this SCU.
- **discovery_rules** (required): A set of rules to discover files.
- **elements** (required): A list of elements representing different types of files.

Discovery Rules
===============

Discovery rules are used to find files on the filesystem based on specific criteria. Each rule must have the following properties:

- **name** (required): A unique name for the rule.
- **type** (required): The type of rule. Supported types are `pattern` and `paired_group`.
- **description**: A brief description of the rule. Default is an empty string.
- **criteria** (required): A dictionary containing the criteria for the rule.

Pattern Rule Criteria
---------------------

For rules of type `pattern`, the criteria dictionary can include the following parameters:

- **pattern** (required): A glob pattern to match files.
- **required**: A boolean indicating whether finding files is required. Default is `false`.
- **max_depth**: The maximum directory depth to search. Default is `None`.
- **min_depth**: The minimum directory depth to search. Default is `None`.
- **num_files**: The number of files expected to be found. Default is `None`.

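
The defaults listed above can be made concrete with a small helper. The following is only a sketch of how a `pattern` criteria dictionary could be filled in with those defaults; the names ``PATTERN_DEFAULTS`` and ``normalize_pattern_criteria`` are hypothetical and are not part of zcloud, and the actual loader may handle defaults differently.

.. code-block:: python

    from typing import Any

    # Defaults copied from the parameter list above (JSON false/null become False/None).
    # Hypothetical name, not a zcloud constant.
    PATTERN_DEFAULTS = {
        "required": False,
        "max_depth": None,
        "min_depth": None,
        "num_files": None,
    }


    def normalize_pattern_criteria(criteria: dict[str, Any]) -> dict[str, Any]:
        """Return a copy of ``criteria`` with the documented defaults filled in."""
        # "pattern" is the only parameter documented as required.
        if "pattern" not in criteria:
            raise ValueError("pattern rule criteria must include 'pattern'")
        # Values given in the template override the defaults.
        return {**PATTERN_DEFAULTS, **criteria}


    normalized = normalize_pattern_criteria({"pattern": "*.pdb", "required": True})
    print(normalized)

Here only ``pattern`` and ``required`` are supplied, so the three remaining parameters take their documented default of ``None``.
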

Paired Group Rule Criteria
--------------------------

For rules of type `paired_group`, the criteria dictionary can include the following parameters:

- **group_by** (required): A string indicating how to group files. Currently, only `basename` is supported.
- **pattern**: A glob pattern to match files. Default is an empty string.
- **extensions** (required): A list of file extensions to examine for pairing.
- **required**: A boolean indicating whether finding files is required.
- **max_depth**: The maximum directory depth to search.
- **min_depth**: The minimum directory depth to search.

Elements
========

Elements represent different types of files and must include the following properties:

- **name**: The name of the element.
- **description**: A brief description of the element.
- **type**: The type of files indexed by the element.
- **discovery_rules**: A list of discovery rule names to apply to this element.

Registering a New SCU Upload Template
=====================================

To register a new SCU upload template, create a JSON file in the `resources/raw_scu_schema` directory of your zcloud release. The schema must conform to the meta schema defined in `resources/meta_schema/scu_raw.json`. You can test this manually with :py:func:`zcloud.load_raw_schema`, or simply try to run scumover with it if you're feeling spicy.

Examples
========

Below are examples of SCU upload templates for Omegafold, RFD, and MPNN, as well as a "bad behavior" example.

Omegafold Example
-----------------

.. code-block:: json

    {
      "name": "omegafold_raw_default",
      "schema_version": "1.0",
      "description": "Schema for the raw output provided by an Omegafold run",
      "compute_type": "Omegafold",
      "discovery_rules": {
        "rules": [
          {
            "name": "find_ofold_raw_output",
            "type": "pattern",
            "description": "Rule for finding all output files, including the log file, rosetta score etc",
            "criteria": {
              "pattern": "*.pdb",
              "required": true,
              "max_depth": 0,
              "min_depth": 0
            }
          },
          {
            "name": "find_input_seqs",
            "description": "find the fasta files",
            "type": "pattern",
            "criteria": {
              "pattern": "*.fa",
              "required": false,
              "max_depth": 0,
              "min_depth": 0
            }
          }
        ]
      },
      "elements": [
        {
          "name": "ofold_raw_output",
          "description": "Raw output provided by an omegafold run",
          "type": "files",
          "discovery_rules": [
            "find_ofold_raw_output"
          ]
        },
        {
          "name": "input_seqs",
          "description": "FASTA files",
          "type": "files",
          "discovery_rules": [
            "find_input_seqs"
          ]
        }
      ]
    }

RFD Example
-----------

.. code-block:: json

    {
      "name": "RFD_raw_default",
      "schema_version": "1.0",
      "description": "Schema for the raw output provided by an RFD run",
      "compute_type": "RFDiff",
      "discovery_rules": {
        "rules": [
          {
            "name": "find_rfd_raw_output",
            "type": "paired_group",
            "description": "Rule for finding paired pdbs/trbs, keeping max and min depth to 1 to avoid the ./traj folder",
            "criteria": {
              "group_by": "basename",
              "pattern": "*.{ext}",
              "extensions": [
                ".trb",
                ".pdb"
              ],
              "required": true,
              "max_depth": 1,
              "min_depth": 1
            }
          },
          {
            "name": "find_rfd_config_json",
            "description": "Assumes the only json we have is the config json that the rfd wrapper was run with",
            "type": "pattern",
            "criteria": {
              "pattern": "*.json",
              "required": true,
              "num_files": 1,
              "max_depth": 0,
              "min_depth": 0
            }
          },
          {
            "name": "find_rfd_trajectory",
            "description": "find the pdb files in ./traj",
            "type": "pattern",
            "criteria": {
              "pattern": "*/traj/*.pdb",
              "required": false,
              "max_depth": 2,
              "min_depth": 2
            }
          }
        ]
      },
      "elements": [
        {
          "name": "rfd_config",
          "description": "The input config json for the rfd run",
          "type": "rfd_config_json",
          "discovery_rules": [
            "find_rfd_config_json"
          ]
        },
        {
          "name": "rfd_output",
          "description": "The output of the rfd run, pdb & trb",
          "type": "files",
          "discovery_rules": [
            "find_rfd_raw_output"
          ]
        },
        {
          "name": "rfd_traj",
          "type": "files",
          "description": "The pdb files in the ./traj folder",
          "discovery_rules": [
            "find_rfd_trajectory"
          ]
        }
      ]
    }

MPNN Example
------------

.. code-block:: json

    {
      "name": "MPNN_raw_default",
      "schema_version": "1.0",
      "description": "Schema for the raw output provided by an MPNN run",
      "compute_type": "MPNN",
      "discovery_rules": {
        "rules": [
          {
            "name": "find_mpnn_raw_output",
            "type": "pattern",
            "description": "Rule for finding all output files, including the log file, rosetta score etc",
            "criteria": {
              "pattern": "*",
              "required": true,
              "max_depth": 1,
              "min_depth": 1
            }
          },
          {
            "name": "find_mpnn_config_json",
            "description": "Assumes the only json we have is the config json that the mpnn wrapper was run with",
            "type": "pattern",
            "criteria": {
              "pattern": "mpnn*_log.json",
              "required": true,
              "num_files": 1,
              "max_depth": 1,
              "min_depth": 1
            }
          },
          {
            "name": "find_mpnn_seqs",
            "description": "find the fasta files",
            "type": "pattern",
            "criteria": {
              "pattern": "*.fa",
              "required": false,
              "max_depth": 1,
              "min_depth": 1
            }
          }
        ]
      },
      "elements": [
        {
          "name": "mpnn_raw_output",
          "description": "Raw output provided by an MPNN run",
          "type": "files",
          "discovery_rules": [
            "find_mpnn_raw_output"
          ]
        },
        {
          "name": "mpnn_config_json",
          "description": "Config JSON that the MPNN wrapper was run with",
          "type": "mpnn_config_json",
          "discovery_rules": [
            "find_mpnn_config_json"
          ]
        },
        {
          "name": "mpnn_seqs",
          "description": "FASTA files",
          "type": "files",
          "discovery_rules": [
            "find_mpnn_seqs"
          ]
        }
      ]
    }

If you're reading this, you're clearly not interested in all that technical mumbo jumbo, you just want your new SCU to run.
Here's the "bad behavior" template, which should work for just about anything. The full JSON is in the next section.

At the time of writing, the default behavior of :py:mod:`zcloud.generic_scu_transfer` is to find everything if no argument is specified. This is not strictly bad behavior, since the "identity" of upload components is largely arbitrary. It is more important to track metadata such as the project, authors, experiment name, the SCU that actually generated the data, and the upload timestamp. All of this is handled automatically, even with this generic template.

Bad Behavior Example
--------------------

.. code-block:: json

    {
      "name": "bad_behavior",
      "schema_version": "1.0",
      "compute_type": "BadBehavior",
      "discovery_rules": {
        "rules": [
          {
            "name": "find_everything_idk",
            "type": "pattern",
            "criteria": {
              "pattern": "*"
            }
          }
        ]
      },
      "elements": [
        {
          "name": "all_files",
          "type": "files",
          "discovery_rules": [
            "find_everything_idk"
          ]
        }
      ]
    }
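
Before dropping a new template into `resources/raw_scu_schema`, a quick structural check can catch common mistakes such as invalid JSON, a missing required property, or an element that names a rule that does not exist. The sketch below is an illustration only: ``check_template`` and the constants are hypothetical names, the ``discovery_rules.rules`` nesting is inferred from the examples in this document, and the check does not replace validation against the `scu_raw.json` meta schema.

.. code-block:: python

    import json

    # Required properties as listed in this document (hypothetical constant names).
    REQUIRED_TOP_LEVEL = ("name", "schema_version", "compute_type", "discovery_rules", "elements")
    REQUIRED_RULE_KEYS = ("name", "type", "criteria")
    SUPPORTED_RULE_TYPES = ("pattern", "paired_group")


    def check_template(text: str) -> list[str]:
        """Return a list of problems found in a template; an empty list means none were found."""
        try:
            template = json.loads(text)
        except json.JSONDecodeError as exc:
            return [f"not valid JSON: {exc}"]
        if not isinstance(template, dict):
            return ["top level must be a JSON object"]

        problems = [
            f"missing top-level property: {key}"
            for key in REQUIRED_TOP_LEVEL
            if key not in template
        ]

        # Assumption: rules live under discovery_rules["rules"], as in the examples.
        rules_block = template.get("discovery_rules")
        rules = rules_block.get("rules", []) if isinstance(rules_block, dict) else []
        rule_names = set()
        for rule in rules:
            label = rule.get("name", "<unnamed>")
            for key in REQUIRED_RULE_KEYS:
                if key not in rule:
                    problems.append(f"rule {label}: missing '{key}'")
            if "type" in rule and rule["type"] not in SUPPORTED_RULE_TYPES:
                problems.append(f"rule {label}: unsupported type '{rule['type']}'")
            if label in rule_names:
                problems.append(f"rule {label}: duplicate name")
            rule_names.add(label)

        # Every rule name referenced by an element must be defined above.
        for element in template.get("elements", []):
            for ref in element.get("discovery_rules", []):
                if ref not in rule_names:
                    problems.append(f"element {element.get('name')}: unknown rule '{ref}'")
        return problems


    MINIMAL = """
    {
      "name": "example_raw_default",
      "schema_version": "1.0",
      "compute_type": "Example",
      "discovery_rules": {
        "rules": [
          {"name": "find_all", "type": "pattern", "criteria": {"pattern": "*"}}
        ]
      },
      "elements": [
        {"name": "all_files", "type": "files", "discovery_rules": ["find_all"]}
      ]
    }
    """

    print(check_template(MINIMAL))
    # Misspell the rule name referenced by the element to trigger a report.
    print(check_template(MINIMAL.replace('"find_all"]', '"find_al"]')))

The first call reports no problems, while the second reports that the element refers to a rule name that no rule defines. A stray comma or other syntax error is reported by ``json.loads`` before any structural check runs.
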