zcloud.generic_scu_transfer module

This module provides a flexible, schema-configured pull-to-push adaptor that finds files generated by a registered SCU and pushes them to a bucket. Raw files can be organized logically into “elements”. For simple cases, a single element with a sufficiently generic rule is enough.

Functions

get_group_name()

complain_about_file_count()

run_discovery()

send_files_from_manifest()

Classes

DiscoveryRule

FileSchemaParser

class zcloud.generic_scu_transfer.DiscoveryRule(rule)[source]

Bases: object

Configurable file discovery rule. This class is used to find files on a filesystem based on instructions in a meta schema.

Note

This class is not optimized for performance in searching through large numbers of files or convoluted directory structures. It has O(n) complexity for each rule, where n is the number of files it is handed. One future improvement could be to have some “rule-aware” optimization at a higher level, or have rules with a lot of overlap run inside the same loop. Or … just don’t write 10,000 files in a single trajectory, how about that?

The rule configuration should have the following keys:

  • name: A unique name for the rule.

  • type: The type of rule. Currently, only ‘paired_group’, ‘pattern’, and ‘all_files’ are supported.

  • description (optional): A description of the rule.

  • criteria: A dictionary containing the criteria for the rule. The dictionary should have the following keys:
    • group_by: A string indicating how to group files. Currently, only ‘basename’ is supported. This key is relevant only for ‘paired_group’ rules and is ignored otherwise.

    • extensions: A list of strings giving the file extensions to examine for pairing. Extensions may include the leading period. This key is relevant only for ‘paired_group’ rules and is ignored otherwise.

    • max_depth: An integer indicating the maximum depth to search for files.

    • min_depth: An integer indicating the minimum depth to search for files.

    • fail_on_mismatch: A boolean indicating whether to raise an error if a pair is missing. If False or omitted, the rule will ignore files that do not have all partners.

    • pattern: A string containing a glob pattern to match files. For ‘paired_group’ rules, the pattern should contain a placeholder (.{ext}) that is replaced with each file extension. The same replacement is also supported for single-file rules.

    • required: A boolean indicating whether finding files is required for this rule. If False or omitted, the rule will not raise an error if no files are found.

    • num_files: An integer indicating the number of files to find. If the number of files found does not match this number, an error is raised. ‘paired_group’ rules count groups, while single-file rules count files. Incomplete groups do not count toward this number, which can cause unexpected behavior if the fail_on_mismatch and pattern criteria are misused.

Parameters:

rule (dict) – A dictionary containing the rule configuration.
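For illustration, rule configurations using the keys listed above might look like the following. Only the key names and the rule type strings come from this documentation; the rule names, extensions, and values are hypothetical.

```python
# Hypothetical rule configurations. Key names and "type" values follow the
# documentation above; every other value is illustrative only.
paired_rule = {
    "name": "structure_pairs",
    "type": "paired_group",
    "description": "Each .pdb file must have a matching .trb file.",
    "criteria": {
        "group_by": "basename",
        "extensions": [".pdb", ".trb"],
        "max_depth": 2,
        "fail_on_mismatch": True,
        "required": True,
    },
}

catch_all_rule = {
    "name": "everything",
    "type": "all_files",
    "criteria": {"required": False},
}

# A rule would then be constructed following the documented signature:
# rule = DiscoveryRule(paired_rule)
```

The catch-all rule corresponds to the single generic rule mentioned in the module description.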

apply(file_list)[source]
Return type:

list[str]

apply_all_files(file_list)[source]

Apply a rule to all files in the file list.

This is a catch-all rule which allows “rule criteria” without requiring patterns or extensions.

Parameters:

file_list (list) – List of files to search through.

Raises:

ValueError – Raised via complain_about_file_count if the number of files found does not satisfy the rule criteria.

Returns:

List of files that match the rule criteria

Return type:

list

apply_group_search(file_list)[source]

Apply a group search rule to a list of files.

Group search rules are used to find groups of files that match a certain pattern, but differ in extension or other patterns.

For example, if you have file1.pdb and file1.trb that are intrinsically related, you can use a group search rule to find both files, and even include rules that issue warnings or fail based on presence or absence of the group.

Detailed supported criteria are in the docs for the parent class.

Parameters:

file_list (list) – List of files to search through.

Returns:

List of files that match the group search rule.

Return type:

list
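The pairing idea can be sketched independently of this module. The helper below is not the implementation of apply_group_search. It is a minimal stand-in, under the assumption that files are grouped by their path without the extension, and its name and signature are hypothetical.

```python
import os
from collections import defaultdict


def pair_by_stem(file_list, extensions, fail_on_mismatch=False):
    """Return files whose stem appears with every listed extension.

    Hypothetical helper: groups on the path minus its extension, which is
    a simplification of whatever "basename" grouping does in the module.
    """
    # Normalize extensions so ".pdb" and "pdb" are treated alike.
    wanted = {"." + ext.lstrip(".") for ext in extensions}

    groups = defaultdict(dict)
    for path in file_list:
        stem, ext = os.path.splitext(path)
        if ext in wanted:
            groups[stem][ext] = path

    matched = []
    for stem, members in sorted(groups.items()):
        if set(members) == wanted:
            # Complete group: keep every partner file.
            matched.extend(members[ext] for ext in sorted(wanted))
        elif fail_on_mismatch:
            missing = sorted(wanted - set(members))
            raise ValueError(f"{stem} is missing partner(s): {missing}")
        # Otherwise incomplete groups are silently ignored.
    return matched


files = ["run/file1.pdb", "run/file1.trb", "run/file2.pdb", "run/notes.txt"]
print(pair_by_stem(files, [".pdb", "trb"]))
# ['run/file1.pdb', 'run/file1.trb']
```

Here file2.pdb has no .trb partner, so it is dropped. With fail_on_mismatch=True the same input raises a ValueError instead.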

apply_pattern(file_list)[source]

Apply a pattern search rule to a list of files.

Pattern search rules are used to find individual files that match a certain pattern.

Parameters:

file_list (list) – List of files to search through.

Returns:

List of files that match the pattern search rule.

Return type:

list
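As a rough illustration of glob-style matching on individual files, the sketch below uses the standard library's fnmatch module. It is not the code behind apply_pattern. The function name is hypothetical, and matching on the basename rather than the full path is an assumption.

```python
import fnmatch
import os


def match_pattern(file_list, pattern, ext=None):
    """Return files whose basename matches a glob pattern (hypothetical helper)."""
    if ext is not None:
        # Fill the {ext} placeholder, tolerating a leading period in ext.
        pattern = pattern.replace("{ext}", ext.lstrip("."))
    return [
        path
        for path in file_list
        # Assumption: the pattern is compared against the basename only.
        if fnmatch.fnmatchcase(os.path.basename(path), pattern)
    ]


files = ["out/traj_001.log", "out/traj_002.log", "out/summary.json"]
print(match_pattern(files, "traj_*.log"))
# ['out/traj_001.log', 'out/traj_002.log']
print(match_pattern(files, "*.{ext}", ext=".json"))
# ['out/summary.json']
```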

class zcloud.generic_scu_transfer.FileSchemaParser(schema)[source]

Bases: object

discover_files(directory)[source]

Discover files for each element in the schema.

Elements can also have “sub-elements” and each element can have any number of discovery rules applied to it.

Each element is expected to have at least the following structure:

{
"name": "element_name",
"description": "A brief description of the element",
"discovery_rules": ["rule1", "rule2"],
// And, optionally, with sub-elements (this can continue recursively, please avoid abusing it):
"elements": [
    {
    "name": "sub_element_name",
    "description": "A brief description of the sub-element",
    "discovery_rules": ["rule3", "rule4"],
    "elements": []
    }
]
// end optional sub-elements
}
Parameters:

directory (str) – The directory to search for files.

Returns:

A dictionary containing the discovered files for each element. Key: element name, Value: list of discovered files, unordered.

Return type:

dict
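The recursive element structure can be walked with a short helper. The sketch below only collects rule names per element. It does not search a directory or resolve rule names to DiscoveryRule objects, and the function name is hypothetical.

```python
def collect_rule_names(element, out=None):
    """Map each element name to its rule names, descending into sub-elements."""
    if out is None:
        out = {}
    out[element["name"]] = list(element.get("discovery_rules", []))
    # "elements" is optional, so a missing key is treated as no sub-elements.
    for child in element.get("elements", []):
        collect_rule_names(child, out)
    return out


element = {
    "name": "element_name",
    "description": "A brief description of the element",
    "discovery_rules": ["rule1", "rule2"],
    "elements": [
        {
            "name": "sub_element_name",
            "description": "A brief description of the sub-element",
            "discovery_rules": ["rule3", "rule4"],
            "elements": [],
        }
    ],
}
print(collect_rule_names(element))
# {'element_name': ['rule1', 'rule2'], 'sub_element_name': ['rule3', 'rule4']}
```

The result is keyed by element name, mirroring the shape of the dictionary that discover_files is documented to return, with rule names standing in for discovered files.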

get_manifest_dict(element_files, manifest_schema_handler=None, **meta)[source]

Generates a manifest dictionary containing metadata and file information.

Parameters:

element_files (dict) – A dictionary where keys are element names and values are lists of file paths.

Returns:

A dictionary containing the following keys:

  • “schema_version”: The schema version.

  • “description”: A description of the manifest.

  • “elements”: Elements associated with the manifest.

  • “element_files”: The input element_files dictionary.

  • “uuid”: A unique identifier for the manifest.

  • “files”: A list of normalized file paths with the UUID prepended.

Return type:

dict
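To make the shape of the returned dictionary concrete, here is a minimal sketch that builds a dictionary with the same keys. It is not the implementation of get_manifest_dict. The function name and default values are hypothetical, and it assumes that “normalized” means POSIX path normalization and that “prepended” means the UUID becomes the leading path component.

```python
import posixpath
import uuid


def build_manifest(element_files, schema_version="0.1", description="", elements=None):
    """Assemble a manifest-like dict from an element -> file-list mapping."""
    manifest_id = str(uuid.uuid4())
    # Normalize paths and drop duplicates while preserving first-seen order.
    normalized = dict.fromkeys(
        posixpath.normpath(path).lstrip("/")
        for paths in element_files.values()
        for path in paths
    )
    return {
        "schema_version": schema_version,
        "description": description,
        "elements": elements or [],
        "element_files": element_files,
        "uuid": manifest_id,
        # Assumption: the UUID is used as the leading path component.
        "files": [posixpath.join(manifest_id, path) for path in normalized],
    }


manifest = build_manifest({"structures": ["./run/file1.pdb", "run/file1.trb"]})
print(manifest["files"])
```

Each entry in manifest["files"] has the form <uuid>/run/file1.pdb, so files from separate manifests cannot collide under a shared prefix.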

zcloud.generic_scu_transfer.check_depth(file, max_depth=None, min_depth=None)[source]
Return type:

bool

zcloud.generic_scu_transfer.complain_about_file_count(output_files, rule_name, num_files=None, required=None)[source]

Raise a ValueError if the number of files found does not match the expected number.

The content of the error describes whether or not the rule requires the files to be found, and how many files were found.

Parameters:
  • output_files (set) – The set of files found by the rule.

  • rule_name (str) – The name of the rule.

  • num_files (int, optional) – The number of files expected to be found. If not provided, the error message will say “at least one”.

  • required (bool, optional) – Whether the rule requires the files to be found. If not provided, the error message will say the rule is not required.

Raises:

ValueError – If the number of files found does not match the expected number.

Return type:

None
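The count check described above can be approximated as follows. This is a sketch, not the module's code. The function name is hypothetical, and it assumes that a rule which is not required may find zero files without error, which is one reading of how required and num_files interact.

```python
def check_file_count(found, rule_name, num_files=None, required=None):
    """Raise ValueError when the number of found files is not acceptable."""
    count = len(found)
    if count == 0 and not required:
        # Assumption: optional rules are allowed to find nothing.
        return
    expected = "at least one" if num_files is None else str(num_files)
    ok = count >= 1 if num_files is None else count == num_files
    if not ok:
        kind = "required" if required else "not required"
        raise ValueError(
            f"Rule '{rule_name}' ({kind}) expected {expected} file(s) "
            f"but found {count}."
        )


check_file_count({"a.pdb", "b.pdb"}, "structures", num_files=2, required=True)
check_file_count(set(), "optional_logs")  # passes: the rule is not required
```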

zcloud.generic_scu_transfer.config_from_path(path, scusch=None)[source]
zcloud.generic_scu_transfer.get_group_name(filename, group_by, pattern=None, extension=None)[source]

Get the “group_name” for a file based on the “group_by” rules.

This is used by the “DiscoveryRule” class for the “paired_group” type of rule.

Parameters:
  • filename (str) – The filename to get the group name for.

  • group_by (str) – The group_by rule to use. Currently, only “basename” is supported.

  • pattern (str, optional) – The pattern to use for the group name. If not provided, the implicit pattern is *.{ext}.

  • extension (str, optional) – The extension to use for the group name. If not provided, the extension is not used.

Returns:

The group name for the file.

Return type:

str
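One reason to pass the extension explicitly is multi-part extensions, where a plain split on the last period gives the wrong group name. The sketch below illustrates that point only. It ignores the pattern argument entirely, its name is hypothetical, and it should not be read as the behavior of get_group_name.

```python
import os


def group_name_from_basename(filename, extension=None):
    """Derive a group name by stripping the directory and the extension."""
    base = os.path.basename(filename)
    if extension:
        suffix = "." + extension.lstrip(".")
        if base.endswith(suffix):
            # Strip the full known extension, e.g. ".tar.gz".
            return base[: -len(suffix)]
    # Fallback: strip only the last extension component.
    return os.path.splitext(base)[0]


print(group_name_from_basename("run/file1.pdb", ".pdb"))      # file1
print(group_name_from_basename("run/sample.tar.gz", "tar.gz"))  # sample
print(group_name_from_basename("run/sample.tar.gz"))            # sample.tar
```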

zcloud.generic_scu_transfer.merge_jira_and_config(config_dict, ticket_dict)[source]
zcloud.generic_scu_transfer.read_manifest(bucket_name, manifest_path)[source]
zcloud.generic_scu_transfer.run_discovery(schema_name, directory, **meta)[source]

Discover files in a directory based on a given schema and generate a manifest.

Parameters:
  • schema_name (str) – The name of the schema to load.

  • directory (str) – The directory to search for files.

Returns:

A dictionary representing the manifest of discovered files.

Return type:

dict

zcloud.generic_scu_transfer.scrape_jira_ticket(ticket_dict)[source]
zcloud.generic_scu_transfer.send_files_from_manifest(bucket_name, project, manifest, base_dir='.')[source]
zcloud.generic_scu_transfer.sub_alias(potential_alias)[source]
zcloud.generic_scu_transfer.sub_aliases_and_validate_config(config_dict)[source]