Order Uploader Documentation

The Order Uploader is a basic CLI application designed to simplify the process of uploading protein sequence data to both cloud storage (GCP) and the Benchling Electronic Lab Notebook (ELN) platform. This tool handles the complexities of data validation, metadata consistency checking, and some error correction to ensure your experimental data is properly archived and accessible for future analysis.

Overview

The Order Uploader processes CSV files containing protein sequence data and associated metadata, performing validation against your organization’s schema and metadata standards. It can infer some missing information, correct common typos through fuzzy matching, and organize score data from Scientific Compute Units (SCUs) into appropriately structured tables.

Key Features:

  • Design name generation: Automatically construct design names from metadata components

  • Fuzzy matching: Correct typos in metadata fields with user confirmation

  • Score column organization: Automatically categorize and upload score data to appropriate SCU tables

  • Interactive error handling: Guided correction of data inconsistencies

  • Benchling integration: Upload to Benchling ELN with proper folder organization

  • Monday.com integration: Validation against project tickets for consistency

Quick Start

Note

The cluster wrapper is installed at /runtime/scripts/order_uploader and is available on the head node as order_uploader.

Basic usage requires a CSV file with, at minimum, sequence and tag_location columns, plus the program ID. The program ID must be registered in the metadata source of truth with a Benchling Program ID that matches the name of the top-level project folder in Benchling. At the time of this release, the Benchling API does not support creating top-level project folders; you must use an existing one (or create one manually).

It is recommended to also provide the design_name column:

order_uploader --input-csv-path my_sequences.csv --user-id your.email@ziptx.bio --program-id PROG123

Example CSV:

Basic CSV format

design_name,sequence,tag_location
IL11_S1_Ic1_001,MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFE,N-term
IL11_S1_Ic1_002,AKQRQISFVKSHFSRQLEE,C-term
IL11_S1_Ic1_GSFus003,MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSGGGGGGGGGGGGGGGGGGGGGSSSSSSSSS,N-term

If you do not provide a design_name column, or wish to override the design names where relevant, you can provide the metadata as CLI arguments:

order_uploader \
  --input-csv-path my_sequences.csv \
  --program-id PROG123 \
  --target-id IL11 \
  --binding-site-id S1 \
  --iteration-number Ic1 \
  --user-id your.email@ziptx.bio \
  --monday-ticket-link "https://ziptx.monday.com/boards/123/items/456"

You can also provide the metadata as columns in your CSV:

CSV with metadata columns

program_id,target_id,binding_site,iteration,fusion_id,design_name,sequence,tag_location
PROG123,IL11,S1,Ic1,,IL11_S1_Ic1_001,MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFE,N-term
PROG123,IL11,S1,Ic1,,IL11_S1_Ic1_002,AKQRQISFVKSHFSRQLEE,C-term
PROG123,IL11,S1,Ic1,GSFus,IL11_S1_Ic1_GSFus003,MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSGGGGGGGGGGGGGGGGGGGGGSSSSSSSSS,N-term

Be aware that conflicts between in-CSV and CLI arguments are resolved in favor of the CLI only if you pass the --allow-cli-override flag. Otherwise, the tool will fail and ask you to resolve the conflict.

It is generally recommended to either provide the design names yourself, or supply metadata exclusively through the CLI or exclusively through CSV columns. Mixing approaches may hit edge cases the implementation cannot handle.
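The resolution rule above can be sketched as follows; the function name and error message are illustrative, not the tool's actual internals:

```python
def resolve_value(csv_value, cli_value, allow_cli_override):
    """Resolve one metadata value from CSV and CLI sources.

    Sketch of the documented rule: the CLI wins only when
    --allow-cli-override is set; a genuine conflict otherwise
    raises an error asking the user to fix the input.
    """
    if cli_value is None:
        return csv_value
    if csv_value is None or csv_value == cli_value:
        return cli_value
    if allow_cli_override:
        return cli_value  # CLI explicitly overrides the CSV
    raise ValueError(
        f"CSV value {csv_value!r} conflicts with CLI value {cli_value!r}; "
        "re-run with --allow-cli-override or correct the input"
    )
```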

Additional files

You can also upload additional files to the order bucket, such as configuration files or supplementary data. Directory names are supported: the tool recursively uploads all files in any directory path you provide, preserving the directory structure. Keep in mind that symlinks can break traversal, so avoid them here.

The Benchling order will include a top-level pointer to the GCS “folder” that contains the additional files.

order_uploader --input-csv-path my_sequences.csv --user-id your.email@ziptx.bio --additional-upload-paths config.json /path/to/analysis_plots/

Usage Scenarios

Scenario 1: Well-Formatted CSV Upload

When to use: You have a complete, properly formatted CSV with design names, sequences, and tag locations.

Example CSV (complete_order.csv):

Complete order CSV

design_name,sequence,tag_location
IL11_S1_Ic1_001,MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ…,N-term

Command:

order_uploader \
  --input-csv-path complete_order.csv \
  --user-id researcher@ziptx.bio

What happens:

  1. The tool validates design names against the metadata schema

  2. Confirms all required columns are present

  3. Checks for any additional score columns (none in this case)

  4. Creates appropriate Benchling folders and uploads data

  5. Registers protein entities and generates registry IDs

Scenario 2: CSV Without Design Names

When to use: You have sequences and metadata but need the tool to generate design names.

Example CSV (sequences_only.csv):

Sequences without design names

sequence,tag_location,program,target_id,binding_site,iteration
MKTAYIAKQRQISFVKSHFS…,N-term,PROG123,IL11,S1,Ic1
AKQRQISFVKSHFSRQLEE…,C-term,PROG123,IL11,S1,Ic1

This will generate design names like IL11_S1_Ic1_001, IL11_S1_Ic1_002, etc.

Command:

order_uploader \
  --input-csv-path sequences_only.csv \
  --user-id researcher@ziptx.bio

What happens:

  1. Tool extracts metadata from CSV columns

  2. Validates each metadata component against the schema

  3. Generates design names: IL11_S1_Ic1_001, IL11_S1_Ic1_002

  4. Proceeds with normal upload workflow
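The name-generation step can be sketched as follows; generate_design_names is a hypothetical helper that applies the sequential numbering described above, zero-padded to 3 digits:

```python
def generate_design_names(target_id, binding_site, iteration, count, start=1):
    """Generate sequential design names (sketch of the assumed behavior):
    Target_Site_Iteration_NNN, with NNN zero-padded to 3 digits."""
    return [
        f"{target_id}_{binding_site}_{iteration}_{i:03d}"
        for i in range(start, start + count)
    ]
```

For example, generate_design_names("IL11", "S1", "Ic1", 2) yields IL11_S1_Ic1_001 and IL11_S1_Ic1_002.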

Scenario 3: Conflicting Data Requiring Correction

When to use: Your CSV has inconsistencies that need user intervention.

Example: CSV with mismatched design names vs. inferred metadata.

Example CSV (conflicted_order.csv):

Conflicted data CSV

design_name,sequence,tag_location,target_id
IL11_S1_Ic1_001,MKTAYIAK…,N-term,FasR
IL11_S1_Ic1_002,AKQRQISF…,C-term,IL11

What happens:

  1. Tool detects target_id mismatch in first row (IL11 in name vs FasR in column)

  2. Prompts user with options:

    Design names clash with expectations based on the input configuration
    WARNING: if you choose (1), the 'raw' csv will be the original (mismatched) one provided at the beginning.
    Unable to find design names. Please select an option:
    1. Use the generated design names, continue to upload
    2. Save the generated names to file, and exit
    3. Exit
    
  3. If option 2 is chosen, the tool saves the corrected CSV for future use but does not upload to Benchling.
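The mismatch detection can be sketched like this; find_name_mismatches is a hypothetical helper, not the tool's API:

```python
def find_name_mismatches(provided_names, generated_names):
    """Pair the provided design names with names regenerated from the
    metadata columns and report rows where they disagree (sketch only;
    the tool's actual comparison may differ)."""
    return [
        (row, provided, generated)
        for row, (provided, generated) in enumerate(zip(provided_names, generated_names))
        if provided != generated
    ]
```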

Scenario 4: Handling Typos with Fuzzy Matching

When to use: Your metadata contains typos or slight variations from registered values.

Example: Using PROG12 instead of PROG123 or IL1 instead of IL11.

Command:

order_uploader \
  --input-csv-path order_with_typos.csv \
  --program-id PROG12 \
  --target-id IL1 \
  --user-id researcher@ziptx.bio

What happens:

  1. Tool fails to find exact match for PROG12

  2. Queries metadata tables for fuzzy matches

  3. Presents options:

    PROG12 does not match any registered names. Did you mean PROG123? (y/n):
    
  4. User confirms correction

  5. Continues with validated metadata

For multiple fuzzy matches:

Please select the number of the fuzzy match you meant to check for in program_table:
1. PROG123
2. PROG124
3. PROG125
4. Enter a new value
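A rough sketch of the fuzzy lookup using the standard library's difflib; the tool's real matcher and scoring may differ, and the 0-100 threshold scale here is an assumption borrowed from the fuzzy_match_threshold parameter in the API reference:

```python
import difflib

def fuzzy_candidates(query, registered, threshold=70):
    """Return registered names that fuzzily match the query, best first.

    Sketch only: difflib ratios (0.0-1.0) are rescaled to 0-100 to
    mimic the assumed threshold scale.
    """
    scored = [
        (name, int(difflib.SequenceMatcher(None, query.lower(), name.lower()).ratio() * 100))
        for name in registered
    ]
    return [name for name, score in sorted(scored, key=lambda t: -t[1]) if score >= threshold]
```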

Scenario 5: CSV with Score Columns

When to use: Your CSV includes additional score columns from Scientific Compute Units (SCUs).

Example CSV (order_with_scores.csv):

CSV with score columns

design_name,sequence,tag_location,stability_score,binding_affinity,expression_level
IL11_S1_Ic1_001,MKTAYIAK…,N-term,0.85,1.2e-9,145.2
IL11_S1_Ic1_002,AKQRQISF…,C-term,0.92,8.7e-10,162.8

What happens:

  1. Tool identifies stability_score, binding_affinity, expression_level as potential score columns

  2. Validates each against SCU schema endpoints

  3. Groups validated columns by their SCU table assignments

  4. Uploads core sequence data to main table

  5. Uploads score columns to appropriate SCU-specific tables (e.g., stability_table.csv, binding_table.csv)

If score columns have typos:

Unable to find these columns in the registered SCU values: {'expresion_level'}
WARNING: If you choose a fuzzy match, your score tables uploaded will not match the raw uploaded csv
Would you like to check for fuzzy matches? (y/n):

User can choose to:

  • Accept fuzzy matches with confirmation

  • Manually rename columns

  • Keep orphaned columns (uploaded as orphaned_score_columns.csv)
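The routing logic can be sketched as follows; the core-column set and the {table_id: [field, ...]} schema shape are assumptions for illustration:

```python
def split_score_columns(columns, scu_schema):
    """Group candidate score columns by SCU table; report orphans.

    Sketch of the assumed logic: any column that is not a core column
    is a candidate score column, matched against a hypothetical
    {table_id: [field, ...]} SCU schema mapping.
    """
    core = {"design_name", "sequence", "tag_location"}
    candidates = [c for c in columns if c not in core]
    by_table, orphaned = {}, set(candidates)
    for table_id, fields in scu_schema.items():
        found = [c for c in candidates if c in fields]
        if found:
            by_table[table_id] = found
            orphaned -= set(found)  # matched columns are no longer orphans
    return by_table, orphaned
```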

Design Naming Convention

ZipTx follows a specific naming convention for protein designs. Understanding this format is crucial for using the Order Uploader effectively.

Format: Target_Site_Iteration_Number

Components

Target

The target protein identifier (e.g., IL11, FasR, IL6, VEGFCD)

Site

The binding site identifier, which can be:

  • Simple site numbers: S1, S2, S3

  • Complex site identifiers: S3ab, S1c

  • Named sites: epitope1, binding_domain

Iteration

The design iteration following the format I{letter}{number}:

  • Letter component (a-z): advances with each unsuccessful previous design iteration (Ia, Ib, Ic)

  • Number component: integer starting from 1, no zero-padding; advances with each successful previous design iteration (Ic1, Ic2, Ic3)

Note: Once the first successful design iteration is found, all future designs will increment the number and retain the letter, even if a future design is unsuccessful.

Number

Sequential design number within the iteration, zero-padded to 3 digits (001, 002, 012)

Examples

Design naming examples

IL11_S1_Ia1_001: First iteration (no previous attempts), IL-11 site 1, design 1

FasR_S1_Ic1_005: 3rd iteration (2 unsuccessful previous attempts), FasR site 1, design 5

IL6_S3ab_Ia3_012: 3rd iteration (on previously successful designs), IL-6 site 3ab, design 12

Fusion Proteins

For designs fused to drug backbones, add the 3-letter drug code before the number:

Fusion protein naming examples

VEGFC/D_S1_Ia4_Eyl001: VEGFC/D target, site 1, 4th iteration, Eyl drug backbone, design 1

The tool automatically handles fusion protein naming when --fusion-id is provided or fusion information is included in the CSV.
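The full convention, including the optional fusion code, can be captured by a regular expression. This sketch reflects the rules above, but the character classes accepted for target and site names are assumptions, and since the examples show both 3-letter (Eyl) and longer (GSFus) fusion codes, any run of letters is accepted for the fusion part:

```python
import re

# Sketch of a validator for Target_Site_Iteration_[Fusion]Number.
DESIGN_NAME_RE = re.compile(
    r"^(?P<target>[A-Za-z0-9/]+)_"      # target, e.g. IL11 or VEGFC/D
    r"(?P<site>[A-Za-z0-9_]+?)_"        # site, e.g. S1, S3ab, binding_domain
    r"I(?P<letter>[a-z])(?P<iter_num>[1-9][0-9]*)_"  # iteration, e.g. Ic1
    r"(?P<fusion>[A-Za-z]+)?"           # optional fusion code, e.g. Eyl, GSFus
    r"(?P<number>[0-9]{3})$"            # zero-padded design number
)

def parse_design_name(name):
    """Return the name's components as a dict, or None if it does not parse."""
    m = DESIGN_NAME_RE.match(name)
    return m.groupdict() if m else None
```

For example, parse_design_name("FasR_S1_Ic1_005") returns target FasR, site S1, iteration letter c, and design number 005.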

CLI Reference

zcloud.console_scripts.order_uploader.upload_order(*args, **kwargs)

Upload an order to Benchling, validate and/or generate the sequence names and score columns.

If the input CSV design_name column is well formatted and follows the schema, then all you need is a CSV with the columns: design_name, sequence, tag_location.

You can also omit the name, as long as you provide columns with the appropriate program ID, target ID, binding site ID, etc. You may also provide missing data as CLI arguments; with --allow-cli-override, these override the data in the CSV (and the names are regenerated).

Parameters:
  • input_csv_path (str) – The path to the input CSV file.

  • program_id (Optional[str]) – The ID of the program.

  • target_id (Optional[str]) – The ID of the target.

  • binding_site_id (Optional[str]) – The ID of the binding site.

  • user_id (Optional[str]) – The ID of the user.

  • monday_ticket_link (Optional[str]) – The link to the Monday ticket.

  • iteration_number (Optional[str]) – The iteration number.

  • fusion_id (Optional[str]) – The ID of the fusion.

  • additional_upload_paths (Tuple[str, ...]) – Additional paths to upload; can be used to upload any files or directories recursively to the order bucket. Useful if you used an unusual config and want to record it.

  • allow_cli_override (bool) – Allow CLI arguments to override the data in the input CSV, default is to fail and complain.

Raises:
  • FileNotFoundError – If the input CSV file is not found.

  • ValueError – If required columns are missing or if there are validation errors.

  • SystemExit – If user chooses to exit during interactive prompts.

Core Parameters

--input-csv-path (required)

Path to your input CSV file containing sequence data.

--user-id

Your email address for attribution in Benchling entries.

--monday-ticket-link

URL to the Monday.com ticket for this design campaign. Used for consistency validation.

Metadata Parameters

These can be provided via CLI or included as columns in your CSV:

--program-id

The program identifier for this design campaign.

--target-id

The target protein identifier.

--binding-site-id

The binding site identifier on the target.

--fusion-id

The fusion construct identifier.

--iteration-number

The design iteration identifier.

Advanced Options

--allow-cli-override

Allow CLI arguments to override CSV data when conflicts arise. Default: False.

--additional-upload-paths

Additional files or directories to upload alongside the main data. Useful for configuration files or supplementary data.

--additional-upload-paths config.json analysis_plots/

Required CSV Columns

Mandatory columns:

  • sequence: The protein amino acid sequence

  • tag_location: Location of any tags (typically ‘N-term’, ‘C-term’, or ‘internal’)

Optional columns:

  • design_name: Complete design name following ZipTx convention (generated if missing)

  • program, target_id, binding_site, fusion_id, iteration: Metadata components for name generation

  • Additional columns are treated as score data and validated against SCU schemas
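The column classification above can be sketched as follows; classify_columns is a hypothetical helper (the document shows both program and program_id as metadata column names, so both are accepted here):

```python
REQUIRED = {"sequence", "tag_location"}
METADATA = {
    "design_name", "program", "program_id", "target_id",
    "binding_site", "fusion_id", "iteration",
}

def classify_columns(columns):
    """Split CSV columns into required, metadata, and score candidates.

    Sketch of the documented rule: required columns must be present,
    recognized metadata columns are optional, and anything else is
    treated as candidate score data for SCU validation.
    """
    missing = REQUIRED - set(columns)
    if missing:
        raise ValueError(f"Required columns missing from input CSV: {sorted(missing)}")
    return set(columns) - REQUIRED - METADATA
```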

API Reference

For developers and power users, key functions include:

Core Validation Functions

zcloud.benchling_order.check_program_id(program_id_query, eval_records=None, try_to_find_monday_id=None)[source]

Validate program ID against metadata oracle.

Parameters:
  • program_id_query (str) – Program ID to validate.

  • eval_records (Optional[List[Dict[str, str]]], optional) – Pre-loaded records to validate against instead of making API call, by default None.

  • try_to_find_monday_id (Optional[str], optional) – Monday ID to try to match against, by default None.

Returns:

Tuple containing (program_id_benchling, program_id_design, program_id_monday).

Return type:

Tuple[str, str, str]

Raises:

ValueError – If program ID cannot be found in the metadata.

zcloud.benchling_order.check_target_id(target_id_query, allowed_other_ids=None, eval_records=None, try_to_find_monday_id=None)[source]

Validate target ID against metadata oracle.

Parameters:
  • target_id_query (str) – Target ID to validate.

  • allowed_other_ids (Optional[List[str]], optional) – List of allowed program IDs for cross-validation, by default None.

  • eval_records (Optional[List[Dict[str, str]]], optional) – Pre-loaded records to validate against instead of making API call, by default None.

  • try_to_find_monday_id (Optional[str], optional) – Monday ID to try to match against, by default None.

Returns:

Tuple containing (matching_program_id, matching_target_id_benchling, matching_target_id_design, matching_target_id_monday, matching_target_id_internal).

Return type:

Tuple[str, str, str, str, str]

Raises:

UnableToFindMetadataError – If target ID cannot be found in the metadata.

zcloud.benchling_order.check_binding_site_id(binding_site_query, allowed_other_ids=None, eval_records=None)[source]

Validate binding site ID against metadata oracle.

Parameters:
  • binding_site_query (str) – Binding site ID to validate.

  • allowed_other_ids (Optional[List[str]], optional) – List of allowed target IDs for cross-validation, by default None.

  • eval_records (Optional[List[Dict[str, str]]], optional) – Pre-loaded records to validate against instead of making API call, by default None.

Returns:

Tuple containing (matching_target_id, binding_site_id_benchling, binding_site_id_design).

Return type:

Tuple[str, str, str]

Raises:

UnableToFindMetadataError – If binding site ID cannot be found in the metadata.

zcloud.benchling_order.check_fusion_id(fusion_id_query, eval_records=None)[source]

Validate a fusion ID against the metadata validator.

Parameters:
  • fusion_id_query (str) – Fusion ID to validate.

  • eval_records (Optional[List[Dict[str, str]]], optional) – Pre-loaded records to validate against instead of making API call, by default None.

Returns:

Tuple containing (fusion_id_internal, fusion_id_benchling, fusion_id_design).

Return type:

Tuple[str, str, str]

Raises:

UnableToFindMetadataError – If fusion ID cannot be found in the metadata.

Candidate Resolution

zcloud.console_scripts.order_uploader.confirm_single_value_from_user(value_type, df, cli_value, allow_cli_override)[source]

Confirm and resolve a single value from user input or DataFrame.

This function attempts to resolve a value from the DataFrame first, and if that fails, prompts the user for input or uses CLI override values.

Parameters:
  • value_type (str) – The type of value to resolve (e.g., ‘program_id’, ‘target_id’).

  • df (pd.DataFrame) – The input DataFrame containing the data to analyze.

  • cli_value (Optional[str]) – Optional CLI-provided value to use as override.

  • allow_cli_override (bool) – Whether to allow CLI values to override DataFrame values.

Returns:

The resolved value for the specified type.

Return type:

str

Raises:

SystemExit – If user chooses to exit during the confirmation process.

zcloud.console_scripts.order_uploader.confirm_set_of_values_from_user(value_type, df, cli_value, allow_cli_override)[source]

Confirm and resolve a set of values from user input or DataFrame.

This function attempts to resolve values from the DataFrame first, and if that fails, prompts the user for input or uses CLI override values.

Parameters:
  • value_type (str) – The type of value to resolve (e.g., ‘binding_site_id’, ‘fusion_id’).

  • df (pd.DataFrame) – The input DataFrame containing the data to analyze.

  • cli_value (Optional[str]) – Optional CLI-provided value to use as override.

  • allow_cli_override (bool) – Whether to allow CLI values to override DataFrame values.

Returns:

A set of resolved values for the specified type.

Return type:

Set[str]

Raises:

SystemExit – If user chooses to exit during the confirmation process.

Fuzzy Matching and Error Handling

zcloud.console_scripts.order_uploader.check_metadata_against_oracle_with_fuzzy_find_on_fail(metadata_table_id, query_value, allowed_other_ids=None, try_to_find_monday_id=None, fuzzy_match_threshold=70, _recursion_depth=0)[source]

Check metadata value against oracle with fuzzy matching fallback.

Attempts to validate a metadata value against the oracle database. If the exact match fails, falls back to fuzzy matching and user confirmation. Uses recursion to handle multiple validation attempts.

Parameters:
  • metadata_table_id (str) – The ID of the metadata table to check against.

  • query_value (str) – The value to validate.

  • allowed_other_ids (Optional[List[str]], optional) – Additional IDs that are allowed for validation, by default None.

  • try_to_find_monday_id (Optional[str], optional) – Monday ID to try to match against, by default None.

  • fuzzy_match_threshold (int, optional) – Threshold for fuzzy matching (0-100), by default 70.

  • _recursion_depth (int, optional) – Internal recursion depth counter, by default 0.

Returns:

A tuple containing the validated metadata information. The exact structure depends on the metadata checker function used.

Return type:

Tuple[Any, …]

Raises:
  • SystemExit – If maximum recursion depth is reached.

  • ValueError – If there are errors retrieving metadata tables.

zcloud.console_scripts.order_uploader.ask_user_to_confirm_fuzzy_match(query_val, fuzzy_matches, metadata_table_id)[source]

Ask the user to confirm a fuzzy match selection from a list of candidates.

Parameters:
  • query_val (str) – The original query value that did not match exactly.

  • fuzzy_matches (Iterable[str]) – A list of fuzzy match candidates.

  • metadata_table_id (str) – The ID of the metadata table being queried.

Returns:

The value selected by the user (either from fuzzy matches or a new value).

Return type:

str

Raises:

SystemExit – If user chooses to exit during the selection process.

SCU and Score Column Processing

zcloud.console_scripts.order_uploader.check_scu_against_oracle_with_fuzzy_find_on_fail(score_columns_to_check, _recursion_depth=0, all_tables=None, keymap=None)[source]

Check score columns against SCU oracle with fuzzy matching fallback.

Validates score column names against the SCU (Scientific Compute Unit) schema. If exact matches fail, provides fuzzy matching and user interaction to resolve column names. Handles orphaned columns that cannot be matched.

Parameters:
  • score_columns_to_check (Set[str]) – Set of score column names to validate.

  • _recursion_depth (int, optional) – Internal recursion depth counter, by default 0.

  • all_tables (Optional[Dict[str, List[Dict[str, str]]]], optional) – Cached SCU tables data to avoid repeated API calls, by default None.

  • keymap (Optional[Dict[str, str]], optional) – Mapping of original column names to corrected names, by default None.

Returns:

A tuple containing:

  • Dictionary mapping table IDs to lists of found field names

  • Set of orphaned score columns that couldn’t be matched

  • Dictionary mapping original column names to corrected names

Return type:

Tuple[Dict[str, List[str]], Set[str], Dict[str, str]]

Raises:
  • SystemExit – If maximum recursion depth is reached.

  • ValueError – If there are errors retrieving SCU tables.

zcloud.benchling_order.check_scu_schema(query_dict, all_tables=None)[source]

Check Scientific Compute Unit (SCU) schema against available tables.

Parameters:
  • query_dict (Dict[str, str]) – Dictionary containing query parameters with field information.

  • all_tables (Optional[Dict[str, List[Dict[str, str]]]], optional) – Pre-loaded table data to avoid API calls. If None, will make API call.

Returns:

Dictionary mapping table IDs to lists of found field names.

Return type:

Dict[str, List[str]]

Raises:

ValueError – If SCU validation fails with non-200 status code.

Design Name Generation

zcloud.benchling_order.pdapply_build_design_names_from_row(row, iteration, target_id, binding_site_id=None, fusion_id=None, override=False)[source]

Build a design name from a row, intended for pandas apply.

Parameters:
  • row (pd.Series) – Pandas Series representing a row of data.

  • iteration (str) – Iteration code to use in the design name.

  • target_id (str) – Target ID to use in the design name.

  • binding_site_id (Optional[str], optional) – Binding site ID to use. If None, will try to infer from row data.

  • fusion_id (Optional[str], optional) – Fusion ID to use. If None, will try to infer from row data.

  • override (bool, optional) – If True, use provided parameters directly. If False, try to infer missing values.

Returns:

Generated design name in the format target_id_binding_site_iteration_fusion_id###.

Return type:

str

zcloud.benchling_order.check_generated_design_names(df, generated_design_names, allow_cli_override=False)[source]

Check if generated design names match existing design names in DataFrame.

Parameters:
  • df (pd.DataFrame) – Input DataFrame containing existing design names.

  • generated_design_names (pd.Series) – Series of generated design names to compare against.

  • allow_cli_override (bool, optional) – Whether to allow override and use generated names when there’s a mismatch, by default False.

Raises:

UnableToFindMetadataError – If design names don’t match and override is not allowed.

Return type:

None

Benchling Integration

zcloud.benchling_order.create_benchling_order_folder(program_id, target_id, iteration)[source]

Create a new folder in Benchling for the order.

Parameters:
  • program_id (str) – The program ID for the order.

  • target_id (str) – The target ID for the order.

  • iteration (str) – The iteration number for the order.

Returns:

Response containing folder creation details including registry_folder_id and iteration_folder_id.

Return type:

Dict

zcloud.benchling_order.register_protein_entities(protein_registry_folder_id, small_table_data)[source]

Register protein entities in Benchling.

Parameters:
  • protein_registry_folder_id (str) – The folder ID in Benchling where proteins should be registered.

  • small_table_data (List[Dict[str, str]]) – List of dictionaries containing protein data to register.

Returns:

Response containing registration details including aaSequences with entity registry IDs.

Return type:

Dict

zcloud.benchling_order.publish_benchling_entry(benchling_entry_query_dict)[source]

Create a new entry in Benchling.

Parameters:

benchling_entry_query_dict (Dict) – Dictionary containing entry data including sequence records, CSV data, entry name, GCS bucket path, author email, Monday ticket URL, and iteration folder ID.

Returns:

Response from the Benchling service indicating success or failure of entry creation.

Return type:

Dict

Troubleshooting

Common Issues and Solutions

“Required columns missing from input CSV”

Ensure your CSV includes at minimum sequence and tag_location columns.

“Multiple program IDs found from different sources”

Use --allow-cli-override to force CLI values, or ensure consistency between CSV columns and CLI arguments.

“Unable to find design names”

Choose option 2 to save generated names to a file, correct your original CSV, and re-run.

“Max recursion depth reached”

Too many correction attempts. Start over with corrected input data.

Score columns not found in SCU schema

Score columns that can’t be matched are uploaded as orphaned_score_columns.csv. Consider registering new columns in your SCU schema if they represent valid computed metrics.

Benchling folder creation failed

Check that your program/target/iteration combination is valid and that you have appropriate Benchling permissions.

Environment Setup

Ensure your environment has proper credentials configured for:

  • Google Cloud Storage access

  • Benchling API authentication

  • Monday.com API access (if using ticket validation)

VMs in the ZipTx cluster are generally configured via Workload Identity Federation and use the Compute Engine service account, which should already be authorized. When using another machine with something like gcloud auth application-default login, have your admin grant your user account the necessary permissions.

The tool will attempt to provide specific error messages if authentication fails for any service.

Tips for Success

  1. Start simple: Begin with a minimal CSV (just sequences and tag locations) and let the tool guide you through adding metadata.

  2. Use Monday integration: Providing --monday-ticket-link enables consistency checking against your project management system.

  3. Keep raw data: The tool preserves your original CSV alongside any corrections, ensuring data provenance.

  4. Leverage fuzzy matching: Don’t worry about minor typos; the tool’s fuzzy matching will help you correct them interactively. Be cautious only when the metadata source of truth contains many similar values.

  5. Double-check the source of truth: The tool validates metadata against the source of truth; if your program, target, or binding site is not registered, the tool will fail and not proceed.