Order Uploader Documentation
============================

The Order Uploader is a command-line application that simplifies uploading protein sequence data to cloud storage (GCP) and to the Benchling Electronic Lab Notebook (ELN) platform. It handles data validation, metadata consistency checking, and some error correction, so that your experimental data is properly archived and accessible for future analysis.

.. contents:: Table of Contents
   :local:
   :depth: 2

Overview
--------

The Order Uploader processes CSV files containing protein sequence data and associated metadata, validating them against your organization's schema and metadata standards. It can infer some missing information, correct common typos through fuzzy matching, and organize score data from Scientific Compute Units (SCUs) into appropriately structured tables.

Key features:

* **Design name generation**: Automatically construct design names from metadata components
* **Fuzzy matching**: Correct typos in metadata fields with user confirmation
* **Score column organization**: Automatically categorize and upload score data to the appropriate SCU tables
* **Interactive error handling**: Guided correction of data inconsistencies
* **Benchling integration**: Upload to Benchling ELN with proper folder organization
* **Monday.com integration**: Validation against project tickets for consistency

Quick Start
-----------

.. note::

   The cluster wrapper is installed at ``/runtime/scripts/order_uploader`` and should be callable from the head node as ``order_uploader``.

Basic usage requires a CSV file with, at minimum, ``sequence`` and ``tag_location`` columns, plus the program ID. The program ID must be registered in the metadata source of truth with a Benchling Program ID that matches the name of the top-level project folder in Benchling.
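
As a quick sanity check before invoking the uploader, you can verify the mandatory columns yourself. This standalone snippet uses only the standard library and is not part of the tool; the function name is illustrative:

```python
import csv

# Minimum columns the uploader expects (per the paragraph above)
REQUIRED_COLUMNS = {"sequence", "tag_location"}

def missing_required_columns(csv_path):
    """Return the set of mandatory columns absent from the CSV header."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle), [])
    return REQUIRED_COLUMNS - set(header)
```

An empty return value means the file at least has the mandatory header columns; anything else lists what is missing.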

The Benchling API, as of this release, does not support creating top-level project folders; you must use an existing folder or create one manually.

It is recommended to also provide a ``design_name`` column:

.. code-block:: bash

   order_uploader --input-csv-path my_sequences.csv --user-id your.email@ziptx.bio --program-id PROG123

**Example CSV**:

.. csv-table:: Basic CSV format
   :header: "design_name", "sequence", "tag_location"
   :widths: 30, 50, 20

   "IL11_S1_Ic1_001", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFE", "N-term"
   "IL11_S1_Ic1_002", "AKQRQISFVKSHFSRQLEE", "C-term"
   "IL11_S1_Ic1_GSFus003", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSGGGGGGGGGGGGGGGGGGGGGSSSSSSSSS", "N-term"

If you do not provide a ``design_name`` column, or wish to override the design names where relevant, you can provide the metadata as CLI arguments:

.. code-block:: bash

   order_uploader \
       --input-csv-path my_sequences.csv \
       --program-id PROG123 \
       --target-id IL11 \
       --binding-site-id S1 \
       --iteration-number Ic1 \
       --user-id your.email@ziptx.bio \
       --monday-ticket-link "https://ziptx.monday.com/boards/123/items/456"

You can also provide the metadata as columns in your CSV:

.. csv-table:: CSV with metadata columns
   :header: "program_id", "target_id", "binding_site", "iteration", "fusion_id", "design_name", "sequence", "tag_location"
   :widths: 15, 15, 15, 15, 15, 25, 40, 15

   "PROG123", "IL11", "S1", "Ic1", "", "IL11_S1_Ic1_001", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFE", "N-term"
   "PROG123", "IL11", "S1", "Ic1", "", "IL11_S1_Ic1_002", "AKQRQISFVKSHFSRQLEE", "C-term"
   "PROG123", "IL11", "S1", "Ic1", "GSFus", "IL11_S1_Ic1_GSFus003", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSGGGGGGGGGGGGGGGGGGGGGSSSSSSSSS", "N-term"

Be aware that conflicts between in-CSV values and CLI arguments are resolved in favor of the CLI only if you pass the ``--allow-cli-override`` flag. Otherwise, the tool will fail on a conflict and ask you to resolve it.
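
The precedence rule above can be sketched roughly as follows. This is an illustration of the documented behavior, not the tool's actual implementation; the helper name and error message are hypothetical:

```python
def resolve_metadata(csv_value, cli_value, allow_cli_override=False):
    """Sketch of the CLI-vs-CSV precedence rule (hypothetical helper)."""
    if cli_value is None:
        return csv_value
    if csv_value is None or csv_value == cli_value:
        return cli_value
    if allow_cli_override:
        return cli_value  # CLI wins only when the override flag is set
    raise ValueError(
        f"Conflicting values: CSV has {csv_value!r}, CLI has {cli_value!r}; "
        "re-run with --allow-cli-override or fix the input"
    )

# Matching values pass through unchanged
print(resolve_metadata("PROG123", "PROG123"))  # PROG123
# A conflict with the override flag resolves to the CLI value
print(resolve_metadata("PROG123", "PROG124", allow_cli_override=True))  # PROG124
```

Without ``allow_cli_override``, the same conflict raises instead of silently choosing a side, which mirrors the fail-and-ask behavior described above.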

It is generally recommended either to construct the names yourself, or to rely on either the CLI arguments or the CSV columns, but not to mix and match: there may be edge cases where the implementation cannot handle mixed input.

Additional files
----------------

You can also upload additional files to the order bucket, such as configuration files or supplementary data. Directory paths are supported: the tool recursively uploads all files under any directory path you provide, preserving the directory structure. Note that symlinks can break traversal, so avoid them here. The Benchling order will include a top-level pointer to the GCS "folder" that contains the additional files.

.. code-block:: bash

   order_uploader --input-csv-path my_sequences.csv --user-id your.email@ziptx.bio --additional-upload-paths config.json /path/to/analysis_plots/

Usage Scenarios
---------------

Scenario 1: Well-Formatted CSV Upload
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**When to use**: You have a complete, properly formatted CSV with design names, sequences, and tag locations.

**Example CSV** (``complete_order.csv``):

.. csv-table:: Complete order CSV
   :header: "design_name", "sequence", "tag_location"
   :widths: 30, 50, 20

   "IL11_S1_Ic1_001", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ...", "N-term"

**Command**:

.. code-block:: bash

   order_uploader \
       --input-csv-path complete_order.csv \
       --user-id researcher@ziptx.bio

**What happens**:

1. The tool validates design names against the metadata schema
2. Confirms all required columns are present
3. Checks for any additional score columns (none in this case)
4. Creates the appropriate Benchling folders and uploads the data
5. Registers protein entities and generates registry IDs

Scenario 2: CSV Without Design Names
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**When to use**: You have sequences and metadata but need the tool to generate design names.

**Example CSV** (``sequences_only.csv``):

.. csv-table:: Sequences without design names
   :header: "sequence", "tag_location", "program", "target_id", "binding_site", "iteration"
   :widths: 30, 15, 15, 15, 15, 15

   "MKTAYIAKQRQISFVKSHFS...", "N-term", "PROG123", "IL11", "S1", "Ic1"
   "AKQRQISFVKSHFSRQLEE...", "C-term", "PROG123", "IL11", "S1", "Ic1"

This will generate design names like ``IL11_S1_Ic1_001``, ``IL11_S1_Ic1_002``, and so on.

**Command**:

.. code-block:: bash

   order_uploader \
       --input-csv-path sequences_only.csv \
       --user-id researcher@ziptx.bio

**What happens**:

1. The tool extracts metadata from the CSV columns
2. Validates each metadata component against the schema
3. Generates design names: ``IL11_S1_Ic1_001``, ``IL11_S1_Ic1_002``
4. Proceeds with the normal upload workflow

Scenario 3: Conflicting Data Requiring Correction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**When to use**: Your CSV has inconsistencies that need user intervention.

**Example**: CSV with design names that do not match the inferred metadata.

**Example CSV** (``conflicted_order.csv``):

.. csv-table:: Conflicted data CSV
   :header: "design_name", "sequence", "tag_location", "target_id"
   :widths: 30, 30, 20, 20

   "IL11_S1_Ic1_001", "MKTAYIAK...", "N-term", "FasR"
   "IL11_S1_Ic1_002", "AKQRQISF...", "C-term", "IL11"

**What happens**:

1. The tool detects the ``target_id`` mismatch in the first row (``IL11`` in the design name vs. ``FasR`` in the column)
2. Prompts the user with options:

   .. code-block:: text

      Design names clash with expectations based on the input configuration
      WARNING: if you choose (1), the 'raw' csv will be the original (mismatched) one provided at the beginning.
      Unable to find design names. Please select an option:
      1. Use the generated design names, continue to upload
      2. Save the generated names to file, and exit
      3. Exit

3. If option 2 is chosen, the tool saves a corrected CSV for future use but does not upload to Benchling.
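
The name generation and mismatch detection in Scenarios 2 and 3 can be sketched as follows. This is an illustration only: the regex and function names are hypothetical, and the real tool validates components against the metadata source of truth rather than a local pattern:

```python
import re

# Illustrative pattern for the ZipTx convention Target_Site_Iteration_[Fusion]Number,
# e.g. IL11_S1_Ic1_001 or IL11_S1_Ic1_GSFus003 (not the tool's actual regex)
NAME_RE = re.compile(
    r"^(?P<target>[^_]+)_(?P<site>[^_]+)_(?P<iteration>I[a-z1-9]\d+)_"
    r"(?P<fusion>[A-Za-z]*)(?P<number>\d{3})$"
)

def build_design_name(target, site, iteration, number, fusion=""):
    """Assemble a design name with a zero-padded 3-digit sequential number."""
    return f"{target}_{site}_{iteration}_{fusion}{number:03d}"

def find_mismatches(design_name, target_id):
    """Return the parsed component that disagrees with the CSV metadata, if any."""
    m = NAME_RE.match(design_name)
    if m is None:
        return {"design_name": design_name}  # name does not follow the convention
    return {"target_id": m["target"]} if m["target"] != target_id else {}

print(build_design_name("IL11", "S1", "Ic1", 1))   # IL11_S1_Ic1_001
print(find_mismatches("IL11_S1_Ic1_001", "FasR"))  # {'target_id': 'IL11'}
```

The first call shows how a sequential number becomes the zero-padded suffix; the second reproduces the Scenario 3 conflict, where the name says ``IL11`` but the ``target_id`` column says ``FasR``.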

Scenario 4: Handling Typos with Fuzzy Matching
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**When to use**: Your metadata contains typos or slight variations from the registered values.

**Example**: Using ``PROG12`` instead of ``PROG123``, or ``IL1`` instead of ``IL11``.

**Command**:

.. code-block:: bash

   order_uploader \
       --input-csv-path order_with_typos.csv \
       --program-id PROG12 \
       --target-id IL1 \
       --user-id researcher@ziptx.bio

**What happens**:

1. The tool fails to find an exact match for ``PROG12``
2. Queries the metadata tables for fuzzy matches
3. Presents options:

   .. code-block:: text

      PROG12 does not match any registered names. Did you mean PROG123? (y/n):

4. The user confirms the correction
5. The tool continues with the validated metadata

For multiple fuzzy matches:

.. code-block:: text

   Please select the number of the fuzzy match you meant to check for in program_table:
   1. PROG123
   2. PROG124
   3. PROG125
   4. Enter a new value

Scenario 5: CSV with Score Columns
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**When to use**: Your CSV includes additional score columns from Scientific Compute Units (SCUs).

**Example CSV** (``order_with_scores.csv``):

.. csv-table:: CSV with score columns
   :header: "design_name", "sequence", "tag_location", "stability_score", "binding_affinity", "expression_level"
   :widths: 25, 25, 15, 15, 15, 15

   "IL11_S1_Ic1_001", "MKTAYIAK...", "N-term", "0.85", "1.2e-9", "145.2"
   "IL11_S1_Ic1_002", "AKQRQISF...", "C-term", "0.92", "8.7e-10", "162.8"

**What happens**:

1. The tool identifies ``stability_score``, ``binding_affinity``, and ``expression_level`` as potential score columns
2. Validates each against the SCU schema endpoints
3. Groups validated columns by their SCU table assignments
4. Uploads the core sequence data to the main table
5. Uploads the score columns to the appropriate SCU-specific tables (e.g., ``stability_table.csv``, ``binding_table.csv``)

If score columns have typos:

.. code-block:: text

   Unable to find these columns in the registered SCU values: {'expresion_level'}
   WARNING: If you choose a fuzzy match, your score tables uploaded will not match the raw uploaded csv
   Would you like to check for fuzzy matches? (y/n):

The user can choose to:

* Accept fuzzy matches with confirmation
* Manually rename the columns
* Keep orphaned columns (uploaded as ``orphaned_score_columns.csv``)

Design Naming Convention
------------------------

ZipTx follows a specific naming convention for protein designs. Understanding this format is crucial for using the Order Uploader effectively.

Format: **Target_Site_Iteration_Number**

Components
~~~~~~~~~~

**Target**
    The target protein identifier (e.g., ``IL11``, ``FasR``, ``IL6``, ``VEGFCD``)

**Site**
    The binding site identifier, which can be:

    * Simple site numbers: ``S1``, ``S2``, ``S3``
    * Complex site identifiers: ``S3ab``, ``S1c``
    * Named sites: ``epitope1``, ``binding_domain``

**Iteration**
    The design iteration, following the format ``I{letter}{number}``:

    * **Letter component**:

      * ``a-z``: iteration letter when the previous design iterations were unsuccessful (``Ia``, ``Ib``, ``Ic``)
      * ``1-9``: iteration marker when a previous design iteration was successful (``I1``, ``I2``, ``I3``)

    * **Number component**: an integer starting from 1, with no zero-padding

    **Note**: Once the first successful design iteration is found, all future designs increment the number and retain the letter, even if a later design is unsuccessful.

**Number**
    Sequential design number within the iteration, zero-padded to 3 digits (``001``, ``002``, ``012``)

Examples
~~~~~~~~

.. list-table:: Design naming examples
   :header-rows: 1
   :widths: 30, 70

   * - Design Name
     - Description
   * - ``IL11_S1_Ia1_001``
     - First iteration (no previous attempts), IL-11 site 1, design 1
   * - ``FasR_S1_Ic1_005``
     - Third iteration (two unsuccessful previous attempts), FasR site 1, design 5
   * - ``IL6_S3ab_Ia3_012``
     - Third iteration (building on previously successful designs), IL-6 site 3ab, design 12

Fusion Proteins
~~~~~~~~~~~~~~~

For designs fused to drug backbones, add the 3-letter drug code before the number:

.. list-table:: Fusion protein naming examples
   :header-rows: 1
   :widths: 30, 70

   * - Design Name
     - Description
   * - ``VEGFC/D_S1_Ia4_Eyl001``
     - VEGFC/D target, site 1, 4th iteration, Eyl drug backbone, design 1

The tool automatically handles fusion protein naming when ``--fusion-id`` is provided or fusion information is included in the CSV.

CLI Reference
-------------

.. autofunction:: zcloud.console_scripts.order_uploader.upload_order

Core Parameters
~~~~~~~~~~~~~~~

``--input-csv-path`` (required)
    Path to your input CSV file containing sequence data.

``--user-id``
    Your email address, for attribution in Benchling entries.

``--monday-ticket-link``
    URL of the Monday.com ticket for this design campaign. Used for consistency validation.

Metadata Parameters
~~~~~~~~~~~~~~~~~~~

These can be provided via the CLI or included as columns in your CSV:

``--program-id``
    The program identifier for this design campaign.

``--target-id``
    The target protein identifier.

``--binding-site-id``
    The binding site identifier on the target.

``--fusion-id``
    The fusion construct identifier.

``--iteration-number``
    The design iteration identifier.

Advanced Options
~~~~~~~~~~~~~~~~

``--allow-cli-override``
    Allow CLI arguments to override CSV data when conflicts arise. Default: False.

``--additional-upload-paths``
    Additional files or directories to upload alongside the main data. Useful for configuration files or supplementary data.

    .. code-block:: bash

       --additional-upload-paths config.json analysis_plots/

Required CSV Columns
--------------------

**Mandatory columns**:

* ``sequence``: The protein amino acid sequence
* ``tag_location``: Location of any tags (typically ``N-term``, ``C-term``, or ``internal``)

**Optional columns**:

* ``design_name``: Complete design name following the ZipTx convention (generated if missing)
* ``program``, ``target_id``, ``binding_site``, ``fusion_id``, ``iteration``: Metadata components for name generation
* Any additional columns are treated as score data and validated against the SCU schemas

API Reference
-------------

For developers and power users, key functions include:

Core Validation Functions
~~~~~~~~~~~~~~~~~~~~~~~~~

.. autofunction:: zcloud.benchling_order.check_program_id
.. autofunction:: zcloud.benchling_order.check_target_id
.. autofunction:: zcloud.benchling_order.check_binding_site_id
.. autofunction:: zcloud.benchling_order.check_fusion_id

Candidate Resolution
~~~~~~~~~~~~~~~~~~~~

.. autofunction:: zcloud.console_scripts.order_uploader.confirm_single_value_from_user
.. autofunction:: zcloud.console_scripts.order_uploader.confirm_set_of_values_from_user

Fuzzy Matching and Error Handling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autofunction:: zcloud.console_scripts.order_uploader.check_metadata_against_oracle_with_fuzzy_find_on_fail
.. autofunction:: zcloud.console_scripts.order_uploader.ask_user_to_confirm_fuzzy_match

SCU and Score Column Processing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autofunction:: zcloud.console_scripts.order_uploader.check_scu_against_oracle_with_fuzzy_find_on_fail
.. autofunction:: zcloud.benchling_order.check_scu_schema

Design Name Generation
~~~~~~~~~~~~~~~~~~~~~~

.. autofunction:: zcloud.benchling_order.pdapply_build_design_names_from_row
.. autofunction:: zcloud.benchling_order.check_generated_design_names

Benchling Integration
~~~~~~~~~~~~~~~~~~~~~

.. autofunction:: zcloud.benchling_order.create_benchling_order_folder
.. autofunction:: zcloud.benchling_order.register_protein_entities
.. autofunction:: zcloud.benchling_order.publish_benchling_entry

Troubleshooting
---------------

Common Issues and Solutions
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**"Required columns missing from input CSV"**
    Ensure your CSV includes at minimum the ``sequence`` and ``tag_location`` columns.

**"Multiple program IDs found from different sources"**
    Use ``--allow-cli-override`` to force the CLI values, or ensure consistency between the CSV columns and the CLI arguments.

**"Unable to find design names"**
    Choose option 2 to save the generated names to a file, correct your original CSV, and re-run.

**"Max recursion depth reached"**
    Too many correction attempts. Start over with corrected input data.

**Score columns not found in SCU schema**
    Score columns that cannot be matched are uploaded as ``orphaned_score_columns.csv``. Consider registering new columns in your SCU schema if they represent valid computed metrics.

**Benchling folder creation failed**
    Check that your program/target/iteration combination is valid and that you have the appropriate Benchling permissions.

Environment Setup
~~~~~~~~~~~~~~~~~

Ensure your environment has credentials configured for:

* Google Cloud Storage access
* Benchling API authentication
* Monday.com API access (if using ticket validation)

VMs in the ZipTx cluster are generally configured via Workload Identity Federation and use the Compute Engine service account, which should already be authorized. When using another machine with something like ``gcloud auth application-default login``, have your admin grant your user account all the necessary permissions. The tool will attempt to provide specific error messages if authentication fails for any service.

Tips for Success
~~~~~~~~~~~~~~~~

1. **Start simple**: Begin with a minimal CSV (just sequences and tag locations) and let the tool guide you through adding metadata.
2. **Use Monday integration**: Providing ``--monday-ticket-link`` enables consistency checking against your project management system.
3. **Keep raw data**: The tool preserves your original CSV alongside any corrections, ensuring data provenance.
4. **Leverage fuzzy matching**: Don't worry about minor typos (unless there are many similar values in the metadata source of truth); the tool's fuzzy matching will help you correct them interactively.
5. **Double-check the source of truth**: The tool validates your metadata against the source of truth; if your program, target, or binding site is not registered, the tool will fail rather than proceed.
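
The interactive fuzzy matching described in this document can be approximated with the standard library's ``difflib``. This is a rough sketch of the idea only, not the tool's actual matcher, and the registered values below are made up:

```python
from difflib import get_close_matches

# Hypothetical registered values from the metadata source of truth
REGISTERED_PROGRAMS = ["PROG123", "PROG124", "PROG125", "XYZ001"]

def suggest_corrections(value, registered, cutoff=0.6):
    """Return close matches to offer the user when an exact match fails."""
    if value in registered:
        return [value]  # exact match, nothing to correct
    return get_close_matches(value, registered, n=3, cutoff=cutoff)

print(suggest_corrections("PROG123", REGISTERED_PROGRAMS))  # ['PROG123']
print(suggest_corrections("PROG12", REGISTERED_PROGRAMS))   # three PROG12x candidates
```

When several candidates tie (as with ``PROG12``), the tool presents a numbered menu like the one shown in Scenario 4 instead of picking one silently.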