CyVerse: Cyberinfrastructure for open science

CyVerse, the largest publicly funded open-source research cyberinfrastructure for the life sciences, has played a crucial role in advancing data-driven research since the 2010s. As the technology landscape evolved with the emergence of cloud computing platforms and machine learning and artificial intelligence (AI) applications, CyVerse has enabled access to these new technologies by providing interfaces, Software as a Service (SaaS), and cloud-native Infrastructure as Code (IaC). CyVerse services enable researchers to integrate institutional and private computational resources, add custom software, perform analyses, and publish data in accordance with open science principles. Over the past 13 years, CyVerse has registered more than 124,000 verified accounts from 160 countries and has been used in over 1,600 peer-reviewed publications. Since 2011, 45,000 students and researchers have been trained to use CyVerse. The platform has been replicated and deployed in three countries outside the US, with additional private deployments on commercial clouds for US government agencies and multinational corporations. In this manuscript, we present a strategic blueprint for creating and managing SaaS cyberinfrastructure and IaC as free and open-source software.

Atmosphere is CyVerse's foundational cloud service. A hybrid cloud service is operated using HTCondor [46] and Kubernetes [43,113,114] for the DE's executable and interactive apps.

High Performance Computing
CyVerse has partnered with multiple organizations within the NSF-supported XSEDE [104] (now ACCESS-CI) to connect users with HPC resources (Fig A). The DE framework allows researchers to seamlessly launch jobs on HPC. In practice, jobs are launched at TACC, the San Diego Supercomputer Center (SDSC), the National Center for Supercomputing Applications (NCSA), and UArizona HPC. At TACC, CyVerse leverages the Tapis v3 [107,115] web-based API framework for securely managing computational workloads across infrastructure and institutions.

High Throughput Computing
High Throughput Computing (HTC) describes multiple simultaneous processes (jobs) which run in parallel or sequentially across many computational processors (cores). Examples of HTC workflows include genome assemblies and the processing and reconstruction of signals and images. HTCondor integration with the OSG [46] in the DE allows researchers to launch jobs with tens to hundreds of simultaneous processes across the entire OSG framework (Fig A). CyVerse hosts nodes as part of the OSG pool. The CyVerse DE reduces the complexity of using HTCondor by providing GUI-based tools and templates for users to design their own workflows which run on HTC. CyVerse resources can also connect to the new PATh project.
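To make the fan-out pattern concrete, the sketch below generates an HTCondor submit description that queues many copies of one job, each parameterized by its process number. The executable name is hypothetical; the submit-file keywords (`executable`, `arguments`, `queue N`) are standard HTCondor syntax.

```python
# Sketch: building an HTCondor submit description for a batch of
# high-throughput jobs. "assemble_genome.sh" is an illustrative
# placeholder, not a CyVerse-provided tool.

def make_submit_description(executable: str, n_jobs: int) -> str:
    """Queue n_jobs processes, each receiving its process index."""
    return "\n".join([
        f"executable = {executable}",
        "arguments = $(Process)",        # each job gets a unique index
        "output = job_$(Process).out",
        "error  = job_$(Process).err",
        "log    = batch.log",
        f"queue {n_jobs}",               # fan out into n_jobs parallel jobs
    ])

desc = make_submit_description("assemble_genome.sh", 100)
print(desc)
```

In the DE, this boilerplate is hidden behind the GUI app templates; the generated description is what the scheduler ultimately consumes.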

Data Storage
Data are stored across multiple resource servers at UArizona and at TACC, which coordinate the management of over 8 PB of user-contributed data in an Integrated Rule Oriented Data System (iRODS) [116] "Data Store". Data are replicated (mirrored) nightly between UArizona and TACC's Corral [117], a petascale storage and data management resource. OpenStack cloud services nodes and DE-allocated processing nodes hold temporary (scratch) data while they are in use. During an analysis in the DE or on OpenStack, data can be moved anywhere across the internet or copied back to the Data Store when the analyses are completed.

Databases
CyVerse operates multiple databases within its infrastructure. These include a list of all registered users and their host institutions, the Data Store iRODS database for user and community file storage, metadata required by the DE's applications and tools, and a PostgreSQL database [118,119] which makes file and folder metadata and contents queryable in the DE. CyVerse uses ElasticSearch [120] for indexing and searching data in the Data Store.

Foundational Services
CyVerse Foundational Services (Main Text Fig 2) provide the linkage between end-user platforms and hardware resources. These services are often referred to as 'middleware' and serve as the glue holding the rest of the cyberinfrastructure together. CyVerse provides a federated authentication service for its users, a user portal where platforms and services can be requested, an API for launching resources in the DE from 3rd party platforms, cloud services for managing virtual machines and clusters, and researcher support services called "Powered by CyVerse" which leverage one or more of these resources.

Authentication and Security
CyVerse services use Central Authentication Service (CAS) [121] and KeyCloak [122], which operate on the OAuth2.0 standard internet protocol [123] (Fig B). After creating an account, users authenticate via a single sign-on service in their internet browser (Fig C). Users are encouraged to provide their academic, government, or organizational email address and ORCID (Open Researcher and Contributor IDentifier) [124] when creating profiles. User information is private, in accordance with the European Union General Data Protection Regulation (EU-GDPR) [125]. Users can authenticate through KeyCloak using a CILogon [126], GitHub, Globus [127], or Google account. Once authenticated to CyVerse, only the authenticated user can access the secure Uniform Resource Locators (URLs, commonly known as 'web addresses') for the featured Platforms. Private URLs to running applications can be shared with other users through the DE interface after they have been started.

User Portal
Through the User Portal (Fig D), new users can create and manage their account, request access to featured platforms, and request to schedule workshops. The User Portal interface also provides hyperlinks to all of CyVerse's featured services.
Requests for access to featured services, platforms, workshops, and community released data folders are reviewed by CyVerse staff. Requests for federated CyVerse services, 'Powered by CyVerse' 3rd party platform projects, or replicated CyVerse deployments are sent through the ticketing system (Intercom.io). Requestors are contacted directly by CyVerse leadership to begin discussions and contracting agreements.

Terrain API
The DE uses JSON [128] for managing its REST API service called "Terrain". The Terrain API serves as the backbone of the DE and can be used outside of CyVerse's featured platforms via a Swagger RESTful API [128,129]. This allows users to build or define tools and workflows for their own services. CyVerse software engineers connect 3rd party projects to the Terrain API as part of the Powered by CyVerse feature. At TACC, the Tapis web service APIs support HPC jobs requested by DE users by marshaling data to and from the CyVerse Data Store, while also resiliently managing job submission and lifecycle on XSEDE and other HPC providers.
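As a rough illustration of programmatic Terrain use, the sketch below constructs (but does not send) a JSON job-submission request using only the standard library. The endpoint path, payload field names, and token are illustrative assumptions, not the authoritative Terrain schema; the Swagger documentation defines the real request shapes.

```python
# Sketch: building a Terrain-style REST request. The "/analyses" path
# and the "app_id"/"config" fields are assumed for illustration only.
import json
import urllib.request

BASE = "https://de.cyverse.org/terrain"  # Terrain API base URL (assumed)

def build_job_request(token: str, app_id: str, inputs: dict) -> urllib.request.Request:
    """Prepare an authenticated POST carrying a JSON job description."""
    payload = {"app_id": app_id, "config": inputs}
    return urllib.request.Request(
        f"{BASE}/analyses",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_job_request("my-token", "app-1234",
                        {"input_file": "/iplant/home/alice/reads.fastq"})
```

Sending the request (`urllib.request.urlopen(req)`) would submit the analysis; here we stop at construction so the sketch stays self-contained.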

Cloud-Native Services
CyVerse has developed and maintained an OpenStack cloud service [111,112] called 'Atmosphere' for the last 10 years [130]. The Atmosphere services were expanded and made available via 'Jetstream' (NSF award OAC 1445604) in 2016 [74] and again in 2021 with Jetstream2 (NSF award OAC 2005506). Atmosphere abstracts numerous complex operations required to manage and launch virtual machines in an OpenStack cloud, thus providing an easy-to-use browser interface for researchers. Featured base images of common Linux operating systems and Graphical User Interface (GUI) desktops provide users a workspace on which they can rapidly compile scientific software, run services, and analyze data.
As containers have become the dominant modality for software virtualization, the need for full deployment of virtual machines with base OS images and administrator access has changed. Users now bring their own precompiled containers of preferred operating systems and scientific software stacks into the cloud. In the second generation of CyVerse, the Cloud-Native Services team has incorporated Kubernetes (K8s) [43], Lightweight Kubernetes (Rancher K3s) [131], and Argo workflows [42] for orchestrating containers in its platforms. Orchestration and job scheduling allow CyVerse to simultaneously manage hundreds of users running interactive environments while thousands of jobs run across multiple computational platforms (HTCondor, HTC, HPC).

Platform Products
The CyVerse Data Store is part of the foundational services offered by the cyberinfrastructure. The DE enables users to run analyses and access the Data Store. CyVerse's focus on cloud has evolved from managing OpenStack instances via its Atmosphere client toward a continuous analysis platform which functions as Cloud-Native Services.

Data Store
The Data Store utilizes iRODS [116] running across a distributed array of storage nodes located at UArizona. Internal transfers between iRODS storage and computing nodes vary between 25-300 MB/s, depending on the storage type of the compute nodes (i.e., solid state drives [SSD] vs spinning disk hard drives). User data can be managed through CyVerse browser-based platforms including the DE and BisQue, or through terminal-based software such as iCommands and iRODS FUSE Lite. Data can also be uploaded or downloaded using third party software such as FileZilla, CyberDuck, and standard file browsers like Windows File Explorer.
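The throughput figures above translate directly into expected transfer times. The arithmetic sketch below, with a helper name of our own, estimates duration from file size and a sustained rate (decimal MB/s, as in the figures quoted here).

```python
# Sketch: expected transfer duration from size and sustained throughput.
# Purely illustrative arithmetic; rates come from the ranges above
# (25-300 MB/s internal, 140-220 MB/s external).

def transfer_seconds(size_bytes: int, throughput_mb_s: float) -> float:
    """Expected transfer time in seconds at a given sustained rate."""
    return size_bytes / (throughput_mb_s * 1_000_000)

# A 50 GB sequencing run at the low and high end of the external range:
slow = transfer_seconds(50_000_000_000, 140)  # roughly 6 minutes
fast = transfer_seconds(50_000_000_000, 220)  # roughly 4 minutes
```

Real transfers also pay per-file overhead, so many small files take longer than one large file of the same total size (see the WebDAV caching discussion below).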
The iRODS Data Store uses a conventional three-tiered Linux permission system (i.e., 'read', 'write', and 'own') for files and folders. Data are shared internally with iRODS by adding permissions on individual objects or collections for other CyVerse usernames, or for 'teams' made up of grouped usernames administered by a project owner. Data can also be shared with the entire CyVerse user base by adding the 'public' username group, or with the open internet by adding the 'anonymous' username group. The CyVerse metadata database is the primary repository for DE metadata storage. Based on the AVU-triple (Attribute-Value-Unit) foundation of the iRODS metadata database, the CyVerse metadata database allows an unlimited number of AVU combinations. For example, users can use the same attribute with more than one value or more than one type of unit, significantly expanding the degree of metadata that can be stored for data analysis and retrieval. AVUs are exposed to web crawlers, like schema.org, when they are shared with the 'anonymous' username group. These metadata allow both files and folders to become searchable using common search engines, e.g., Google or Bing.
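The AVU model described above maps naturally onto a list of triples, which is how repeated attributes with different values or units can coexist. A minimal sketch, with illustrative attribute names:

```python
# Sketch: modeling iRODS AVU (Attribute-Value-Unit) metadata. The
# sample attributes are hypothetical; the key point is that one
# attribute may carry several value/unit pairs.
from typing import NamedTuple

class AVU(NamedTuple):
    attribute: str
    value: str
    unit: str = ""

avus = [
    AVU("temperature", "21.5", "C"),
    AVU("temperature", "70.7", "F"),   # same attribute, different unit
    AVU("organism", "Zea mays"),
]

def values_for(avus: list, attribute: str) -> list:
    """All (value, unit) pairs recorded under one attribute name."""
    return [(a.value, a.unit) for a in avus if a.attribute == attribute]
```

A flat key-value dictionary could not express the two `temperature` entries; the list-of-triples shape is what makes the "unlimited AVU combinations" possible.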
Metadata templates for common metadata standards, e.g., the Dublin Core (https://www.dublincore.org/) and DataCite [140,141], provide pre-configured formats for users to apply standard AVUs to their data files and folders in the Discovery Environment. The AVUs are visible publicly via the Data Commons website and programmatically through iRODS and its APIs.
WebDAV [132,133] is an extension of the HTTP communication protocol that allows users to collaboratively manage files and folders stored remotely. The CyVerse WebDAV service provides a TLS-encrypted WebDAV interface to the Data Store. It complies with the WebDAV Class 2 standard, meaning it supports standard file browser features like file reading/downloading, file writing/uploading, and folder creation, plus multi-user access features like file locking. A user may navigate the folder hierarchy and view data through a common web browser using this service. They may also use a common file browser like Windows File Explorer, macOS Finder, or any tool that understands WebDAV, to work with CyVerse files and folders as if they were local. Since WebDAV is a standard, open protocol with significant library support, a user may interact with the service programmatically. The service respects Data Store data access controls. For a user to access data that are not anonymously available, the user must authenticate using CyVerse credentials. Files downloaded through this service do not have the same tracking or internal analytics as files moved or downloaded using iRODS iCommands. However, downloading data sets consisting of many small files through this service can be many times faster than directly through iRODS due to caching. File caching is accomplished through an internal Varnish caching service [134]. For common use cases, the cache service has reduced data access times by 75%.
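Programmatic access is possible because WebDAV is plain HTTP with extra methods. The sketch below constructs (without sending) a PROPFIND request, the WebDAV method for listing a collection's immediate children, using only the standard library. The path and bearer-token auth scheme are illustrative assumptions; the CyVerse service may expect different credentials.

```python
# Sketch: a WebDAV directory-listing request. "Depth: 1" asks for the
# collection plus its direct children. Construction only; nothing is
# sent over the network here.
import urllib.request

def propfind_request(url: str, token: str) -> urllib.request.Request:
    return urllib.request.Request(
        url,
        headers={
            "Depth": "1",                          # list direct children
            "Authorization": f"Bearer {token}",    # auth scheme assumed
        },
        method="PROPFIND",
    )

req = propfind_request("https://data.cyverse.org/dav/iplant/home/alice/",
                       "my-token")
```

Sending the request would return an XML multistatus body enumerating the folder's contents, which any WebDAV client library can parse.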
The DataWatch API triggers code or runs workflows when specific "data events" take place, for example when a specific file type is uploaded to a specific folder, or when an analysis completes and its results are returned to the Data Store. DataWatch enables email notifications at pre-specified data events and will work in concert with event-driven webhooks to utilities or URLs.
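The event-trigger pattern DataWatch implements can be reduced to a minimal in-process dispatcher, sketched below. The event name and the notification handler are hypothetical stand-ins; the real service reacts to Data Store events and sends email or webhook calls.

```python
# Sketch: a toy event dispatcher illustrating the DataWatch pattern.
# "file_uploaded" and the list-appending handler are illustrative only.
from collections import defaultdict

class DataEvents:
    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, event: str, handler):
        """Register a callback for a named data event."""
        self._handlers[event].append(handler)

    def fire(self, event: str, **info):
        """Invoke every handler registered for this event."""
        for handler in self._handlers[event]:
            handler(**info)

events = DataEvents()
notifications = []
# e.g. notify the owner when a file lands in a watched folder
events.on("file_uploaded",
          lambda path, **_: notifications.append(f"uploaded: {path}"))
events.fire("file_uploaded", path="/iplant/home/alice/reads.fastq")
```

In production the "handler" would be an email sender or an outbound webhook rather than an in-memory list.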
The Data Commons hosts both curated and community released data. The Data Commons can be used to publish data (with a DataCite DOI) for which there are no other existing canonical repositories, or for which hosting on other research data services would not be feasible, such as very large datasets or data that need to be linked to analysis tools on CyVerse. Where appropriate, CyVerse encourages publication to canonical data repositories such as the National Center for Biotechnology Information (NCBI) [135-137]. CyVerse also provides tools for publishing sequence data directly to NCBI's Sequence Read Archive (SRA) [138,139]. Curated data in the Data Commons are published via DataCite [140,141] and receive a DOI. In addition to publication of static datasets with a DOI, shared data can exist as writable (editable) archives that can be modified by their owners in the Data Store's 'Community Released' projects. Community released data folders can be shared publicly as 'read-only' files over WebDAV (https://data.cyverse.org) and the Data Commons (https://datacommons.cyverse.org) once shared with the 'anonymous' username group. Users can request DataCite DOI publication through the User Portal or by 'Publishing' their data in the Discovery Environment. Metadata are applied to community released and curated folders using the Discovery Environment's metadata template AVUs. The steps for requesting either a Community Released folder or a Curated folder are described in the user documentation (https://learning.cyverse.org/ds/doi/). Every Curated DOI requestor must complete the DataCite metadata template with required fields in the Discovery Environment. The template is submitted for review, where CyVerse DOI experts communicate with the authors to ensure that all fields are complete and meet the DataCite standards. Once the DOI has been granted, all write and own permissions are removed from the folder and it is transferred to the '/curated' folder space as 'read-only'.
Federated Storage: CyVerse allows its community to federate their own storage resource servers from their institutions or from within CyVerse facilities. The new storage servers are added as 'resources' to the CyVerse iRODS zone, and can be kept accessible only to those community owners. These additional data storage devices can have their data replicated (mirrored) at UArizona and at TACC.

Discovery Environment
The Discovery Environment (DE) is a multi-function 'data science workbench' with numerous applications designed for accessibility in the browser [142,143]. The DE user interface features a table of contents on the left side which provides access to the user's data space and community data, to applications and their integration, and to running analyses. The DE uses the concept of "Apps" (Applications), which provide UI fields for input file paths and directories, or for abstracting command line interface (CLI) parameter fields. "Tools" are for bringing your own containers to the workbench. A "Tool" is a metadata template with information about a public Docker image: its metadata description, attribution, and version, as well as its cached public registry location and tag name. The Tool can (re)set the image's working directory, open ports, and change its entrypoint. Once these Tool parameters are established, an "App" can be created which will use a specific Tool. Multiple types of FOSS programs can be integrated into the DE. These programs are defined as executable, interactive, high-throughput, or high-performance "Tools" [143] (S7 Table). Jobs are run on numerous different infrastructures, which are physically located at UArizona (CyVerse), at TACC, or on the Open Science Grid. Each "App" is managed by a different type of scheduler or job handler depending on its type. Docker images from public container registries, e.g., DockerHub [152], GitHub Container Registry [153], QUAY.io [154], BioContainers.pro [155], and NVIDIA GPU Cloud [156], can be integrated. When users wish to publish a DE App to the community, the Tool image is reviewed by CyVerse staff. Once approved, it is added to the Harbor [157] private container registry (Fig H). Public tools are cached on the DE's processing nodes and in a public/private Harbor registry maintained by CyVerse (Fig I).
This enables dramatically faster launches, particularly of large containers typical of data science applications (e.g., JupyterLab with numerous Data Science Python libraries [48,158], RStudio TidyVerse or Geospatial [47,159,160]) (Fig G). Tools that are integrated as high-throughput Apps on OSG must be converted from Docker to Singularity [161]. These Singularity images are cached on the OSG's CernVM File System (CVMFS) [162] scratch file system for rapid deployment across OSG's international network. Executable Apps are defined as non-interactive CLI applications that require inputs, parameters, or flags, and defined output file and directory names. Example applications include the most common tools for genomic analyses [170], as well as applications written in scripting languages like Python and R. Executable jobs are managed by HTCondor and can be run individually, sequentially as a scientific workflow, or in parallel batches over a set of input files (Fig J).
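The Docker-to-Singularity conversion mentioned above is typically a single `singularity build` invocation against a Docker registry reference. A minimal sketch that forms the command (without executing it); the image reference and output filename are illustrative:

```python
# Sketch: the command that converts a Docker image into a Singularity
# SIF file, as done for OSG high-throughput Apps. We only construct the
# argument list here; nothing is executed.

def docker_to_singularity_cmd(docker_ref: str, sif_path: str) -> list:
    """singularity build <out.sif> docker://<ref>"""
    return ["singularity", "build", sif_path, f"docker://{docker_ref}"]

cmd = docker_to_singularity_cmd("biocontainers/samtools:v1.9",
                                "samtools.sif")
```

Passing the list to `subprocess.run(cmd, check=True)` on a host with Singularity installed would perform the actual conversion.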
Interactive Apps refer to any application which has its own GUI or IDE. Interactive Apps are deployed with Kubernetes, which provisions and launches the container on its own secure URL (Fig I). The visual interactive compute environment (VICE) component of the DE acts as a graphical interface for common IDE platforms such as RStudio [47], JupyterLab [146], Visual Studio Code [49], and Remote Desktops (noVNC) [147], and for browser-based GUIs such as R Shiny [164], Python Flask [165], Java [166], or JavaScript [167]. Community efforts to containerize browser-based RStudio Server from the Rocker Project [159,160] provide researchers with hundreds of libraries built by the global R community. Project Jupyter [48,168] similarly supports a vast array of libraries written in Python, as well as other languages as add-in kernels. Remote Desktop applications (over HTTP) allow users to work in their familiar desktop environments, and extended applications allow for server-side hardware rendering of large 3D visualizations with GPUs.
High Performance Apps are defined as jobs which require multiple nodes using the Message Passing Interface (MPI) [171] or Open Multi-Processing (OpenMP) [172]. High performance apps are managed by TACC's Tapis (formerly Agave API) [173,174] and run at TACC, relying on the CyVerse Data Store mirror for faster data throughput. CyVerse has integration with the OSG via HTCondor, allowing users to deploy High Throughput Apps onto the OSG.
Three levels of metadata are operated by the DE: (1) Data Store object metadata, managed by iRODS; (2) User, Tool, and App metadata, managed in PostgreSQL and used by the featured search bar (Fig G); and (3) Metadata Templates, applied through the DE to Data Store objects, storing supplemental metadata in PostgreSQL for application in the Data Commons.
The DE brokers data access to the iRODS Data Store. When users run an App, the DE transfers input data from the Data Store to the local Linux file system. The App can also access the remote input data as if they were mounted in the local file system, without manual data transfers. When the job (HTCondor or Kubernetes) completes, the output data are written back to the Data Store. By default, all analysis data are saved under the user's /analyses folder, in a folder whose name defaults to the name of the app with the date and time of the application launch. The user can modify the name of the analysis output folder and change its location within the Data Store.
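The default output-folder convention just described can be sketched as a small naming function. The path prefix, separator, and timestamp format below are assumptions for illustration; only the "app name plus launch date and time under /analyses" structure comes from the text.

```python
# Sketch: deriving a default analysis output path. The exact format the
# DE uses may differ; this shows the app-name-plus-timestamp idea.
from datetime import datetime

def default_output_folder(username: str, app_name: str,
                          launched: datetime) -> str:
    """Default DE-style output folder under the user's analyses space."""
    stamp = launched.strftime("%Y-%m-%d-%H-%M-%S")
    return f"/iplant/home/{username}/analyses/{app_name}-{stamp}"

path = default_output_folder("alice", "fastqc",
                             datetime(2023, 6, 1, 12, 30, 0))
```

Embedding the launch time keeps repeated runs of the same App from colliding, which is why the DE defaults to it.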
The DE uses two CyVerse-developed open-source software packages, iRODS FUSE Lite and an iRODS CSI driver, to broker data access to the Data Store. iRODS FUSE Lite mounts data stored in an iRODS data store (e.g., the Data Store) on the local Linux file system and provides on-demand data access. The iRODS CSI driver manages these mounts in Kubernetes and facilitates data access in Tools. The DE integrates the Data Store using these packages.

Publishing custom Tools and Apps
The only prerequisite for users publishing their own private container apps is that the container image for the app must reside in a Docker registry configured to be trusted by the Discovery Environment, e.g., Harbor (https://harbor.cyverse.org). If the container image in an app publishing request is not in a trusted registry, then the request is sent to Discovery Environment administrators as a notification. This process allows administrators to review the container image before allowing the app to be published. If the container used by the app already resides in one of the registries that are trusted by the Discovery Environment, then the app is made available immediately.
Requiring apps to be in a trusted registry provides a few important benefits. First, it allows administrators to inspect for vulnerabilities any container images that are not already in one of the trusted registries. Second, it allows administrators to ensure that different image tags are used for different versions of the same container image, which helps to ensure reproducibility. Third, it can help to avoid errors caused by rate limiting from third-party registries. Whether or not administrator intervention is required to publish the app, all newly published apps are tagged as Beta apps. Only administrators can remove this tag, so users of the app can request an inspection before using the app if they want.
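The trusted-registry gate reduces to parsing the registry host out of an image reference and checking it against an allow-list. The sketch below uses the common Docker heuristic (the first path component is a registry only if it contains a dot or a colon); the allow-list content and function names are illustrative, with harbor.cyverse.org taken from the text.

```python
# Sketch: deciding whether an app-publishing request needs manual
# review, based on its image's registry. Helper names are our own.

TRUSTED_REGISTRIES = {"harbor.cyverse.org"}

def registry_of(image_ref: str) -> str:
    """Registry host of an image reference; bare references like
    'ubuntu:22.04' default to Docker Hub."""
    if "/" in image_ref:
        first = image_ref.split("/", 1)[0]
        if "." in first or ":" in first:   # looks like a hostname[:port]
            return first
    return "docker.io"

def needs_review(image_ref: str) -> bool:
    """True when administrators must inspect before publishing."""
    return registry_of(image_ref) not in TRUSTED_REGISTRIES
```

Under this policy an image already in Harbor publishes immediately, while a Docker Hub or Quay reference raises a review notification, mirroring the workflow described above.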

Cloud-Native Services
"Cloud-native" [12,20,175] has a narrower definition than "cloud-based" services: cloud-native generally refers to leveraging existing cloud with container-based environments which greatly reduce the time to set up or deploy, whereas cloud-based services require provisioning and software stack installation. CyVerse has successfully managed cloud-based services for over ten years. In the last five years, these have included an OpenStack web-based interface called "Atmosphere," which has become the Jetstream production research cloud. Currently, CyVerse is developing cloud-native applications for managing larger and more dynamic cloud deployments, including Kubernetes (K8s) [43], Lightweight Kubernetes (Rancher K3s) [176], Argo workflows [177], and Terraform [178,179]. These products are in support of Jetstream2, both as backend services and a new web-based UI.

How challenges are met
In 2009, grand challenges facing the plant sciences community included (1) assembling, visualizing, and analyzing the Tree of Life (AVAToL) [158] with all 1.7 million species, (2) the exploration of genotype-to-phenotype, i.e., genotype crossed by environment equals phenotype (G×E=P), and (3) the curation with digital images of herbaria records, estimated at the time at 500 million. Critically, the iPlant Collaborative was not established to do any type of primary data collection, but rather to develop the cyberinfrastructure aspects of the challenge. In the early years of the iPlant Collaborative project, there were heated discussions amongst the community at large about the NSF's decision to support cyberinfrastructure without associated data collection. The outcome was that iPlant would focus on cyberinfrastructure: providing software and hardware solutions which link to community data and could lead to scientific discovery and transformative outcomes. Significant portions of the original iPlant Collaborative budget went to supporting travel and to developing synthesis collaborations linking data curation and collection programs to the cyberinfrastructure (e.g., Taxonomic Name Resolution Service [TNRS] [159], Botanical Information and Ecology Network [BIEN] [160,161]) [4]. Examples of big data projects still hosted in the CyVerse infrastructure include the USDA NIFA AG2PI Collaborative [162], NSF GenoPhenoEnvo [163], and PhytoOracle [164]. Notable examples of projects which do collect large-scale life science data that are leveraged within CyVerse include the National Ecological Observatory Network (NEON) [165], Genomes to Fields (G2F) [166-169], TerraREF [170], the National Phenology Network (NPN) [171], and the Long Term Agricultural Research (LTAR) [172] and Long Term Ecological Research (LTER) [173] networks. Many of these data are now supported by the Environmental Data Initiative (EDI) [174], which collaborates with CyVerse to make ecological data more FAIR.

Migration to Commercial Cloud versus remaining On-Premise
When research data are hosted in separate geographic locations connected across the internet, they often require specialized data transfers [180,181]. Research data can be so large that they require physically carrying or transporting the data storage devices to other computing facilities (so-called "walking networks" or "sneaker-net") so they can be analyzed. This can be because the overall bandwidth of the internet available to the researchers is too small [182,183] or because the data formats are not optimized for cloud-native processing [12]. A growing number of research projects rely on more computational hardware than may be available at any one institution [184-186]. Local compute clusters have given way to high performance computing, grid computing, and cloud computing. Commercial cloud services have only existed for 15 years [187], while internet-connected mobile phones have only existed for 14 years [188]. At-cost and small-loss pricing for cloud data hosting has become a lure for bringing publicly funded research and governmental data into the commercial cloud; profits are made on computing and data egress from such services. As of 2021, cloud is the largest revenue producer for Microsoft (Azure) [189] and the most profitable sector for Amazon (Amazon Web Services) [190]. This massive global shift toward cloud computing, along with the need for research objects, explains why 'cloud-native science' is part of our shared future, with the promise of reducing time-to-science and increasing the overall pace of scientific discovery. However, it is not without risk or potentially high cost of ownership.
Other than CyVerse, large tech companies are the only entities with development teams large enough to create and operate middleware for managing cyberinfrastructure for research objects. However, tech companies need to (eventually) make a profit for their shareholders and thus may change the availability of free services to paid. Without more and larger investments into public cyberinfrastructure from state and national funding entities for science and education, researchers will increasingly turn to commercial cloud to run their science experiments at scale, which will widen the divide between those with and without financial resources [191]. As researchers move toward commercial cloud, they move away from open science and toward commercialized data access.
While the capability of commercial cloud is not in question, the price is. The cogent question is whether managing hardware on-premises is the most valuable use of financial resources for state- and national-scale research [80,81,83]. In the effort to address this question and remain agile, CyVerse's SaaS and IaC are designed to run anywhere (cloud-agnostic). If, in the future, the consensus about whether to operate research computing on commercial or on-premises hardware changes, CyVerse operations could be moved. However, for the present, the financial requirements of moving data and analyses fully onto commercial clouds are beyond CyVerse's research funding capacity.

Generative AI revolution
With the emergence of generative AI and Large Language Models (LLMs) from OpenAI (ChatGPT), Google (Bard), Meta (LLaMA), and model repositories such as HuggingFace, anyone with internet access can now leverage AI-assisted programming and general work productivity. Early research reports suggest that AI assistants can improve programming and general productivity by over 50%. CyVerse already supports LLMs and integrates AI extensions into its workbench. The CyVerse Data Store can host trained models and training data, run popular applications (e.g., HuggingFace-hosted Apps) in its workbench, or distribute larger model training processes to publicly available HPC/HTC and cloud platforms with GPU hardware at no cost to academic researchers (through the NSF's ACCESS-CI program).

Distribution of registered users globally and in the United States (Fig K).
Of all users, 71% are from North America, 12% Europe, 10% Asia, 3% Africa, 2% South America, and 1% Oceania (Fig K). The majority of users self-identify as graduate and undergraduate students (Fig 4 in main text), which is not surprising given the workload distribution of modern research and CyVerse's focus on student training. By race, users identify as White, Hispanic or Latino (27.4%), White including Arabic (17.5%), Asian or Pacific Islander (18.5%), African American or Black (6.0%), and American Indian, Alaskan Native, or Hawaiian Native (0.6%). Undefined categories included "Other" (6.6%), "Not Provided" (8.4%), and "Declined to Provide" (15.1%).

Fig B. Authentication. Users (navy blue) log in through a web browser (orange), where they submit their credentials through either KeyCloak (gray) or CILogon (green). These authentication requests are accepted by OAuth2.0 (black) and returned. UML template adapted from GitHub user JMBarbier (https://github.com/jmbarbier).

Fig C. User login screen. CyVerse authentication uses Keycloak with CILogon, GitHub, Globus Auth, or Google credentials. Users can log in with their unique CyVerse username or from their preferred single sign-on service.

Fig D. User Portal. Provides access to all other CyVerse platforms and services, as well as requests for additional data storage, cloud resources, and workshops.
The Data Store is replicated weekly to the Corral storage resource at TACC. As of mid-2023, the CyVerse iRODS store holds 8 petabytes (PB) of user-contributed data, totaling 200 million individual objects (Fig E). Currently, CyVerse iRODS handles on average 100 terabytes (TB) of uploads and 400 TB of downloads per month (Fig F). Transfer (download and upload) speeds between CyVerse and TACC, and between CyVerse and cloud services, currently range between 140-220 MB/s for large files. See S6 Table for expected transfer duration by file size.

Fig E. Data Store. The iRODS data store is accessible from a variety of multi-client access end points. The resource servers that make up the data store include on-premises storage servers at UArizona, as well as federated storage on commercial and public research clouds.

Fig F. Data Store traffic. Data Store transfers by CyVerse users: total data (TiB/month) downloaded and uploaded from the Data Store (Panel A), and the number of files (millions/month) downloaded and uploaded (Panel B).
The DE also provides access to training materials (Fig G). The DE allows users to upload or download data from the Data Store via their web browser over HTTP. Users can start scientific workflows via HTCondor, Tapis, OSG, and Kubernetes by selecting public applications, or integrate their own applications as private apps. The DE is Accessible Rich Internet Applications (ARIA) compliant for users with disabilities.

Fig G. Discovery Environment user interface. The DE uses a table of contents menu (left side) with a collapsible hamburger menu. The Data Store, Apps, and Analyses can be viewed in the main frame. Help, updates, and user profile features are visible in the upper right corner. Administrator accounts can approve access requests (VICE), edit public Apps and Tools, approve DOI requests, and edit Reference Genomes in the table of contents (lower left).

Fig H. Featured container deployment. CI/CD workflow for featured container applications in the Discovery Environment. Image recipes (Dockerfiles) are hosted on GitHub/GitLab and use build triggers with automation servers (GitHub Actions [163]) to build and tag images. Tested images are pushed to public and private registries on DockerHub and self-hosted Harbor. Images are cached on the DE production servers (nodes) for rapid deployment as containers at runtime.

Fig I. Discovery Environment interactive apps. Interactive jobs include GUI apps like RStudio and JupyterLab. The DE manages interactive jobs through Kubernetes (K8s) and its Terrain API. Access to apps is managed by an Ingress Controller (NGINX [169]). The analysis service shows whether the app is deployed, loading, or currently running, and loads the UI for the analysis. Central authentication is managed by CAS. Users can load data from the iRODS data store into their running containers. LDAP manages the user's secure authentication. The Data Store is mirrored nightly at TACC from UArizona.

Fig J. Discovery Environment executable apps. Executable (command line interface driven) Apps are managed by HTCondor and the Terrain API. Jobs are submitted through the DE user interface, where they trigger a job submission service managed by HTCondor with the Advanced Message Queuing Protocol (AMQP). Once the job runs, it is sent to a node where a program called RoadRunner uses Docker Compose to manage the execution. Data are copied back to the iRODS data store when the app completes, using a custom program called porklock (https://github.com/cyverse-de/porklock). A PostgreSQL database monitors all job statuses and outcomes.

Fig K. Global and USA distribution of CyVerse user accounts. CyVerse registered accounts by country (Panel A) and by US state (Panel B). Base map from Carto and OpenStreetMap, CC-BY 4.0 license (https://github.com/CartoDB/basemap-styles).