Using Eureka HPC

Connecting to Eureka HPC

Each user has been provided a unique URL for connecting to their Eureka HPC instance. Here is how to connect to Eureka HPC:

Disconnecting from Eureka HPC

When you are done working in your Eureka HPC instance, you should log out. Here is how to log out of Eureka HPC:

Frequently Asked Questions About Connecting to Eureka

Additional Eureka HPC Features

Moving Files In & Out of Eureka HPC

Eureka is designed first and foremost to protect sensitive data files. One important aspect of this is that you cannot access the Internet directly from your app server in order to upload or download files via the usual mechanisms such as FTP, email, or websites. Instead, you will use a specially configured location on Google Cloud Storage, called your Eureka Staging Bucket. Your Eureka Staging Bucket can be used to transfer files between your local workstation and your Eureka HPC instance.

This is a two-step process: first, copy files from the source system (your local workstation or your Eureka HPC) to your Eureka Staging Bucket; then, copy them from the bucket to the destination system.

There are three options for doing this: Google Cloud Console, gsutil, and GCSFuse.  See below for how to use each of these options.

Important: 

 - You can use your Eureka Staging Bucket to download data to your local workstation.

 - Just because you can do this does not mean that you should. 

 - Sensitive data such as PHI may only be downloaded to workstations or servers that comply with your institutional HIPAA policies. 

Please contact us if you have any questions.

Using the Google Cloud Console

The Google Cloud Console provides a point-and-click graphical user interface to your staging bucket. This is a good option for ad-hoc transfer of a few small files at a time. 

For scripted transfers, or for transfers of very large files or many files at once, use one of the other options.

Using the gsutil Command-Line Interface

The gsutil command-line interface is extremely useful for transferring large files, large groups of files, or for scripting file transfer. 

Configuring Your Credentials

gsutil is already installed on your Eureka HPC instance. To install gsutil on the local workstation from which you connect to your Eureka HPC instance, see the instructions here for Mac and here for Windows. You will need to configure your Google credentials on your Eureka HPC if you have not already done so. Run the following command from your Eureka HPC and follow the prompts. It may provide you with a long URL, which you should paste into a web browser and authenticate using your Eureka credentials.

gcloud auth login 

Transferring Files

The basic syntax for transferring a file using gsutil is as follows:

gsutil cp [source] [destination]

Local files are specified using the usual syntax, for example ~/myfile.txt. Your bucket is specified as gs://[project-id]-staging.

Examples, assuming a project id of hdcekaxmp:

Upload a local file to your staging bucket:

gsutil cp myfile.txt gs://hdcekaxmp-staging

Download a file from your staging bucket to the current directory:

gsutil cp gs://hdcekaxmp-staging/myfile.txt .

Copy an object from one staging bucket to another:

gsutil cp gs://hdcekaxmp1-staging/obj gs://hdcekaxmp2-staging/obj2
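
To transfer many files at once, gsutil can copy a directory recursively and in parallel. A minimal sketch, using the same hypothetical project id and a hypothetical local directory named mydir:

gsutil -m cp -r mydir gs://hdcekaxmp-staging/mydir

The -m flag runs the copy with multiple parallel threads, which can substantially speed up transfers of many files.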

More Examples

Using GCSFuse

GCSFuse allows you to mount your staging bucket as a folder within a Linux or macOS filesystem. (This feature is not available on Windows systems.) You can use GCSFuse to mount your staging bucket on your Eureka HPC, your local workstation, or both.

Setting Up GCSFuse on Your Eureka HPC

GCSFuse is already installed on your Eureka HPC; you only need to configure it.

Advanced users may wish to explore modifying fstab to mount their staging bucket by default at startup, thereby skipping Step 3. See the GCSFuse documentation for details.
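
As a hedged illustration of a manual mount (assuming a hypothetical project id of hdcekaxmp and an empty directory to use as the mount point):

mkdir -p ~/staging

gcsfuse hdcekaxmp-staging ~/staging

fusermount -u ~/staging

The first two commands create the mount point and mount the staging bucket there; the last unmounts it when you are finished (on Linux).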

Setting Up GCSFuse on Your Local Workstation

Configuring GCSFuse on your local workstation is nontrivial, but can be very useful. By mounting your staging bucket on both your Eureka HPC and your local workstation, you can move files seamlessly between systems without making calls to gsutil. See the following links for more information:

Frequently Asked Questions: Moving data in/out of Eureka HPC

Accessing Compass Data Marts in Eureka

If you have been authorized to access a Health Data Compass data mart in BigQuery, you can safely view it from your Eureka HPC instance. You can also download it there for further analysis.

Via the Command Line

You can access BigQuery datasets using the “bq” command line utility. This is a powerful utility, and full documentation can be found here: https://cloud.google.com/bigquery/docs/bq-command-line-tool. Below are a few simple examples for common uses: 

Examples using Command Line to access data

Examples: Exploring Data 

See what datasets you can access in a project: 

bq --project_id [project-name] ls

See what tables are in a dataset:

bq --dataset_id [project-name]:[dataset-name] ls

Show the schema of a table:

bq show [project-name]:[dataset-name].[table-name] 

Show the first few rows of a table:

bq head [project-name]:[dataset-name].[table-name] 

Examples: Querying Data 

*Note 1: In this and all SELECT examples, standard SQL table identifiers are written with dots rather than a colon, and if the name of the project that contains the data you are querying has a hyphen in it, you must surround the table identifier with backticks, as follows: `[project-name].[dataset-name].[table-name]`

*Note 2: In the examples below, [PROJECT] and [DATASET] refer to the project and dataset that contain the data you wish to query, not necessarily your own Eureka project.

Execute a SELECT query from the command line and view the results:

bq query --use_legacy_sql=false 'select * from `[PROJECT].[DATASET].[TABLE]`'

Execute a SELECT query from a query that’s stored in a file (for more complex queries) and view the results:

 cat [LOCAL-SQL-FILENAME] | bq query --use_legacy_sql=false
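
For instance, a hypothetical file myquery.sql might contain:

select *
from `[PROJECT].[DATASET].[TABLE]`
limit 100

and you would run it with:

cat myquery.sql | bq query --use_legacy_sql=false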

Examples: Downloading Data 

*Note 1: In the examples below, [PROJECT] and [DATASET] refer to the project and dataset that contain the data you wish to query, not necessarily your own Eureka project.

*Note 2: The “bq query” command will return a maximum of 16,000 rows. For larger result sets, see the example for “bq extract”.

Output the results of a SELECT command to a CSV file:

bq query --use_legacy_sql=false --format=csv 'select * from `[PROJECT].[DATASET].[TABLE]`' > result.csv

Export a table to a file in your Google Cloud Storage Staging Bucket:

bq extract --destination_format CSV --field_delimiter "," [PROJECT]:[DATASET].[TABLE] gs://[EUREKA-PROJECT]-staging/[FILENAME]

Copy a file from your Google Cloud Storage Staging Bucket to your HPC:

gsutil cp gs://[EUREKA-PROJECT]-staging/[FILENAME] [FILENAME]

Export the results of a large query (>16,000 resulting rows) to your BigQuery Staging Dataset:

1. Query the data and store the results in a new BigQuery table:

bq query --use_legacy_sql=false --destination_table [EUREKA-PROJECT-ID]:staging.[TABLE] 'select * from `[PROJECT].[DATASET].[TABLE]`'

2. Use the instructions above to export the new table to a file in Google Cloud Storage.

3. Use the instructions above to copy the file from your Google Cloud Storage Staging Bucket to your HPC.
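
Putting the three steps together, a hedged end-to-end sketch (the Eureka project id hdcekaxmp, the source table, and the file name are all hypothetical):

bq query --use_legacy_sql=false --destination_table hdcekaxmp:staging.myresults 'select * from `[PROJECT].[DATASET].[TABLE]`'

bq extract --destination_format CSV --field_delimiter "," hdcekaxmp:staging.myresults gs://hdcekaxmp-staging/myresults.csv

gsutil cp gs://hdcekaxmp-staging/myresults.csv myresults.csv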

Running SLURM Jobs on Eureka HPC

Jobs are submitted to the SLURM workload manager. Primarily, you will use the sbatch command to submit jobs and the squeue command to monitor them. You can submit any valid shell script as a job using sbatch. Once a job is submitted, Eureka HPC will create a temporary compute node just for that job. As soon as this node has no more work to do, it will be deleted. This is the core cost-saving feature of Eureka HPC.

If you submit a job with sbatch and provide no options to SLURM, your job will be submitted with the following defaults: a preemptible compute node with a maximum of 4 cores and a maximum runtime of 23 hours.

To exceed 23 hours of runtime, or to get more than 4 cores, you must submit the job to a non-default SLURM partition. You can see all of the available partitions in your SLURM setup by typing sinfo. Each partition corresponds to a different type of Google Cloud VM with different resources. For example, you will see partitions called c2s16 and c2s16_nonpre. These correspond to 16-core, 64 GB RAM VMs; the second one is non-preemptible. Non-preemptible VMs cost approximately 3x as much per hour as preemptible VMs, which is why they are not the default.

If your job can run in less than 23 hours, or if it can be modified to break its work up into smaller chunks that each run in less than 23 hours, you should choose a preemptible node to minimize costs. All preemptible partitions are limited to a 23-hour job runtime. To choose a non-default partition, add a line like this to your batch script: #SBATCH -p c2s16. In this example, your job would then run on a 16-core VM; a minimal example script follows below.

By default, Eureka HPC includes GPU compute capability, available in a GPU SLURM partition. NVIDIA CUDA drivers are installed by default. Please note that GPUs are expensive, and the CPUs attached to these instances are slower than those otherwise available in Eureka HPC, so you should not choose the GPU partition unless you have a GPU-enabled program.
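
A minimal sketch of a batch script that selects the c2s16 partition (the job name, log file, and workload command are placeholders):

#!/bin/bash
#SBATCH -p c2s16                  # run on a 16-core preemptible VM
#SBATCH --job-name=myjob          # placeholder job name
#SBATCH --output=myjob_%j.log     # %j is replaced by the job id

./my_analysis.sh                  # placeholder for your actual workload

Submit and monitor it with:

sbatch myjob.sh

squeue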

Running interactive SLURM jobs

You should not run any interactive processes that require more than minimal CPU on the login node. Things like text editors are fine, but commands like sort or R should be run as SLURM interactive jobs so that they run on powerful hardware. You can start an interactive job by typing interact. There may be a delay of approximately one minute, after which you will receive a shell prompt on a compute node.
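
For example, to sort a large file on a compute node rather than the login node (the file names are placeholders):

interact

sort large_file.txt > sorted.txt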

SLURM Best Practices on Eureka HPC

The ideal batch job is longer than a few minutes but shorter than a day. Jobs in this range will be scheduled more efficiently by SLURM, so you will get more work through the system and, on average, experience less queue wait time if you keep your jobs within these limits.

Batch jobs should be able to be killed and restarted without losing too much progress. You can accomplish this by writing your code to checkpoint, or simply by breaking up a long-running job into multiple shorter jobs, as sketched below.
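
One common way to break work into shorter jobs is a standard SLURM job array. A hedged sketch, assuming your work splits into 10 independent chunks (the per-chunk script is a placeholder):

#!/bin/bash
#SBATCH -p c2s16
#SBATCH --array=1-10              # schedule 10 independent short tasks

./process_chunk.sh $SLURM_ARRAY_TASK_ID   # placeholder script; receives the chunk number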

Do not hard code the UNIX path of your home directory into your SLURM scripts.  If you need to reference a file in your scripts, use the shell’s $HOME variable.

Storage options on Eureka HPC

To minimize dollars spent on computing and storage, you must be aware of the different types of Google storage available in Eureka HPC:

Google Cloud Storage

This is Google’s most cost-effective method for long-term storage. Ideally, you should store both input data and results here. You must move data in and out of Google Cloud Storage using the gsutil tool or the Google Cloud web console. Instructions on using Google Cloud Storage can be found above.

/home

Your home directory exists only to hold the files required for your Linux account to function. It is small and slow, but it is OK to temporarily allow SLURM logs or other small files to be written there.

/tmp on compute nodes

Each Google Cloud compute node has local SSD storage attached directly to it.  This storage is both fast and inexpensive, but it exists only as long as the node is running, and is destroyed when the node shuts down.  It is thus only useful as working storage while your job is running. As the last step in your job, you must copy out any results that have been stored in /tmp to a permanent location, like Google Cloud Storage.
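
A hedged sketch of this pattern (the project id, input file, and workload are placeholders):

#!/bin/bash
#SBATCH -p c2s16

cd /tmp
gsutil cp gs://[project-id]-staging/input.dat .    # stage input onto fast local SSD
./my_analysis.sh input.dat > results.txt           # placeholder workload
gsutil cp results.txt gs://[project-id]-staging/   # copy results out before the node is deleted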

Optional shared storage mounted at /gpfs

Primarily to aid with migrating existing Rosalind (on-prem HPC) workflows, Eureka HPC instances may be ordered with shared storage mounted at /gpfs. This is not a standard feature because it entails significant added cost: per GB, it is 5-10x more expensive than Google Cloud Storage. If you have existing Rosalind jobs and do not want to modify them to use Google Cloud Storage and /tmp, this shared storage lets you make that cost vs. convenience tradeoff.


Limited Internet Access from Eureka HPC

Eureka HPC can connect to the following URLs from within Eureka via the Eureka Limited Internet App.

The first time you use the internet in Eureka you will need to run the following command from your Eureka HPC and follow the prompts. It may provide you with a long URL which you should paste into another tab in your browser and authenticate using your Eureka credentials.

gcloud auth login 

To access any of the URLs below, you will need to run the corresponding console command, listed below. To see your choices from within the terminal, type eureka-internet and then hit the tab key twice; it will print all of the possible internet commands. Once you run a command, the connection will open in around 5 seconds, but can take up to 15. Access to a site is limited to 30 minutes, so if you need the connection open longer, run the command again to add another 30 minutes.


Console Command: eureka-internet-CRAN-Bioconductor

Sites:

https://cloud.r-project.org

https://www.bioconductor.org


Console Command: eureka-internet-GitHub.com

Sites:

https://github.com

https://raw.githubusercontent.com


Console Command: eureka-internet-Python.org

Sites:

https://pypi.org

https://www.python.org


Console Command: eureka-internet-RedCap

Sites:

https://redcap.ucdenver.edu

NOTE: Some R packages require access to GitHub at the same time as CRAN, so make sure you run both commands to ensure complete installation of those packages.

When you are done with your session, you can logout of gcloud by running the following command in the terminal:

gcloud auth revoke

Installing R packages from CRAN

Access to CRAN is available through the Limited Internet Access feature.  Follow the Limited Internet Access steps above to get connected to CRAN.  Once CRAN is accessible you can install packages from CRAN using install.packages() in the usual way. 
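
As a hedged example (the package name is a placeholder), open the connection in a terminal:

eureka-internet-CRAN-Bioconductor

and then, within R:

install.packages("ggplot2", repos = "https://cloud.r-project.org")

Specifying repos explicitly ensures the download goes to the allowed CRAN mirror listed above.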

Internet Security & Eureka HPC

Security is a group effort between you and Compass.  We cannot do it without you. Please be sure to follow all rules in the Eureka User Agreement.

Some common problems with software downloaded from the internet include:

You must ensure that you have carefully reviewed software from any source for these problems, but be particularly careful with container hubs (such as Docker Hub) and software from GitHub that is not widely used.  Due to the difficulty of determining the trustworthiness of software on container hubs, we discourage their use.  You are responsible for vetting software you upload to Eureka HPC.

You must not store confidential information on sites outside Eureka, unless you have received specific permission. You must never store confidential information on GitHub.

Frequently Asked Questions: Limited Internet Access

Using Python with Eureka HPC

Compass highly recommends using Python through PyCharm with a personal virtual environment, using these steps:
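
A minimal sketch of the virtual environment workflow from the Eureka HPC terminal (the environment path and package are placeholders; PyPI access assumes the Limited Internet connection described above is open):

eureka-internet-Python.org

python3 -m venv ~/venvs/myproject       # create a personal virtual environment

source ~/venvs/myproject/bin/activate   # activate it for this shell

pip install numpy                       # placeholder package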

Google Cloud Source Repository

Each Eureka HPC instance has Google Cloud Source Repository set up and enabled for sharing code files between multiple users on a shared Eureka instance.

Note that sensitive data like PHI should never be included in code files, including those shared on code-sharing platforms like GitHub.