Using Eureka HPC

Connecting to Eureka HPC

Each user has been provided a unique URL for connecting to their Eureka HPC instance. Here is how to connect to Eureka HPC:

Disconnecting from Eureka HPC

When you are done working in your Eureka HPC instance, you should log out. Here is how to log out of Eureka HPC:

Frequently Asked Questions About Connecting to Eureka

Additional Eureka HPC Features

Moving Files In & Out of Eureka HPC

Eureka is designed first and foremost to protect sensitive data files. One important aspect of this is that you cannot access the Internet directly from your app server in order to upload or download files via the usual mechanisms such as FTP, email, or websites. Instead, you will use a specially configured location on Google Cloud Storage, called your Eureka Staging Bucket. Your Eureka Staging Bucket can be used to transfer files between your local workstation and your Eureka HPC instance.

This is a two-step process: first, copy files from the source system (your local workstation or your Eureka HPC) to your Eureka Staging Bucket; then, copy them from the bucket to the destination system.

There are three options for doing this: Google Cloud Console, gsutil, and GCSFuse.  See below for how to use each of these options.

Important: 

 - You can use your Eureka Staging Bucket to download data to your local workstation.

 - Just because you can do this does not mean that you should. 

 - Sensitive data such as PHI may only be downloaded to workstations or servers that comply with your institutional HIPAA policies. 

Please contact us if you have any questions.

Using the Google Cloud Console

The Google Cloud Console provides a point-and-click graphical user interface to your staging bucket. This is a good option for ad-hoc transfer of a few small files at a time. 

For scripted transfers, or for transfers of very large files or many files at once, use one of the other options.

Using the gsutil Command-Line Interface

The gsutil command-line interface is extremely useful for transferring large files, large groups of files, or for scripting file transfer. 

Configuring Your Credentials

gsutil is already installed on your Eureka HPC instance. To install gsutil on the local workstation from which you connect to your Eureka HPC instance, see the instructions here for Mac and here for Windows. You will need to configure your Google credentials on your Eureka HPC if you have not already done so. Run the following command from your Eureka HPC and follow the prompts. It may provide you with a long URL, which you should paste into a web browser and authenticate using your Eureka credentials.

gcloud auth login 

Transferring Files

The basic syntax for transferring a file using gsutil is as follows:

gsutil cp [source] [destination]

Local files are specified using the usual syntax, for example ~/myfile.txt. Your bucket is specified as gs://[project-id]-staging.

Examples, assuming a project id of hdcekaxmp:

Upload a local file to your staging bucket:

gsutil cp myfile.txt gs://hdcekaxmp-staging

Download a file from your staging bucket to the current directory:

gsutil cp gs://hdcekaxmp-staging/myfile.txt .

Copy an object from one staging bucket to another:

gsutil cp gs://hdcekaxmp1-staging/obj gs://hdcekaxmp2-staging/obj2
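
To transfer many files at once, gsutil can copy a directory recursively and in parallel. A minimal sketch, using the same hypothetical project id and a hypothetical local directory named mydir:

gsutil -m cp -r mydir gs://hdcekaxmp-staging/mydir

The -m flag runs the copy with multiple parallel threads, which can substantially speed up transfers of many files.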

More Examples

Using GCSFuse

GCSFuse allows you to mount your staging bucket as a folder within a Linux or macOS filesystem. (This feature is not available on Windows systems.) You can use GCSFuse to mount your staging bucket on your Eureka HPC, your local workstation, or both.

Setting Up GCSFuse on Your Eureka HPC

GCSFuse is already installed on your Eureka HPC; you only need to configure it.

Advanced users may wish to explore modifying fstab to mount their staging bucket by default at startup, thereby skipping Step 3. See the GCSFuse documentation for details.
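
As a hedged illustration of a manual mount (assuming a hypothetical project id of hdcekaxmp and an empty directory to use as the mount point):

mkdir -p ~/staging

gcsfuse hdcekaxmp-staging ~/staging

fusermount -u ~/staging

The first two commands create the mount point and mount the staging bucket there; the last unmounts it when you are finished (on Linux).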

Setting Up GCSFuse on Your Local Workstation

Configuring GCSFuse on your local workstation is nontrivial, but can be very useful. By mounting your staging bucket on both your Eureka HPC and your local workstation, you can move files seamlessly between systems without making calls to gsutil. See the following links for more information:

Frequently Asked Questions: Moving data in/out of Eureka HPC

Accessing Compass Data Marts in Eureka

If you have been authorized to access a Health Data Compass data mart in BigQuery, you can safely view it from your Eureka HPC instance. You can also download it there for further analysis.

Via the Command Line

You can access BigQuery datasets using the “bq” command line utility. This is a powerful utility, and full documentation can be found here: https://cloud.google.com/bigquery/docs/bq-command-line-tool. Below are a few simple examples for common uses: 

Examples using Command Line to access data

Examples: Exploring Data 

See what datasets you can access in a project: 

bq --project_id [project-name] ls

See what tables are in a dataset:

bq --dataset_id [project-name]:[dataset-name] ls

Show the schema of a table:

bq show [project-name]:[dataset-name].[table-name] 

Show the first few rows of a table:

bq head [project-name]:[dataset-name].[table-name] 

Examples: Querying Data 

*Note 1: In this and all SELECT examples, standard SQL table identifiers are written with dots rather than a colon, and if the name of the project that contains the data you are querying has a hyphen in it, you must surround the table identifier with backticks, as follows: `[project-name].[dataset-name].[table-name]`

*Note 2: In the examples below, [PROJECT] and [DATASET] refer to the project and dataset that contain the data you wish to query, not necessarily your own Eureka project.

Execute a SELECT query from the command line and view the results:

bq query --use_legacy_sql=false 'select * from `[PROJECT].[DATASET].[TABLE]`'

Execute a SELECT query from a query that’s stored in a file (for more complex queries) and view the results:

 cat [LOCAL-SQL-FILENAME] | bq query --use_legacy_sql=false
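
For instance, a hypothetical file myquery.sql might contain:

select *
from `[PROJECT].[DATASET].[TABLE]`
limit 100

and you would run it with:

cat myquery.sql | bq query --use_legacy_sql=false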

Examples: Downloading Data 

*Note 1: In the examples below, [PROJECT] and [DATASET] refer to the project and dataset that contain the data you wish to query, not necessarily your own Eureka project.

*Note 2: The “bq query” command will return a maximum of 16,000 rows. For larger result sets, see the example for “bq extract”.

Output the results of a SELECT command to a CSV file:

bq query --use_legacy_sql=false --format=csv 'select * from `[PROJECT].[DATASET].[TABLE]`' > result.csv

Export a table to a file in your Google Cloud Storage Staging Bucket:

bq extract --destination_format CSV --field_delimiter "," [PROJECT]:[DATASET].[TABLE] gs://[EUREKA-PROJECT]-staging/[FILENAME]

Copy a file from your Google Cloud Storage Staging Bucket to your HPC:

gsutil cp gs://[EUREKA-PROJECT]-staging/[FILENAME] [FILENAME]

Export the results of a large query (>16,000 resulting rows) to your BigQuery Staging Dataset:

1. Query the data and store the results in a new BigQuery table:

bq query --use_legacy_sql=false --destination_table [EUREKA-PROJECT-ID]:staging.[TABLE] 'select * from `[PROJECT].[DATASET].[TABLE]`'

2. Use the instructions above to export the new table to a file in Google Cloud Storage.

3. Use the instructions above to copy the file from your Google Cloud Storage Staging Bucket to your HPC.
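
Putting the three steps together, a hedged end-to-end sketch (the Eureka project id hdcekaxmp, the source table, and the file name are all hypothetical):

bq query --use_legacy_sql=false --destination_table hdcekaxmp:staging.myresults 'select * from `[PROJECT].[DATASET].[TABLE]`'

bq extract --destination_format CSV --field_delimiter "," hdcekaxmp:staging.myresults gs://hdcekaxmp-staging/myresults.csv

gsutil cp gs://hdcekaxmp-staging/myresults.csv myresults.csv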

Running SLURM Jobs on Eureka HPC

Jobs are submitted to the SLURM workload manager. Primarily, you will use the sbatch command to submit jobs and the squeue command to monitor them. You can submit any valid shell script as a job using sbatch. Once a job is submitted, Eureka HPC will create a temporary compute node just for that job. As soon as this node has no more work to do, it will be deleted. This is the core cost-saving feature of Eureka HPC.

If you submit a job with sbatch and provide no options to SLURM, your job will be submitted with the following defaults: a preemptible compute node with a maximum of 4 cores and a maximum runtime of 23 hours.

To exceed 23 hours of runtime, or to get more than 4 cores, you must submit the job to a non-default SLURM partition. You can see all of the available partitions in your SLURM setup by typing sinfo. Each partition corresponds to a different type of Google Cloud VM with different resources. For example, you will see partitions called c2s16 and c2s16_nonpre. These correspond to 16-core, 64 GB RAM VMs; the second one is non-preemptible. Non-preemptible VMs cost approximately 3x as much per hour as preemptible VMs, which is why they are not the default.

If your job can run in less than 23 hours, or if it can be modified to break its work up into smaller chunks that each run in less than 23 hours, you should choose a preemptible node to minimize costs. All preemptible partitions are limited to a 23-hour job runtime. To choose a non-default partition, add a line like this to your batch script: #SBATCH -p c2s16. In this example, your job would then run on a 16-core VM; a minimal example script follows below.

By default, Eureka HPC includes GPU compute capability, available in a GPU SLURM partition. NVIDIA CUDA drivers are installed by default. Please note that GPUs are expensive, and the CPUs attached to these instances are slower than those otherwise available in Eureka HPC, so you should not choose the GPU partition unless you have a GPU-enabled program.
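
A minimal sketch of a batch script that selects the c2s16 partition (the job name, log file, and workload command are placeholders):

#!/bin/bash
#SBATCH -p c2s16                  # run on a 16-core preemptible VM
#SBATCH --job-name=myjob          # placeholder job name
#SBATCH --output=myjob_%j.log     # %j is replaced by the job id

./my_analysis.sh                  # placeholder for your actual workload

Submit and monitor it with:

sbatch myjob.sh

squeue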

Running interactive SLURM jobs

You should not run any interactive processes that require more than minimal CPU on the login node. Things like text editors are fine, but commands like sort or R should be run as SLURM interactive jobs so that they run on powerful hardware. You can start an interactive job by typing interact. There may be a delay of approximately one minute, after which you will receive a shell prompt on a compute node.
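
For example, to sort a large file on a compute node rather than the login node (the file names are placeholders):

interact

sort large_file.txt > sorted.txt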

SLURM Best Practices on Eureka HPC

The ideal batch job is longer than a few minutes but shorter than a day. Jobs in this range will be scheduled more efficiently by SLURM, so you will get more work through the system and, on average, experience less queue wait time if you keep your jobs within these limits.

Batch jobs should be able to be killed and restarted without losing too much progress. You can accomplish this by writing your code to checkpoint, or simply by breaking up a long-running job into multiple shorter jobs, as sketched below.
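
One common way to break work into shorter jobs is a standard SLURM job array. A hedged sketch, assuming your work splits into 10 independent chunks (the per-chunk script is a placeholder):

#!/bin/bash
#SBATCH -p c2s16
#SBATCH --array=1-10              # schedule 10 independent short tasks

./process_chunk.sh $SLURM_ARRAY_TASK_ID   # placeholder script; receives the chunk number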

Do not hard code the UNIX path of your home directory into your SLURM scripts.  If you need to reference a file in your scripts, use the shell’s $HOME variable.

Storage options on Eureka HPC

To minimize dollars spent on computing and storage, you must be aware of the different types of Google storage available in Eureka HPC:

Google Cloud Storage

This is Google’s most cost-effective method for long-term storage. Ideally, you should store both input data and results here. You must move data in and out of Google Cloud Storage using the gsutil tool or the Google Cloud web console. Instructions on using Google Cloud Storage can be found above.

/home

Your home directory exists only to hold the files required for your Linux account to function. It is small and slow, but it is OK to temporarily allow SLURM logs or other small files to be written there.

/tmp on compute nodes

Each Google Cloud compute node has local SSD storage attached directly to it.  This storage is both fast and inexpensive, but it exists only as long as the node is running, and is destroyed when the node shuts down.  It is thus only useful as working storage while your job is running. As the last step in your job, you must copy out any results that have been stored in /tmp to a permanent location, like Google Cloud Storage.
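
A hedged sketch of this pattern (the project id, input file, and workload are placeholders):

#!/bin/bash
#SBATCH -p c2s16

cd /tmp
gsutil cp gs://[project-id]-staging/input.dat .    # stage input onto fast local SSD
./my_analysis.sh input.dat > results.txt           # placeholder workload
gsutil cp results.txt gs://[project-id]-staging/   # copy results out before the node is deleted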

Optional shared storage mounted at /gpfs

Primarily to aid with migrating existing Rosalind (on-prem HPC) workflows, Eureka HPC instances may be ordered with shared storage mounted at /gpfs. This is not a standard feature because it entails significant added cost: per GB, it is 5-10x more expensive than Google Cloud Storage. If you have existing Rosalind jobs and do not want to modify them to use Google Cloud Storage and /tmp, this shared storage lets you make that cost vs. convenience tradeoff.


Limited Internet Access from Eureka HPC

Eureka HPC can connect to the following URLs from within Eureka via the Eureka Limited Internet App.

The first time you use the internet in Eureka you will need to run the following command from your Eureka HPC and follow the prompts. It may provide you with a long URL which you should paste into another tab in your browser and authenticate using your Eureka credentials.

gcloud auth login 

To access any of the URLs below, you will need to run the corresponding console command, listed below. To see your choices from within the terminal, type eureka-internet and then hit the tab key twice; it will print all of the possible internet commands. Once you run a command, the connection will open in around 5 seconds, but can take up to 15. Access to a site is limited to 30 minutes, so if you need the connection open longer, run the command again to add another 30 minutes.


Console Command: eureka-internet-CRAN-Bioconductor

Sites:

https://cloud.r-project.org

https://www.bioconductor.org


Console Command: eureka-internet-GitHub.com

Sites:

https://github.com

https://raw.githubusercontent.com


Console Command: eureka-internet-Python.org

Sites:

https://pypi.org

https://www.python.org


Console Command: eureka-internet-RedCap

Sites:

https://redcap.ucdenver.edu

NOTE: Some R packages require access to GitHub at the same time as CRAN, so make sure you run both commands to ensure complete installation of those packages.

When you are done with your session, you can logout of gcloud by running the following command in the terminal:

gcloud auth revoke

Installing R packages from CRAN

Access to CRAN is available through the Limited Internet Access feature.  Follow the Limited Internet Access steps above to get connected to CRAN.  Once CRAN is accessible you can install packages from CRAN using install.packages() in the usual way. 
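
As a hedged example (the package name is a placeholder), open the connection in a terminal:

eureka-internet-CRAN-Bioconductor

and then, within R:

install.packages("ggplot2", repos = "https://cloud.r-project.org")

Specifying repos explicitly ensures the download goes to the allowed CRAN mirror listed above.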

Internet Security & Eureka HPC

Security is a group effort between you and Compass.  We cannot do it without you. Please be sure to follow all rules in the Eureka User Agreement.

Some common problems with software downloaded from the internet include:

You must ensure that you have carefully reviewed software from any source for these problems, but be particularly careful with container hubs (such as Docker Hub) and software from GitHub that is not widely used.  Due to the difficulty of determining the trustworthiness of software on container hubs, we discourage their use.  You are responsible for vetting software you upload to Eureka HPC.

You must not store confidential information on sites outside Eureka, unless you have received specific permission. You must never store confidential information on GitHub.

Frequently Asked Questions: Limited Internet Access

Using Python with Eureka HPC

Compass highly recommends using Python through PyCharm with a personal virtual environment, using these steps:
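
A minimal sketch of the virtual environment workflow from the Eureka HPC terminal (the environment path and package are placeholders; PyPI access assumes the Limited Internet connection described above is open):

eureka-internet-Python.org

python3 -m venv ~/venvs/myproject       # create a personal virtual environment

source ~/venvs/myproject/bin/activate   # activate it for this shell

pip install numpy                       # placeholder package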

Google Cloud Source Repository

Each Eureka HPC instance has Google Cloud Source Repository set up and enabled for sharing code files between multiple users on a shared Eureka instance.

Note that sensitive data like PHI should never be included in code files, including those shared on code-sharing platforms like GitHub.