Using Eureka

Connecting to Eureka v2

Each user has been provided a unique URL for connecting to their Eureka App VM. Here is how to connect to Eureka v2:

  1. Using the Chrome or Firefox web browser, go to your custom App VM URL. Do not use a private or incognito window.
  2. Use your Compass User Account to authenticate with Google (including 2-factor authentication).
  3. Use your Compass User Account to log into NoMachine. If you get a Server Error message, Google is still starting the VM (this may take up to 3 minutes). Refresh the browser and you will be connected.
  4. In the NoMachine window, select 'Create a New Virtual Desktop' and then select Continue.
  5. You're in your App VM once you see the CentOS 7 background image!
  6. If you are going to use RStudio to connect to your App VM, connect to your App VM first, as this will create the home directory needed for RStudio to run. Do this each time you need to log in to RStudio.

Moving Files In & Out of Eureka

Eureka is designed first and foremost to protect sensitive data files. One important aspect of this is that you cannot access the Internet directly from your app server in order to upload or download files via the usual mechanisms such as FTP, email, or web sites. Instead, you will use a specially configured location on Google Cloud Storage, called your Eureka Staging Bucket. Your Eureka Staging Bucket can be used to transfer files between your local workstation and your Eureka App Server.

This is a two-step process:

  1. Upload files from your app server or local workstation to your staging bucket
  2. Download the files from your staging bucket to your app server or local workstation

There are three options for doing this: Google Cloud Console, gsutil, and GCSFuse. See below for how to use each of these options.

Important:

- You have the ability to use your Eureka Staging Bucket to download data to your local workstation.

- Just because you can do this does not mean that you should.

- Sensitive data such as PHI may only be downloaded to workstations or servers that comply with your institutional HIPAA policies.

Please contact us if you have any questions.

Using the Google Cloud Console

The Google Cloud Console provides a point-and-click graphical user interface to your staging bucket. This is a good option for ad-hoc transfer of a few small files at a time.

For scripted transfers, transfers of very large files or many files at once, use one of the other options.

  1. Open a web browser to https://console.cloud.google.com/storage
  2. If prompted, authenticate using your Eureka credentials (typically ending with @hdcuser.org).
  3. Make sure the dropdown list to the right of the "Google Cloud Platform" logo on the top-left of the screen contains the name of your Eureka project. If it does not, click the down-arrow, and select your project.
  4. Within the table called "Buckets," you will see the name of your Eureka Staging Bucket, in the format [projectname-staging]. Click the name of the bucket to open the bucket.
  5. To upload files into your staging bucket, use the "Upload Files" or "Upload Folders" buttons. Alternatively, you can drag and drop files onto the whitespace on the bottom-right quadrant of the page to upload.
  6. To download files from your staging bucket, click on file names to download files via your web browser and select the location you want to download the file to.

Using the gsutil Command-Line Interface

The gsutil command-line interface is extremely useful for transferring large files, large groups of files, or for scripting file transfer.

Configuring Your Credentials

gsutil is already installed on your Eureka App VM. To install gsutil on the local workstation from which you connect to your Eureka App VM, follow the instructions found here for Mac and here for Windows. You may need to configure your Google credentials on your Eureka App VM if you have not already done so. To do this, run the following command from your Eureka App VM and follow the prompts. It may provide you with a long URL, which you should paste into a web browser and authenticate using your Eureka credentials.

gcloud auth application-default login 

Transferring Files

The basic syntax for transferring a file using gsutil is as follows:

gsutil cp [source] [destination]

Local files are specified following usual syntax, for example ~/myfile.txt. Your bucket will be specified as gs://[projectid-staging].

Examples, assuming a project id of hdcekaxmp:

  • To copy a local file to your staging bucket (this works from your Eureka App VM too):
gsutil cp myfile.txt gs://hdcekaxmp-staging
  • To copy a file from your staging bucket to a local file:
gsutil cp gs://hdcekaxmp-staging/myfile.txt .

More Examples

  • The gsutil cp command is powerful, supporting wildcards, simultaneous file transfers, resumable transfers, and more. For examples, see the gsutil cp documentation.
  • To synchronize entire folder hierarchies with your staging bucket, see the gsutil rsync command.
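
As a rough sketch of the options mentioned above, the following shell functions wrap a parallel wildcard copy and a folder synchronization. The function names, their arguments, and the folder names in the usage lines are illustrative assumptions, not part of Eureka; -m (parallel transfers) and rsync -r (recursive synchronization) are standard gsutil options.

```shell
# Hypothetical helpers; the names and arguments are illustrative only.

# Copy every CSV file in a local folder to the staging bucket in parallel.
# $1 = project id, $2 = local folder
upload_csvs() {
  gsutil -m cp "$2"/*.csv "gs://$1-staging/"
}

# Mirror a local folder hierarchy into a folder of the staging bucket.
# $1 = project id, $2 = local folder, $3 = folder name inside the bucket
sync_to_staging() {
  gsutil -m rsync -r "$2" "gs://$1-staging/$3"
}

# Example usage (uncomment to run):
# upload_csvs hdcekaxmp ~/results
# sync_to_staging hdcekaxmp ~/results results
```

Without the -d option, gsutil rsync only adds or updates files at the destination and does not delete anything there, which is the safer default for a staging area.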

Using GCSFuse

GCSFuse allows you to mount your staging bucket as a folder within a Linux or macOS filesystem. (This feature is not available on Windows systems.) You can use GCSFuse to mount your staging bucket on your Eureka App VM, your local workstation, or both.

Setting Up GCSFuse on Your Eureka App VM

GCSFuse is already installed on your Eureka App VM -- you only need to configure it.

  1. (One time only) Execute the following two commands to authenticate to Google Cloud:
  2. (One time only) Create a folder at which to mount the bucket:
    • mkdir ~/gcs
  3. (Each time you start your VM) Mount the folder, using the name of your staging bucket:
    • gcsfuse [projectid-staging] ~/gcs

Advanced users may wish to explore modifying fstab to mount their staging bucket by default at startup, thereby skipping Step 3. See the GCSFuse documentation for details.
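
Steps 2 and 3 above can be wrapped in a small function so that mounting is safe to repeat after each VM start. This is only a sketch: the function name mount_staging is hypothetical, and it assumes the mountpoint utility (part of util-linux) is available on the App VM.

```shell
# Hypothetical helper: mount the staging bucket unless it is already mounted.
# $1 = staging bucket name, $2 = mount point (defaults to ~/gcs)
mount_staging() {
  mnt="${2:-$HOME/gcs}"
  mkdir -p "$mnt"                 # Step 2: harmless if the folder exists
  if mountpoint -q "$mnt"; then
    echo "already mounted: $mnt"
  else
    gcsfuse "$1" "$mnt"           # Step 3
  fi
}

# Example usage (uncomment to run):
# mount_staging hdcekaxmp-staging
```

On Linux, a GCSFuse mount can be released again with fusermount -u ~/gcs.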

Setting Up GCSFuse on Your Local Workstation

Configuring GCSFuse on your local workstation is nontrivial, but can be very useful. By mounting your staging bucket on both your Eureka App VM and your local workstation, you can seamlessly move files between systems without making calls to gsutil. See the following links for more information:

R on Eureka App VM

Each Eureka instance comes with R installed on the Eureka App VM. The version currently installed is R 3.6. If you need to update the version of R installed on your Eureka App VM, run the following command in your Eureka terminal:

sudo yum upgrade R

bigrquery lets you use R on the Eureka App VM to connect directly to data in Google BigQuery. It is a package available from CRAN (details on bigrquery can be found here). bigrquery v1.2.0 depends on other packages that must also be installed; guidance on installing packages from CRAN and other sources is provided below.

Below are 3 options for installing R packages on your Eureka App VM. Instructions for manually installing dependent R packages and Linux libraries are below as well.

Installing R packages from CRAN

Health Data Compass maintains a mirror of the CRAN repository. R is preconfigured within Eureka instances to access this mirror for installation of packages from CRAN using install.packages() in the usual way. If you have problems installing a standard CRAN package, please contact us.

Installing R packages from GitHub

Many R packages are hosted on GitHub. At present, for security reasons, Eureka virtual machines do not have access to GitHub, so the usual install_github() command in R will return an error. In addition, there is an incompatibility between GitHub and R in the way that .zip files are handled, which requires some additional steps.

Do the following:

  • Locate the repository containing the package you wish to install on github.com
  • Use the green button on the home page of the repository to download the repository in .zip format
  • Follow the instructions in Moving Files In & Out of Eureka to copy the .zip file to your Eureka virtual machine. The file can be stored anywhere on the virtual machine, but you may wish to place it in a folder to contain packages you install in this way, e.g., ~/mypackages.
  • From the command prompt of your Eureka virtual machine, unzip the .zip file, e.g.:
unzip github-repo.zip
  • From R, use install.packages(), specifying the path to the folder containing the package, e.g.:
install.packages('~/mypackages/github-repo-master', repos=NULL)
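
The unzip and install steps above could also be run together from the command prompt. The sketch below uses R CMD INSTALL on the unzipped folder in place of install.packages(); the function name install_github_zip is hypothetical, and it assumes the package sits at the top level of the repository, as in the example above.

```shell
# Hypothetical helper: unzip a GitHub archive and install it as a source package.
# $1 = path to the .zip file, $2 = folder to unpack into (defaults to ~/mypackages)
install_github_zip() {
  dest="${2:-$HOME/mypackages}"
  mkdir -p "$dest"
  # GitHub archives unpack into a single top-level folder, e.g. github-repo-master.
  top=$(unzip -Z1 "$1" | head -n 1 | cut -d/ -f1)
  unzip -q -o "$1" -d "$dest"
  R CMD INSTALL "$dest/$top"
}

# Example usage (uncomment to run):
# install_github_zip ~/mypackages/github-repo.zip
```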

Installing R packages from Other Sources

Installing R packages from other sources is possible. Take special care that you only download and install packages from trusted sources.

For .zip files, do the following:

  • Download the .zip file containing the package you wish to install to your local workstation.
  • Follow the instructions in Moving Files In & Out of Eureka to copy the .zip file to your Eureka virtual machine. The file can be stored anywhere on the virtual machine, but you may wish to place it in a folder to contain packages you install in this way, e.g., ~/mypackages.
  • From R, use install.packages(), specifying the path to the .zip file containing the package, e.g.:
install.packages('~/mypackages/the-package.zip', repos=NULL)

*Note: If you get an error stating "embedded nul in string", then the .zip file is probably suffering from the same incompatibility as described in Installing R packages from GitHub. Follow those instructions to unzip the repository and install it from its unzipped subfolder.

Stringi R Package:

This popular R package does not reside in the CRAN repository. To install from R, run the following command:

install.packages("stringi", configure.vars="ICUDT_DIR=/srv/repos/eureka/7/v2/files")

Installing missing Linux Libraries

Some R packages depend on Linux operating system libraries that may not be installed on your Eureka virtual machine by default. If install.packages returns errors about missing libraries, you can install these from the CentOS mirror maintained by Health Data Compass.

Do the following:

  • Identify the name of the package you wish to install, e.g., "curl"
  • From the command prompt of your Eureka App VM, install it using the yum package manager, e.g.:
sudo yum install curl

Manually Installing Dependencies

Many R packages are dependent on other R packages. Dependencies in CRAN will be resolved and installed automatically through Health Data Compass's CRAN mirror.

Unfortunately, dependencies hosted in other locations, such as GitHub, will need to be manually installed. You can simply attempt to install the base package using install.packages(), wait for an error complaining of a missing package, install the missing package, attempt to install the base package again, and repeat until all dependencies are found. But if the base package has many dependencies, it may be more efficient to view the DESCRIPTION file found within the base package .zip file. Look for the Imports and Suggests tags, which will list any required and suggested dependencies, respectively. You can then proactively install each of these dependent packages one at a time, using the instructions above, and then install the base package when all dependencies are in place.
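
To make the DESCRIPTION check above less manual, a small shell function can list the packages named in a given field. This is a sketch only: the function name description_deps is hypothetical, and it assumes the usual DESCRIPTION layout in which a field starts with "Field:" and continues on indented lines.

```shell
# Hypothetical helper: list the packages named in one field of a DESCRIPTION file.
# $1 = field name (e.g. Imports or Suggests), $2 = path to the DESCRIPTION file
description_deps() {
  awk -v field="$1" '
    # Start of the requested field: drop the "Field:" prefix and keep the rest.
    $0 ~ "^" field ":" { infield = 1; sub(/^[^:]*:[ \t]*/, ""); print; next }
    # Indented lines continue the current field.
    infield && /^[ \t]/ { print; next }
    # Any other line starts a different field.
    { infield = 0 }
  ' "$2" |
    tr ',' '\n' |
    sed -e 's/([^)]*)//' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' |
    grep -v '^$'
}

# Example usage (uncomment to run):
# description_deps Imports ~/mypackages/github-repo-master/DESCRIPTION
```

The output is one package name per line with version constraints such as (>= 1.0) removed, which makes it easy to compare against the packages already installed.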

Google Cloud Source Repository

Each Eureka instance has Google Cloud Source Repository set up and enabled for sharing code files between multiple users on a shared Eureka instance.

Installing Bioconductor repository within Eureka

Health Data Compass maintains a mirror of the Bioconductor repository for use with each Eureka instance. Connecting to the Bioconductor repository requires a one-time installation on each Eureka instance. If you have any issues completing the installation, contact Compass.

NOTE: The Bioconductor repository requires the latest version of R (as of July 2019, this is R 3.6).

  • From the command prompt of your Eureka App VM terminal, modify your .Rprofile to reference the Bioconductor repository by running the following command:
nano $HOME/.Rprofile
  • Add the following two lines to your .Rprofile and save the changes.
options(BioC_mirror = "http://bioc.hdcuser.org/bioc")
options(BIOCONDUCTOR_ONLINE_VERSION_DIAGNOSIS=FALSE)

The resulting profile should look like this:

local({
 r <- getOption("repos")
 r["CRAN"] <- "http://cran.hdcuser.org/"
 options(repos = r)
 options(BioC_mirror = "http://bioc.hdcuser.org/bioc")
 options(BIOCONDUCTOR_ONLINE_VERSION_DIAGNOSIS=FALSE)
})
  • Next, open R within the terminal by typing 'R' and pressing Enter. Then install the Bioconductor packages by running the following commands:
install.packages("http://bioc.hdcuser.org/BiocManager-1.30.5.tar.gz", repos = NULL, type = "source")
install.packages("BiocVersion", repos="http://bioc.hdcuser.org/bioc/packages/3.9/bioc")
BiocManager::install(version="3.9")
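
The .Rprofile edit in the first two steps above could also be scripted rather than made in nano. The sketch below appends each line only if it is not already present, so it is safe to run more than once; the function name ensure_line is hypothetical.

```shell
# Hypothetical helper: append a line to a file only if it is not already there.
# $1 = file, $2 = exact line to add
ensure_line() {
  touch "$1"
  grep -qxF -e "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

# Example usage (uncomment to run):
# ensure_line "$HOME/.Rprofile" 'options(BioC_mirror = "http://bioc.hdcuser.org/bioc")'
# ensure_line "$HOME/.Rprofile" 'options(BIOCONDUCTOR_ONLINE_VERSION_DIAGNOSIS=FALSE)'
```

Lines added this way land after the closing }) of the local() block instead of inside it. The options() calls should take effect either way, but check that the existing file ends with a newline before appending.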

Accessing Compass Data Sets in Eureka

If you have been authorized to access a Health Data Compass data set in BigQuery, you can safely view it from your Eureka App VM. You can also download it to your Eureka App VM for further analysis.

Via the Web User Interface

You can interact with BigQuery through the BigQuery user interface via your Eureka App VM. Simply open a web browser to https://console.cloud.google.com/bigquery. The user interface should be fairly self-explanatory. If given the option, we highly recommend using the “Beta” version of the user interface. Some helpful documentation here: https://cloud.google.com/bigquery/docs/bigquery-web-ui

Important: The BigQuery Web UI at the above links will allow you to access Compass datasets from your local workstation. This access is not approved and will fire an alert with our security monitoring team. Only access Compass datasets from your Eureka App VM.

Via the Command Line

You can access BigQuery datasets using the “bq” command line utility. This is a powerful utility, and full documentation can be found here: https://cloud.google.com/bigquery/docs/bq-command-line-tool. Below are a few simple examples for common uses:


Examples: Exploring Data

See what datasets you can access in a project:

bq --project_id [project-name] ls

See what tables are in a dataset:

bq --dataset_id [project-name]:[dataset-name] ls

Show the schema of a table:

bq show [project-name]:[dataset-name].[table-name] 

Show the first few rows of a table:

bq head [project-name]:[dataset-name].[table-name] 


Examples: Querying Data

*Note 1: The SELECT examples in this guide use standard SQL (--use_legacy_sql=false), in which table identifiers are written with periods, as [project-name].[dataset-name].[table-name]. If the name of the project that contains the data you are querying has a hyphen in it, you may need to surround any table identifiers with backticks, as follows: `[project-name].[dataset-name].[table-name]`. Enclose the query in single quotes so that the shell does not interpret the backticks as command substitution.

*Note 2: In the examples below, [PROJECT] and [DATASET] refer to the project and dataset that contain the data you wish to query, not necessarily your own Eureka project.


Execute a SELECT query from the command line and view the results:

bq query --use_legacy_sql=false 'SELECT * FROM `[PROJECT].[DATASET].[TABLE]`'

Execute a SELECT query that is stored in a file (useful for more complex queries) and view the results:

cat [LOCAL-SQL-FILENAME] | bq query --use_legacy_sql=false


Examples: Downloading Data

*Note 1: In the examples below, [PROJECT]:[DATASET] refers to the project and dataset that contains the data you wish to query, not necessarily your own Eureka project.

*Note 2: The “bq query” command will return a maximum of 16,000 rows. For larger datasets, see the example for “bq extract”


Output the results of a SELECT command to a CSV file:

bq query --use_legacy_sql=false --format=csv 'SELECT * FROM `[PROJECT].[DATASET].[TABLE]`' > result.csv

Export a table to a file in your Google Cloud Storage Staging Bucket:

bq extract --destination_format CSV --field_delimiter "," [PROJECT]:[DATASET].[TABLE] gs://[EUREKA-PROJECT]-staging/[FILENAME]


Copy a file from your Google Cloud Storage Staging Bucket to your App VM:

gsutil cp gs://[EUREKA-PROJECT]-staging/[FILENAME] [FILENAME]


Export the results of a large query (>16,000 resulting rows) to your BigQuery Staging Dataset:

1. Query the data and store the results in a new BigQuery table.

bq query --use_legacy_sql=false --destination_table [EUREKA-PROJECT-ID]:staging.[TABLE] 'SELECT * FROM `[PROJECT].[DATASET].[TABLE]`'

2. Use the instructions above to export the new table to a file in Google Cloud Storage.

3. Use the instructions above to copy the file from your Google Cloud Storage Staging Bucket to your App VM.
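
The three steps above can be chained in one shell function, stopping at the first step that fails. This is a sketch under stated assumptions: the function name export_query is hypothetical, the exported file is named after the table, and the result is assumed small enough for a single export file (BigQuery requires a wildcard in the destination URI for exports larger than 1 GB).

```shell
# Hypothetical helper chaining the three steps above.
# $1 = your Eureka project id, $2 = name for the new staging table,
# $3 = standard SQL query text
export_query() {
  bq query --use_legacy_sql=false --destination_table "$1:staging.$2" "$3" &&
    bq extract --destination_format CSV "$1:staging.$2" "gs://$1-staging/$2.csv" &&
    gsutil cp "gs://$1-staging/$2.csv" "$2.csv"
}

# Example usage (uncomment to run; the placeholders are yours to fill in):
# export_query [EUREKA-PROJECT-ID] [TABLE] 'SELECT * FROM `[PROJECT].[DATASET].[TABLE]`'
```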

Idle Shutdown of VM

Each Eureka instance is pre-configured to shut down the VM after 30 minutes without detected usage. If you want to temporarily disable the idle shutdown, run the following command from your VM terminal window:

  • sudo systemctl stop idleshutdown

If you disable the idle shutdown, you are responsible for manually shutting down the VM when you are no longer using it.

The pre-configured idle shutdown is re-enabled whenever the VM is rebooted. Until then, you will need to shut down the VM manually.
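
If the only reason to disable idle shutdown is a single long-running job, a wrapper can pause the service for the duration of that job and restore it afterwards. This is a sketch only: the function name is hypothetical, and it assumes that sudo systemctl start idleshutdown restarts the same service that the stop command above halts, which has not been confirmed for Eureka.

```shell
# Hypothetical helper: run a command with idle shutdown paused, then restore it.
# Assumes "systemctl start idleshutdown" restarts the service that
# "systemctl stop idleshutdown" stops.
with_idle_shutdown_paused() {
  sudo systemctl stop idleshutdown
  "$@"
  status=$?
  sudo systemctl start idleshutdown
  return "$status"
}

# Example usage (uncomment to run; my_analysis.R is a placeholder):
# with_idle_shutdown_paused Rscript my_analysis.R
```

The wrapper returns the exit status of the wrapped command, so it can be used inside larger scripts.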