Debugging Reflow

Debugging multiple jobs from reflow runbatch

How to capture standard error (stderr) if all your output disappears

If you're running inside screen, output may scroll away or disappear before you have a chance to read it. In that case, redirect stderr to a file so you can inspect it afterwards:

ubuntu@olgabot-reflow:~/kmer-hashing/sourmash/maca/facs$ reflow runbatch 2> err

ubuntu@olgabot-reflow:~/kmer-hashing/sourmash/maca/facs$ less err
reflow: batch program ~/reflow-workflows/sourmash.rf runsfile samples.csv
reflow: batch program ~/reflow-workflows/sourmash.rf runsfile samples.csv
open /home/ubuntu/.reflow/runs/12f9a53350fa89e25c121c2b56c2e22a92ce923c55656dbe776432c974ff3f38.json: too many open files
open /home/ubuntu/.reflow/runs/12f9a53350fa89e25c121c2b56c2e22a92ce923c55656dbe776432c974ff3f38.json: too many open files
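
To keep watching progress on screen while still saving a copy of the messages, both streams can be merged and piped through tee. The sketch below uses a stand-in function, noisy_job (hypothetical, not part of Reflow), in place of reflow runbatch so that it can be run anywhere; err and all.log are arbitrary file names.

```shell
# Stand-in for `reflow runbatch` (hypothetical): writes one line to stdout
# and one line to stderr.
noisy_job() {
    echo "progress: 1 of 3 samples"
    echo "open run.json: too many open files" >&2
}

workdir=$(mktemp -d)

# 1) Save only stderr to a file; stdout still goes to the terminal.
noisy_job 2> "$workdir/err"

# 2) Merge stderr into stdout, show everything on screen, and keep a copy.
noisy_job 2>&1 | tee "$workdir/all.log"

# With the real command this would be: reflow runbatch 2>&1 | tee all.log
```

The first form matches the redirect shown above; the second is handy when you also want the regular progress output in the same file.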

How to check on your running batch

First, open a new screen or tmux window. In this example, my batch information is in the folder ~/kmer-hashing/sourmash/maca/facs, so we change to that directory because it contains the config.json:

cd ~/kmer-hashing/sourmash/maca/facs
reflow batchinfo

How to look at the last 20 lines of ALL log files

The last 20 lines of each log usually show how far along each sample is. There is a lot of output, so pipe it into less -S (which does not wrap long lines) to make it easy to scroll through.

tail -n 20 log.* | less -S

If you get a "too many files" error (for example, the shell complaining that the argument list is too long), pass the file names through xargs instead:

ls -1 | grep -E '^log\.' | xargs tail -n 20 | less -S

How to look at nonzero log files

You will have a LOT of log files, many of them empty, which is annoying. To look only at the ones with content, filter the ls -lha output for files whose human-readable size ends in K, i.e. {some digit}K, with grep -P '\dK' (widen the pattern to '\d[KMG]' if your logs get larger). Then take the file name column (column 9), feed those files into tail with xargs, and use less so you can page through everything. awk is used to pick the column because ls pads its columns with a variable number of spaces, which confuses cut:

ls -lha | grep -P '\dK' | grep log | awk '{print $9}' | xargs tail -n 50 | less
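
Parsing ls output by column is fragile, so as an alternative sketch, find can select the non-empty log files directly and hand them to tail in batches, which also avoids overly long argument lists. The function name tail_nonempty_logs and its arguments are made up for this example; it assumes the log files are named log.* as above.

```shell
# Print the last N lines of every non-empty log.* file in a directory.
# Usage: tail_nonempty_logs [directory] [lines]
tail_nonempty_logs() {
    dir=${1:-.}
    lines=${2:-50}
    # -size +0 skips empty files; -exec ... {} + passes the file names to
    # tail in batches, much like xargs would.
    find "$dir" -maxdepth 1 -type f -name 'log.*' -size +0 \
        -exec tail -n "$lines" {} +
}
```

For example, tail_nonempty_logs . 20 | less -S pages through the last 20 lines of each non-empty log in the current directory.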

How to look at the created files

It can also be useful to take a peek at the files as they're getting created. In this particular workflow, we're creating a {sample_id}.signature file for every sample in the S3 bucket s3://olgabot-maca/facs/sourmash/. We can watch the file list grow by listing the bucket with aws s3 ls and counting the lines with wc -l:

ubuntu@olgabot-reflow:/mnt/data/maca/facs$ aws s3 ls s3://olgabot-maca/facs/sourmash/ | wc -l
1086
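
A raw count does not say which samples are still missing. As a rough sketch, the listing can be compared against the sample IDs in the batch's samples.csv. This assumes that samples.csv has a header row with the sample ID in its first column, and that each finished sample produces {sample_id}.signature as described above. The function name missing_signatures and the file name listing.txt are made up for illustration.

```shell
# List sample IDs that do not yet have a .signature file.
# $1: samples.csv (header row, sample ID in the first column; assumed layout)
# $2: file holding the saved output of `aws s3 ls <bucket prefix>`
missing_signatures() {
    expected=$(mktemp)
    finished=$(mktemp)
    # Sample IDs we expect: first CSV column, header row skipped.
    tail -n +2 "$1" | cut -d, -f1 | sort > "$expected"
    # Sample IDs already present: 4th column of the listing, suffix removed.
    awk '$4 ~ /\.signature$/ { sub(/\.signature$/, "", $4); print $4 }' "$2" \
        | sort > "$finished"
    # comm -23 keeps the lines that appear only in the first file.
    comm -23 "$expected" "$finished"
    rm -f "$expected" "$finished"
}

# Usage (listing.txt is an arbitrary name):
#   aws s3 ls s3://olgabot-maca/facs/sourmash/ > listing.txt
#   missing_signatures samples.csv listing.txt
```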

What to do when you have a bunch of jobs "waiting"

If you run reflow runbatch and then walk away, you'd hope that all your jobs are done when you get back. Unfortunately, sometimes only a few of them have finished.

Check the output of reflow listbatch. If your batch failed, it will show a bunch of jobs as "waiting". An example excerpt is below:

ubuntu@ip-172-31-42-179:~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison$ reflow listbatch
A1-B002427-3_39_F-1-1_trim=false_scaled=100     12af8e26 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1000    9568d8b3 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1100    6bcd3d1a waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1200    8508d407 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1300    c38167a2 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1400    297c2a9f waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1500    6daec549 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1600    345e906f waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1700    1d2d96df waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1800    c2031915 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1900    8aefd12a waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=200     273354de waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=2000    d026b17e waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=2500    9e06336f waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=300     dee9cefc waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=3000    0efb3c07 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=3500    61df5887 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=400     818f00a6 waiting
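
With hundreds of runs, a per-state tally is easier to read than the full listing. The sketch below counts runs by state, assuming the state is the last whitespace-separated field of each line, as in the excerpt above. The function name count_batch_states is made up for this example.

```shell
# Count batch runs per state. Reads `reflow listbatch` output on stdin and
# assumes the state is the last whitespace-separated field of each line.
count_batch_states() {
    awk 'NF { counts[$NF]++ } END { for (s in counts) print s, counts[s] }' \
        | sort
}

# Usage: reflow listbatch | count_batch_states
```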

Solution: run reflow runbatch -retry, which will rerun your jobs.

How to force rerunning of a workflow

When Reflow finishes a job successfully, it considers the job done and caches the result. To force a rerun that ignores the cache, use either:

reflow run -eval=bottomup -invalidate='.*' myworkflow.rf

or

reflow -cache=off run myworkflow.rf

Debugging single jobs - inspect running and dead jobs

Shell into current exec environment

While an exec is running, you can shell into its environment with reflow shell: get the exec URI via reflow ps -l, then pass it to reflow shell. Open a new terminal when you do this :)

For example:

reflow shell ec2-54-214-227-181.us-west-2.compute.amazonaws.com:9000/f1d4fc064c7a85c8/f046b4086edc0cee84b46d633a43fff01d203d4b3c92442cf9a77d0d7276f000

SSH into a running instance

If you have a public SSH key in ~/.ssh/id_rsa.pub, it will be installed automatically on the Reflow instances, and you can ssh in to each instance (as the user "core"), e.g.: ssh core@ec2-54-214-227-181.us-west-2.compute.amazonaws.com
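
In the examples above, the host name at the front of the exec URI is the same one used for SSH. Assuming the exec URI always has the form host:port/..., the SSH target can be derived from it with a shell parameter expansion. The function name ssh_target_from_exec_uri is made up for this sketch.

```shell
# Derive the SSH target from an exec URI of the form host:port/...
# The user name "core" is the one mentioned above.
ssh_target_from_exec_uri() {
    # ${1%%:*} removes everything from the first ":" onward, leaving the host.
    printf 'core@%s\n' "${1%%:*}"
}

# Usage: ssh "$(ssh_target_from_exec_uri "<exec URI from reflow ps -l>")"
```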

Retrieve intermediate files

If you want to inspect files that were produced by intermediate stages, you can retrieve them with reflow cat, e.g., reflow cat sha256:... > myfile.

Fixing "remote error: tls: bad certificate"

If you get an error that looks like this:

~/code/reflow-workflows$ reflow run bedtools.rf
reflow: run ID: 3b527e93
reflow: ec2cluster: error while waiting for offers: offers ec2-18-237-51-122.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-18-237-51-122.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: remote error: tls: bad certificate
reflow: ec2cluster: error while waiting for offers: offers ec2-35-167-37-93.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-35-167-37-93.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: remote error: tls: bad certificate
reflow: ec2cluster: error while waiting for offers: offers ec2-35-167-37-93.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-35-167-37-93.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: remote error: tls: bad certificate
reflow: ec2cluster: error while waiting for offers: offers ec2-52-40-64-53.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-52-40-64-53.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: remote error: tls: bad certificate
reflow: ec2cluster: error while waiting for offers: offers ec2-52-40-64-53.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-52-40-64-53.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: remote error: tls: bad certificate
reflow: ec2cluster: error while waiting for offers: offers ec2-54-189-82-93.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-54-189-82-93.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: remote error: tls: bad certificate
reflow: ec2cluster: error while waiting for offers: offers ec2-54-189-82-93.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-54-189-82-93.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: remote error: tls: bad certificate
reflow: ec2cluster: error while waiting for offers: offers ec2-34-216-195-169.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-34-216-195-169.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: remote error: tls: bad certificate
reflow: ec2cluster: error while waiting for offers: offers ec2-34-216-195-169.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-34-216-195-169.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: remote error: tls: bad certificate
reflow: ec2cluster: error while waiting for offers: offers ec2-34-211-39-196.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-34-211-39-196.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: remote error: tls: bad certificate
reflow: ec2cluster: error while waiting for offers: offers ec2-34-211-39-196.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-34-211-39-196.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: remote error: tls: bad certificate
reflow: ec2cluster: error while waiting for offers: offers ec2-54-187-47-140.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-54-187-47-140.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: remote error: tls: bad certificate
reflow: ec2cluster: error while waiting for offers: offers ec2-54-187-47-140.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-54-187-47-140.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: remote error: tls: bad certificate
reflow: ec2cluster: error while waiting for offers: offers ec2-34-219-99-230.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-34-219-99-230.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: remote error: tls: bad certificate
reflow: ec2cluster: error while waiting for offers: offers ec2-34-219-99-230.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-34-219-99-230.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: remote error: tls: bad certificate
reflow: ec2cluster: error while waiting for offers: offers ec2-52-35-166-218.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-52-35-166-218.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: remote error: tls: bad certificate
reflow: ec2cluster: error while waiting for offers: offers ec2-52-35-166-218.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-52-35-166-218.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: remote error: tls: bad certificate
reflow: ec2cluster: error while waiting for offers: offers ec2-18-237-162-121.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-18-237-162-121.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: remote error: tls: bad certificate
reflow: ec2cluster: error while waiting for offers: offers ec2-18-237-162-121.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-18-237-162-121.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: remote error: tls: bad certificate
ec2cluster: 0 instances:  (<=$0.0/hr), total{}, waiting{mem:2.0GiB cpu:1 disk:1.0GiB}, pending{mem:3.7GiB cpu:2 disk:250.0GiB intel_avx512:2}
  allocate {mem:2.0GiB cpu:1 disk:1.0GiB}:  provisioning new instance                  49m28s
  i-05f695bddc8609e65:                      waiting for reflowlet to become available  2m4s

To fix this, delete your local reflow.pem file:

rm -f ~/.reflow/reflow.pem
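
If you would rather keep the old file around, a variant is to move it aside instead of deleting it. The function name set_aside_pem and the .bak suffix are arbitrary choices for this sketch; the default path is the one used above.

```shell
# Move a certificate file aside instead of deleting it.
# $1: path to the file (defaults to ~/.reflow/reflow.pem)
set_aside_pem() {
    pem=${1:-"$HOME/.reflow/reflow.pem"}
    if [ -f "$pem" ]; then
        mv "$pem" "$pem.bak"
        echo "moved $pem to $pem.bak"
    else
        echo "no file at $pem" >&2
        return 1
    fi
}

# Usage: set_aside_pem
```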