Cloud TPU
$4.50 USD per TPU per hour.
(preemptible: $1.35 USD per TPU per hour)

Pricing

AWS Elastic GPU Cost
eg1.2xlarge: 8 GiB GPU memory, $0.400/hour
g3.16xlarge: 64 vCPU, 188 ECU, 488 GiB memory, EBS only, $4.56/hour
NOTES 
NEED TO USE us-central1-f ZONE!! 
ctpu --zone us-central1-f -log-http ls
You need a Google Cloud Storage bucket to store models and whatnot
If you want to delete it: 
gsutil rm -r gs://YOUR-BUCKET-NAME
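For reference, the bucket can be created ahead of time; a minimal sketch, assuming you want it in us-central1 to sit next to the TPU zone:

gsutil mb -l us-central1 gs://YOUR-BUCKET-NAME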
Tutorial Link
Questions 
TensorFlow vs PyTorch
TensorBoard
Setup link for Google Cloud (connected to a Compute Engine VM and Cloud TPU)
Benchmark Comparisons
According to this link, TPUs are faster and also cheaper when accounting for time saved. 
Another link here.
TensorFlow Research Cloud Program
"If you're enrolled in the TFRC program you are granted access to Cloud TPU for a limited period of time free of charge. You are not charged for Cloud TPU as long as your TPUs are running in zone us-central1-f."
How TPUs Work Doc
Link here
TensorFlow
- TPUs are faster
- Drawbacks: you need to adhere to the Estimator framework, which is more complex and definitely not as flexible as PyTorch (see the sketch after this list).
- Static computation graph (vs. dynamic), which means things like variable-sized inputs aren't handled well. You can't just use a for loop, for example.
- The compute graph is created NOT on the host machine but on the TPU device itself, so you can't just initialize a session and inspect intermediate tensors. This is really unfortunate because I rely on that a lot when trying to debug what's going on.
- I think the TF Estimator framework suffers from over-engineering. There's a lot of subclassing going on, so it can be hard to follow.
- tf.decode_raw with int8 is not supported on TPU.
- Really hard to debug when things go wrong.
- Not well documented. For example, the TFRecord format (tf.train.Example, tf.train.Features); the community seems fairly frustrated about this as well ().
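For context, here is a minimal sketch of the Estimator/TPUEstimator pattern referenced above (TF 1.x contrib APIs; the toy model_fn and the names my-tpu and gs://my-bucket/model are placeholders, not the tutorial's actual code):

import tensorflow as tf

def model_fn(features, labels, mode, params):
    # Toy network standing in for the real model.
    logits = tf.layers.dense(tf.layers.flatten(features), 1000)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    optimizer = tf.train.AdamOptimizer()
    # The optimizer must be wrapped so gradients are aggregated across TPU cores.
    optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)

run_config = tf.contrib.tpu.RunConfig(
    cluster=tf.contrib.cluster_resolver.TPUClusterResolver(tpu='my-tpu'),
    model_dir='gs://my-bucket/model',
    tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=100))

estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn, config=run_config, train_batch_size=128, use_tpu=True)
# estimator.train(input_fn=my_input_fn, max_steps=1000)  # input_fn must return a tf.data.Dataset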
Commands 
To run ResNet on TPU:
python resnet_main.py --tpu=$TPU_NAME --data_dir=gs://cloud-tpu-test-datasets/fake_imagenet --precision='float32' --model_dir=${STORAGE_BUCKET}/resnet
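The command assumes these shell variables are already set; a minimal sketch (the names are placeholders):

export TPU_NAME=my-tpu                       # the TPU created via ctpu up
export STORAGE_BUCKET=gs://YOUR-BUCKET-NAME  # the GCS bucket from the notes above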
Input Transformations
INPUTS:
<tf.Tensor 'truediv:0' shape=(128, 512, 512, 3) dtype=float32>

INITIAL CONV:
Tensor("initial_conv:0", shape=(128, 256, 256, 64), dtype=float32)

INITIAL POOL:
Tensor("initial_max_pool:0", shape=(128, 128, 128, 64), dtype=float32)

AFTER BLOCK 1 (num filters 64, stride 1):
Tensor("block_group1:0", shape=(128, 128, 128, 256), dtype=float32)

AFTER BLOCK 2 (num filters 128, stride 2):
Tensor("block_group2:0", shape=(128, 64, 64, 512), dtype=float32)

AFTER BLOCK 3 (num filters 256, stride 2):
Tensor("block_group3:0", shape=(128, 32, 32, 1024), dtype=float32)

AFTER BLOCK 4 (num filters 512, stride 2):
Tensor("block_group4:0", shape=(128, 16, 16, 2048), dtype=float32)

AFTER AVERAGE POOLING:
<tf.Tensor 'final_avg_pool:0' shape=(128, 1, 1, 2048) dtype=float32>

FINAL DENSE:
<tf.Tensor 'final_dense:0' shape=(128, 1000) dtype=float32>
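A quick sanity check of the spatial sizes above (assuming the standard ResNet-50 stem of a stride-2 conv followed by a stride-2 max pool):

size = 512                    # input spatial size, from the INPUTS tensor
size //= 2                    # initial 7x7 conv, stride 2  -> 256
size //= 2                    # initial max pool, stride 2  -> 128
for stride in (1, 2, 2, 2):   # block groups 1-4
    size //= stride           # -> 128, 64, 32, 16
print(size)                   # 16, matching block_group4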

Within Block Fns
START 
<tf.Tensor 'initial_max_pool:0' shape=(128, 56, 56, 64) dtype=float32>

<tf.Tensor 'Relu_3:0' shape=(128, 56, 56, 256) dtype=float32>

<tf.Tensor 'Relu_6:0' shape=(128, 56, 56, 256) dtype=float32>
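These shapes match the standard bottleneck residual block (1x1 -> 3x3 -> 1x1 with a 4x channel expansion). A minimal sketch, ignoring batch norm and the exact shortcut-projection logic in the tutorial code:

import tensorflow as tf

def bottleneck_block(inputs, filters, strides):
    # Project the shortcut so channel counts match the expanded output.
    shortcut = tf.layers.conv2d(inputs, 4 * filters, 1, strides=strides, padding='same')
    x = tf.layers.conv2d(inputs, filters, 1, strides=1, padding='same', activation=tf.nn.relu)
    x = tf.layers.conv2d(x, filters, 3, strides=strides, padding='same', activation=tf.nn.relu)
    x = tf.layers.conv2d(x, 4 * filters, 1, strides=1, padding='same')
    return tf.nn.relu(x + shortcut)  # e.g. 56x56x64 in -> 56x56x256 out for filters=64, strides=1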
Google TPU ResNet Tutorial
Encoder
AFTER BLOCK 4 (num filters 512, stride 2):
Tensor("block_group4:0", shape=(128, 16, 16, 2048), dtype=float32)

Translates To:
<tf.Tensor 'conv2d_53/Conv2D:0' shape=(128, 16, 16, 8) dtype=float32>
Decoder
After First Expansion to 128 channels:
<tf.Tensor 'Relu_49:0' shape=(128, 16, 16, 128) dtype=float32>



Now Need To Go To
<tf.Tensor 'truediv:0' shape=(128, 256, 256, 3) dtype=float32>
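A minimal sketch of a decoder that bridges those two shapes, assuming four stride-2 transposed convolutions (the intermediate filter counts are guesses; only the start and end shapes come from the notes):

import tensorflow as tf

def decoder(features):  # features: [128, 16, 16, 128]
    x = features
    for filters in (64, 32, 16):  # spatial size 16 -> 32 -> 64 -> 128
        x = tf.layers.conv2d_transpose(x, filters, 3, strides=2, padding='same',
                                       activation=tf.nn.relu)
    # Final upsample to 256x256 and projection down to 3 channels.
    return tf.layers.conv2d_transpose(x, 3, 3, strides=2, padding='same')  # [128, 256, 256, 3]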
Optimization Performed
- TFRecords are loaded NOT from the instance but by the TPU itself (lazily), straight from the bucket, then deserialized and parsed. I used TFRecords since they are apparently the paradigm here for fast input performance.
- They do a lot of preprocessing to the images before writing them into the records.
- Spent a large chunk of time serializing and deserializing.
- Serializing an image into a numpy array to then be deserialized by TensorFlow on the TPU is a pain (see the sketch below).
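A minimal sketch of that round trip (TF 1.x names; the 512x512x3 uint8 image and the label field are assumptions, not the exact records used here):

import numpy as np
import tensorflow as tf

def write_example(image, path):
    # Serialize a numpy image into a tf.train.Example and write it to a TFRecord file.
    feature = {
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image.tobytes()])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[0])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    with tf.python_io.TFRecordWriter(path) as writer:
        writer.write(example.SerializeToString())

def parse_example(serialized):
    # Deserialize in the input pipeline; note the int8 caveat above, so decode as uint8.
    parsed = tf.parse_single_example(serialized, {
        'image': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([], tf.int64),
    })
    image = tf.reshape(tf.decode_raw(parsed['image'], tf.uint8), [512, 512, 3])
    return tf.cast(image, tf.float32), parsed['label']

write_example(np.zeros([512, 512, 3], np.uint8), 'example.tfrecord')
dataset = tf.data.TFRecordDataset('example.tfrecord').map(parse_example)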
Memory Issues
Batch size of 8. Finished training up to step 1251. Elapsed seconds 302.
Batch size of 80. 1251 steps. Elapsed seconds 503.
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Compilation failure: Ran out of memory in memory space hbm. Used 10.38G of 8.00G hbm. Exceeded hbm capacity by 2.38G.
New Note
Justin Johnson's setup: 
https://www.reddit.com/r/MachineLearning/comments/8g3akw/d_whats_your_setup_for_training_ml_models_in_the/
Other Cloud Provider
https://www.paperspace.com/
TODO: Google Cloud
- Pricing for comparable p3 instance
- connect with Sasha 
Loss construction
- why are we solving
MultiLoss Project
- a
Task Space Project
- d
- heuristic reductions
- Input: Fixing the first space
- Output: other domains
- Finding the task means finding output spaces 