Fast nano-gpt training
In this blog post I document my code and experiment results while following Karpathy's latest GPT-2 training tutorial video:
https://youtu.be/l8pRSuU81PU?si=2sEmtmn56XBMPTbU
Code repo url: https://github.com/BilyZ98/nano-gpt
What is TFLOPS? Tera (10^12) floating-point operations per second.
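As a rough back-of-the-envelope check (my own estimate, using the common ≈6 FLOPs per parameter per token rule of thumb for forward + backward, and the parameter count and tok/sec figures logged later in this post):

# rough achieved-throughput estimate for this model; ignores attention FLOPs
n_params = 10_921_049            # total parameters reported by the training script
tok_per_sec = 542_632            # tok/sec from the TF32 run below
flops_per_token = 6 * n_params   # ~6 FLOPs per parameter per token (fwd + bwd)
print(f"~{flops_per_token * tok_per_sec / 1e12:.1f} TFLOP/s achieved")   # ~35.6

That is only a small fraction of an A800's tensor-core peak, which is expected for a model this small.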
Weight sharing
Share the weight of lm_head and the token embedding table (weight_token_embedding).
This weight sharing saves a huge amount of memory: the shared matrix has n_embd * vocab_size entries, so we avoid storing (and learning) it twice.
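A minimal sketch of the tying (assuming the model uses token_embedding_table and lm_head, as in the code later in this post; whether lm_head has a bias does not affect the tying):

# inside the model's __init__
self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
# weight sharing: both modules point at the same (vocab_size, n_embd) tensor
self.lm_head.weight = self.token_embedding_table.weight

For the GPT-2 124M config in the video this single matrix is 50257 x 768 ≈ 38.6M parameters, roughly 30% of the model, so sharing it is a big saving.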
Lower precision
Default tensor precision:
model device cuda:0
m device cuda:0
torch.float32
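For the TF32 runs below, TF32 matmuls are enabled with PyTorch's one-line global switch (this is the standard API; call it once before training starts):

# allow TensorFloat-32 tensor cores for float32 matmuls (Ampere and newer GPUs)
torch.set_float32_matmul_precision('high')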
Without TF32: Memory usage:
Every 1.0s: nvidia-smi Sat Jul 6 16:51:57 2024
Sat Jul 6 16:51:57 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A800 80G... On | 00000000:4F:00.0 Off | 0 |
| N/A 71C P0 309W / 300W | 5359MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A800 80G... On | 00000000:50:00.0 Off | 0 |
| N/A 29C P0 46W / 300W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
Output:
step <built-in function iter>: train loss1.1116, val loss 1.1166
step 4500, loss: 1.1580688953399658, dt: 113.35ms, tok/sec: 144546.60
Time taken: 739.9126691818237 seconds
Total parameters: 10921049
Trainable parameters: 10921049
ex is consists of centuries and want-wetlife science method ankners, from the ways wheredco by Benrys, where you're rate ar harm browling preservation in musicians. In Athletes have continuies to munically, and effects create cro-intricate jaz towantical navigating vail respectives to diseariety.
**Befor Hera's Players**
Prannit**
Infaming the game's early which commercial quirtual health, and mobile exammatring sound in football in pain this visualization involvement, is home to shapel key th
With TF32: Memory usage:
Every 0.1s: nvidia-smi Sat Jul 6 17:11:38 2024
Sat Jul 6 17:11:38 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A800 80G... On | 00000000:4F:00.0 Off | 0 |
| N/A 64C P0 309W / 300W | 5359MiB / 81920MiB | 99% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A800 80G... On | 00000000:50:00.0 Off | 0 |
| N/A 31C P0 47W / 300W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
Output
step 4998, loss: 1.1807901859283447, dt: 30.66ms, tok/sec: 534324.52
step 4999, loss: 1.1486560106277466, dt: 30.19ms, tok/sec: 542632.14
Time taken: 315.7020351886749 seconds
Total parameters: 10921049
Trainable parameters: 10921049
* Nuriety pattern to Play blogge: The misssiol strategies and transfolk to wo halley focused our work's own forecative ar unique planning. This articiples in the Laine Golden
The Host Potential's top init's experienterment and its particular type and regional warveillance. In this article, I'll shedd a fascinat words contributies and other athlete's early what draw from genre, col up jands, lim body eparthory, there ancient work of pairitization criticallysis.
**Addition Enthusiast Players**
Memory usage does not change much, but training is much faster: roughly 113 ms/step with FP32 versus 30 ms/step with TF32, i.e. about 3.7x the token throughput.
With BF16: Memory usage:
Every 0.1s: nvidia-smi Sat Jul 6 17:19:50 2024
Sat Jul 6 17:19:50 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A800 80G... On | 00000000:4F:00.0 Off | 0 |
| N/A 61C P0 280W / 300W | 5493MiB / 81920MiB | 98% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A800 80G... On | 00000000:50:00.0 Off | 0 |
| N/A 32C P0 47W / 300W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
Output:
step <built-in function iter>: train loss1.1083, val loss 1.1138
step 4500, loss: 1.149060606956482, dt: 46.36ms, tok/sec: 353409.81
Time taken: 323.1115171909332 seconds
Total parameters: 10921049
Trainable parameters: 10921049
* Thre-known variety is Mobile time forishing and a Piece**
Lerious la trend ammach focusion solus, the ways can date on how theselves presents, and cleaning ung time. While the Long-Anti Central to mitigate efforts, referred time
1. The Ritowastic Qrench Literian ass make-had a numberous unfounded women's explore a court of texture or the engage's earching try of from generating secrets. This benefit for a deterrmining movie excebrations, which its form its applyestriants. By regulating the
BF16 is not faster than TF32 here. Why is that?
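For reference, the BF16 run only wraps the forward pass and loss computation in autocast; a sketch of the relevant part of the training step (parameters and optimizer state stay in float32):

optimizer.zero_grad(set_to_none=True)
# run matmuls in bfloat16 while keeping master weights in float32
with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
    logits, loss = model(xb, yb)
loss.backward()
optimizer.step()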
+ torch.compile()
I got this error at runtime, so I downgraded the Python version to 3.11:
Traceback (most recent call last):
File "/GPUFS/nsccgz_qylin_1/zt/gpt-dev/persona_gpt.py", line 236, in <module>
model = torch.compile(model)
^^^^^^^^^^^^^^^^^^^^
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.12/site-packages/torch/__init__.py", line 1868, in compile
raise RuntimeError("Dynamo is not supported on Python 3.12+")
RuntimeError: Dynamo is not supported on Python 3.12+
Then I got another error:
/tmp/tmpd5t0oroc/main.c: In function ‘list_to_cuuint64_array’:
/tmp/tmpd5t0oroc/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
for (Py_ssize_t i = 0; i < len; i++) {
^
/tmp/tmpd5t0oroc/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code
/tmp/tmpd5t0oroc/main.c: In function ‘list_to_cuuint32_array’:
/tmp/tmpd5t0oroc/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
for (Py_ssize_t i = 0; i < len; i++) {
^
/tmp/tmp0td94o5b/main.c: In function ‘list_to_cuuint64_array’:
/tmp/tmp0td94o5b/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
for (Py_ssize_t i = 0; i < len; i++) {
^
I installed the latest GCC version (the old one was 4.8.5), which fixed the error above.
Output
step <built-in function iter>: train loss1.1037, val loss 1.1087
step 4500, loss: 1.151484727859497, dt: 23.18ms, tok/sec: 706728.75
Time taken: 285.365624666214 seconds
Total parameters: 10921049
Trainable parameters: 10921049
emphasizing the parming of a spirit on the development conouches tended from the factor, such as jailust power traffice, and ixo, these can serve the alternatively takes part in processed.
**4. Inform 4applications Movements**
1. **Mush Secrets:** Students in Parleers (airl Warrioring change: Bhern Keshedmn) world, as initial do other basebar from leagues, to employ create messaters and their immpact on our camido opportunities. Its gain political windows, proteins and resoling affian languag
+ Flash attention
Karpathy gives a brief introduction to flash attention in his video; check out this part to learn more: https://youtu.be/l8pRSuU81PU?t=7521
Code:
class Head(nn.Module):
    """One head of self-attention"""

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)   # (B, T, head_size)
        q = self.query(x) # (B, T, head_size)
        v = self.value(x) # (B, T, head_size)
        # manual attention, replaced by flash attention below:
        # wei = q @ k.transpose(-2, -1) * C ** -0.5
        # wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        # wei = F.softmax(wei, dim=-1) # (B, T, T)
        # wei = self.dropout(wei)
        # out = wei @ v # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return out
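One small difference from the commented-out manual path: dropout is no longer applied to the attention weights. If that matters, F.scaled_dot_product_attention also accepts a dropout_p argument (e.g. dropout_p=dropout if self.training else 0.0). With is_causal=True the causal masking, softmax, and matmuls are fused into a single flash-attention kernel, which is where the speedup comes from.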
Output
step <built-in function iter>: train loss1.3369, val loss 1.3376
step 4500, loss: 1.3740415573120117, dt: 22.38ms, tok/sec: 731945.94
Time taken: 288.25680232048035 seconds
Total parameters: 10921049
Trainable parameters: 10921049
facilitation slowing technique it.
In conclusion, is jilitary Ketins and scale, coreting the programphs of cocentries, many only soidhts:
1. **Ingluentalle performances**: Epchilitations and dinactions has functively cardolles experience witting, significantlyqtually.
3. **Pobitice, SStédio micrositiona Ristice**: The Presed full fame leintarians foreices, day exemisions and community to provide a more endaging of his play byautices. The New clear Ara Case gor as mains, Aheragement textures th
Throughput increases by about 4% compared to the previous version (≈707k → ≈732k tok/sec), which is not a lot. I guess this is because the model is not big enough to benefit from flash attention.
Use distributed data parallel (DDP)
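Here is roughly the DDP setup I use, condensed into a sketch (it follows the torchrun convention from the video; RANK / LOCAL_RANK / WORLD_SIZE are the environment variables torchrun sets, and the variable names match the ones that appear in the code and logs later in this post):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

ddp = int(os.environ.get('RANK', -1)) != -1   # are we launched by torchrun?
if ddp:
    dist.init_process_group(backend='nccl')
    ddp_rank = int(os.environ['RANK'])
    ddp_local_rank = int(os.environ['LOCAL_RANK'])
    ddp_world_size = int(os.environ['WORLD_SIZE'])
    device = f'cuda:{ddp_local_rank}'
    torch.cuda.set_device(device)
    master_process = ddp_rank == 0            # rank 0 handles logging
else:
    ddp_rank, ddp_local_rank, ddp_world_size = 0, 0, 1
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    master_process = True

model.to(device)                              # model is defined elsewhere
if ddp:
    model = DDP(model, device_ids=[ddp_local_rank])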
Issue: I can only run with 4 GPUs. When I try to use all 8 GPUs, I get this error:
ddp_world_size 8
[rank4]: Traceback (most recent call last):
[rank4]: File "/GPUFS/nsccgz_qylin_1/zt/gpt-dev/persona_gpt.py", line 48, in <module>
[rank4]: torch.cuda.set_device(device)
[rank4]: File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/cuda/__init__.py", line 399, in set_device
[rank4]: torch._C._cuda_setDevice(device)
[rank4]: RuntimeError: CUDA error: invalid device ordinal
[rank4]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank4]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank4]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
E0707 18:10:42.318000 47618031484288 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 4 (pid: 52464) of binary: /GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/bin/python
Traceback (most recent call last):
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.3.1', 'console_scripts', 'torchrun')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Failures:
[1]:
time : 2024-07-07_18:10:42
host : gpu72
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 52465)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-07-07_18:10:42
host : gpu72
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 52466)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-07-07_18:10:42
host : gpu72
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 52467)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-07-07_18:10:42
host : gpu72
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 52464)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Actually, my job is assigned only 4 GPUs even though there are 8 GPUs on the compute node. I ran the following code to get the number of CUDA devices available to me, and it outputs 4:
import torch
print('cuda device count', torch.cuda.device_count())
(nano-gpt) [nsccgz_qylin_1@ln102%tianhe2-K gpt-dev]$ yhrun -n 1 -N 1 -p GPU_A800 python test_gpt_count.py
cuda device count 4
I asked Bing and it pointed to a post saying I can set CUDA_VISIBLE_DEVICES to make all 8 GPUs visible. I then checked this environment variable and found it only contains 0,1,2,3, which is why I cannot use all 8 GPUs. Setting it fixes the problem:
(base) [nsccgz_qylin_1@gpu73%tianhe2-K gpt-dev]$ python test_gpt_count.py
cuda device count 4
(base) [nsccgz_qylin_1@gpu73%tianhe2-K gpt-dev]$ echo $CUDA_VISIBLE_DEVICES
0,1,2,3
(base) [nsccgz_qylin_1@gpu73%tianhe2-K gpt-dev]$ echo $CUDA_VISIBLE_DEVICES^C
(base) [nsccgz_qylin_1@gpu73%tianhe2-K gpt-dev]$ export CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES},4,5,6,7
Rewrite dataloader.
Now that multiple processes are used to drive multiple GPUs, I need to rewrite the dataloader so that it works with multiple processes.
Code: self.process_rank is the current process rank, and self.num_process is the total number of processes. self.current_idx is the index of the data the dataloader is currently reading. Note that each process trains on its own slice of the data, different from the other processes, so self.current_idx advances by B * T * self.num_process each time.
class DataLoader:
    def __init__(self, data, B, T, process_rank=0, num_process=1):
        self.B = B
        self.T = T
        self.process_rank = process_rank
        self.num_process = num_process
        # each process starts at its own offset in the token stream
        self.current_idx = self.B * self.T * self.process_rank
        self.data = data

    def __len__(self):
        return len(self.data)

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.data[self.current_idx:self.current_idx + B * T + 1]
        x = (buf[:-1]).view(B, T)  # inputs
        y = (buf[1:]).view(B, T)   # targets, shifted by one token
        # stride over the chunks consumed by all processes
        self.current_idx += B * T * self.num_process
        if self.current_idx + B * T * self.num_process + 1 > len(self.data):
            self.current_idx = self.B * self.T * self.process_rank
        return x, y
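Each rank then constructs its own loader with its rank and the world size; a usage sketch (train_data here stands for the tensor of training tokens, and batch_size / block_size for whatever values the config uses):

# every DDP process walks a disjoint, interleaved slice of the token stream
data_loader = DataLoader(train_data, B=batch_size, T=block_size,
                         process_rank=ddp_rank, num_process=ddp_world_size)
xb, yb = data_loader.next_batch()
xb, yb = xb.to(device), yb.to(device)   # move the batch to this rank's GPU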
It takes about twice as long to finish the job after switching to the new data loader, which is weird. I don't know why. Is it because of cache misses? I think it is. Token throughput after switching to the new dataloader:
vim slurm-2769516.out
m device cuda:0
step <built-in function iter>: train loss4.6247, val loss 4.6222
step 0, loss: 6.304970741271973, dt: 253517.79ms, tok/sec: 64.63
step <built-in function iter>: train loss2.7204, val loss 2.7159
step 500, loss: 2.7232675552368164, dt: 22.78ms, tok/sec: 719357.23
step <built-in function iter>: train loss2.6621, val loss 2.6616
Token throughput before switching to new dataloader
vim slurm-2769041.out
step <built-in function iter>: train loss4.6102, val loss 4.6073
step 0, loss: 6.303163528442383, dt: 105310.21ms, tok/sec: 155.58
step <built-in function iter>: train loss2.6170, val loss 2.6169
step 500, loss: 2.6428515911102295, dt: 22.64ms, tok/sec: 723576.18
It takes even longer after I move to(device) out of the dataloader:
step <built-in function iter>: train loss2.2721, val loss 2.2711
step 4000, loss: 2.2670536041259766, dt: 22.41ms, tok/sec: 730972.72
step <built-in function iter>: train loss1.9989, val loss 1.9980
step 4500, loss: 2.078113079071045, dt: 22.69ms, tok/sec: 722123.90
Time taken: 651.469042301178 seconds
Total parameters: 10921049
Trainable parameters: 10921049
the melorme slocites apctature to bayysic impessare to to obaytening ridence's, comuties the prayer'res of conngibuidiktle tooly soudiestinabo, creation, lov expertance entifuchitule, stluch pronuctingmat arous on vismok dailioout ateries tipl,, aphige sthatection of ecpppppivaclliency, powanct transps, and the owen, freper. Coulw fame inidtifinge for falstadry exertwing to socinigys the pecos, reame peound to its daman lawabyang ammbots'res coltual ra casergod asem, yucial gor mann textions mu
4 GPUs for DDP
Memory usage
(base) [nsccgz_qylin_1@gpu72%tianhe2-K ~]$ nvidia-smi
Wed Jul 10 17:18:59 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A800 80G... On | 00000000:4F:00.0 Off | 0 |
| N/A 45C P0 145W / 300W | 3975MiB / 81920MiB | 44% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A800 80G... On | 00000000:50:00.0 Off | 0 |
| N/A 48C P0 133W / 300W | 3979MiB / 81920MiB | 45% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A800 80G... On | 00000000:53:00.0 Off | 0 |
| N/A 47C P0 160W / 300W | 3979MiB / 81920MiB | 46% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A800 80G... On | 00000000:57:00.0 Off | 0 |
| N/A 49C P0 134W / 300W | 3959MiB / 81920MiB | 44% Default |
| | | Disabled |
Every 1.0s: nvidia-smi Wed Jul 10 17:37:07 2024
Wed Jul 10 17:37:07 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A800 80G... On | 00000000:4F:00.0 Off | 0 |
| N/A 32C P0 67W / 300W | 1117MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A800 80G... On | 00000000:50:00.0 Off | 0 |
| N/A 35C P0 69W / 300W | 1117MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A800 80G... On | 00000000:53:00.0 Off | 0 |
| N/A 34C P0 66W / 300W | 1117MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A800 80G... On | 00000000:57:00.0 Off | 0 |
| N/A 36C P0 69W / 300W | 1117MiB / 81920MiB | 0% Default |
| | | Disabled |
Memory usage gradually increases. Why is this?
Issue: getting an error where NCCL reports a heartbeat monitor timeout:
[rank0]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank0]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank0]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 0] [PG 0 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
E0710 20:29:48.177000 47209663704448 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 1882) of binary: /GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/bin/python
Traceback (most recent call last):
I thought this was because I did not add this line of code:
with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
But adding it does not fix the problem.
I am now trying to call destroy_process_group first and then run token generation, to fix the issue above. I should also decrease the number of training iterations so that I hit the error as early as possible.
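Roughly the order I am trying, as a sketch (call_generate_tokens is my helper; this is not the final fix):

# ... training loop finishes ...
if ddp:
    dist.destroy_process_group()      # tear down NCCL before sampling
if master_process:
    call_generate_tokens(model)       # then generate text on rank 0 only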
It takes 16 minutes to finish model training, which is not normal. I wonder why it takes so long after switching to the new dataloader. Could this be a research problem?
Why does it take so long to generate tokens after training is finished? Is it the decoding part or the model inference part?
Issue: I get this error saying that all tensors are expected to be on the same device:
File "/GPUFS/nsccgz_qylin_1/zt/gpt-dev/persona_gpt.py", line 292, in generate_tokens
logits, loss = model(idx_cond) # (B,T,vocab_size)
^^^^^^^^^^^^^^^
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/GPUFS/nsccgz_qylin_1/zt/gpt-dev/persona_gpt.py", line 255, in forward
tok_emb = self.token_embedding_table(idx) #(B,T,C)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 163, in forward
return F.embedding(
^^^^^^^^^^^^
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/nn/functional.py", line 2264, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
I don't know which tensor is not on CUDA, so I print the device of idx to check. The issue above is solved after I update the code like this:
idx = torch.zeros((1, 1), dtype=torch.long, device=device)
cuda:0 still uses a lot of memory while the other GPUs have freed theirs.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A800 80G... On | 00000000:4F:00.0 Off | 0 |
| N/A 36C P0 73W / 300W | 4867MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A800 80G... On | 00000000:50:00.0 Off | 0 |
| N/A 32C P0 47W / 300W | 27MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A800 80G... On | 00000000:53:00.0 Off | 0 |
| N/A 32C P0 44W / 300W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A800 80G... On | 00000000:57:00.0 Off | 0 |
| N/A 34C P0 47W / 300W | 7MiB / 81920MiB | 0% Default |
| | | Disabled |
It still cannot generate tokens.
I think this is because I did not call the model forward pass on all processes.
The code runs successfully once I call the forward pass on all processes when generating text.
It does not work even if I set model.eval() only for the master process. Why is that?
Here's the code that finishes successfully without any process hanging:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
for step in range(max_iters):
    t0 = time.time()
    # xb, yb = get_batch('train')
    xb, yb = data_loader.next_batch()
    xb, yb = xb.to(device), yb.to(device)
    B, T = xb.shape
    model.train()
    optimizer.zero_grad(set_to_none=True)
    # with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
    logits, loss = model(xb, yb)
    # if ddp:
    #     model.require_backward_grad_sync = True
    loss.backward()
    if ddp:
        dist.all_reduce(loss, op=dist.ReduceOp.AVG)
    optimizer.step()
    if device_type == "cuda":
        torch.cuda.synchronize()  # wait for the GPU to finish work
    t1 = time.time()
    token_processed = B * T * ddp_world_size
    # if step % eval_interval == 0:
    #     losses = estimate_loss()
    #     print(f"step {iter}: train loss{losses['train']:.4f}, val loss {losses['val']:.4f}")
    #     ({iter} here is the builtin, which is why earlier logs print "<built-in function iter>")
    dt = (t1 - t0) * 1000  # milliseconds
    token_per_sec = token_processed / (t1 - t0)
    call_generate_tokens(model)  # every process must run generation, otherwise DDP hangs
    if master_process:
        print(f'step {step}, loss: {loss.item()}, dt: {dt:.2f}ms, tok/sec: {token_per_sec:.2f}')
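Note that the dist.all_reduce(loss, op=dist.ReduceOp.AVG) above only averages the reported loss across ranks so the log line is meaningful; the actual gradient averaging is done by DDP itself during loss.backward().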
It looks like, from this discussion post, that I have to call forward and backward on all processes: https://discuss.pytorch.org/t/multiple-forward-functions-in-dp-and-ddp/135029/5
No DDP, 1 GPU + new data loader + TF32, no torch.compile
It’s pretty fast though.
Output:
step 4998, loss: 1.5543745756149292, dt: 44.11ms, tok/sec: 371476.71
idx device: cuda:0
rank: 0, broving and mited rocal hand alchiqe, Ond Wastestern have alter numpre, and society.
**Conclusion**
When 103, hith htreate analyzing into talent of the nets encerlated indituted to degensive sports and projote About to trace producial stratege. Os a care a coved legelad interacting harswe her in the intricacies, with history and winnlowedge in 140s has bhe Bavill –as cit) Éanta.
**Upproach Inper To Adaptation**
The commmal tOxerium boodls to dimerstry of pinemanical the industry to interes
step 4999, loss: 1.5449110269546509, dt: 44.11ms, tok/sec: 371424.51
Time taken: 301.4059376716614 seconds
Total parameters: 10921049
Trainable parameters: 10921049
DDP with 4 GPUs, no torch.compile, TF32
Output:
step 4998, loss: 1.2392717599868774, dt: 85.30ms, tok/sec: 768330.46
idx device:idx device:idx device:idx device: cuda:3cuda:0cuda:1cuda:2
rank: 1, Choroes, and formats. In this a
rank: 3, Choroes, and formats. In this a
rank: 2, Choroes, and formats. In this a
rank: 0, Choroes, and formats. In this a
step 4999, loss: 1.2737213373184204, dt: 86.10ms, tok/sec: 761164.76
Time taken: 442.67850971221924 seconds
Total parameters: 10921049
Trainable parameters: 10921049
I don't know how FSDP (fully sharded data parallel) works yet.