Fast nano-gpt training
In this blog post I document my code and experiment results while following Karpathy's latest GPT-2 training tutorial video:
https://youtu.be/l8pRSuU81PU?si=2sEmtmn56XBMPTbU
Code repo url: https://github.com/BilyZ98/nano-gpt
What is TFLOPS? Tera (10^12) floating-point operations per second.
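As a rough back-of-the-envelope check (my own estimate, using the common ≈6 FLOPs per parameter per token rule of thumb for forward + backward, and the parameter count and tok/sec figures logged later in this post):

# rough achieved-throughput estimate for this model; ignores attention FLOPs
n_params = 10_921_049            # total parameters reported by the training script
tok_per_sec = 542_632            # tok/sec from the TF32 run below
flops_per_token = 6 * n_params   # ~6 FLOPs per parameter per token (fwd + bwd)
print(f"~{flops_per_token * tok_per_sec / 1e12:.1f} TFLOP/s achieved")   # ~35.6

That is only a small fraction of an A800's tensor-core peak, which is expected for a model this small.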
Weight sharing
Share the weight of lm_head and the token embedding table (weight_token_embedding).
This weight sharing saves a huge amount of memory: the shared matrix has n_embd * vocab_size entries, so we avoid storing (and learning) it twice.
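A minimal sketch of the tying (assuming the model uses token_embedding_table and lm_head, as in the code later in this post; whether lm_head has a bias does not affect the tying):

# inside the model's __init__
self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
# weight sharing: both modules point at the same (vocab_size, n_embd) tensor
self.lm_head.weight = self.token_embedding_table.weight

For the GPT-2 124M config in the video this single matrix is 50257 x 768 ≈ 38.6M parameters, roughly 30% of the model, so sharing it is a big saving.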
Lower precision
Default tensor precision:
model device cuda:0
m device cuda:0
torch.float32
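For the TF32 runs below, TF32 matmuls are enabled with PyTorch's one-line global switch (this is the standard API; call it once before training starts):

# allow TensorFloat-32 tensor cores for float32 matmuls (Ampere and newer GPUs)
torch.set_float32_matmul_precision('high')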
Without TF32: Memory usage:
Every 1.0s: nvidia-smi Sat Jul 6 16:51:57 2024
Sat Jul 6 16:51:57 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A800 80G... On | 00000000:4F:00.0 Off | 0 |
| N/A 71C P0 309W / 300W | 5359MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A800 80G... On | 00000000:50:00.0 Off | 0 |
| N/A 29C P0 46W / 300W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
Output:
step <built-in function iter>: train loss1.1116, val loss 1.1166
step 4500, loss: 1.1580688953399658, dt: 113.35ms, tok/sec: 144546.60
Time taken: 739.9126691818237 seconds
Total parameters: 10921049
Trainable parameters: 10921049
ex is consists of centuries and want-wetlife science method ankners, from the ways wheredco by Benrys, where you're rate ar harm browling preservation in musicians. In Athletes have continuies to munically, and effects create cro-intricate jaz towantical navigating vail respectives to diseariety.
**Befor Hera's Players**
Prannit**
Infaming the game's early which commercial quirtual health, and mobile exammatring sound in football in pain this visualization involvement, is home to shapel key th
With TF32: Memory usage:
Every 0.1s: nvidia-smi Sat Jul 6 17:11:38 2024
Sat Jul 6 17:11:38 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A800 80G... On | 00000000:4F:00.0 Off | 0 |
| N/A 64C P0 309W / 300W | 5359MiB / 81920MiB | 99% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A800 80G... On | 00000000:50:00.0 Off | 0 |
| N/A 31C P0 47W / 300W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
Output
step 4998, loss: 1.1807901859283447, dt: 30.66ms, tok/sec: 534324.52
step 4999, loss: 1.1486560106277466, dt: 30.19ms, tok/sec: 542632.14
Time taken: 315.7020351886749 seconds
Total parameters: 10921049
Trainable parameters: 10921049
* Nuriety pattern to Play blogge: The misssiol strategies and transfolk to wo halley focused our work's own forecative ar unique planning. This articiples in the Laine Golden
The Host Potential's top init's experienterment and its particular type and regional warveillance. In this article, I'll shedd a fascinat words contributies and other athlete's early what draw from genre, col up jands, lim body eparthory, there ancient work of pairitization criticallysis.
**Addition Enthusiast Players**
Memory usage does not change much, but training is much faster: roughly 113 ms/step with FP32 versus 30 ms/step with TF32, i.e. about 3.7x the token throughput.
With BF16: Memory usage:
Every 0.1s: nvidia-smi Sat Jul 6 17:19:50 2024
Sat Jul 6 17:19:50 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A800 80G... On | 00000000:4F:00.0 Off | 0 |
| N/A 61C P0 280W / 300W | 5493MiB / 81920MiB | 98% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A800 80G... On | 00000000:50:00.0 Off | 0 |
| N/A 32C P0 47W / 300W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
Output:
step <built-in function iter>: train loss1.1083, val loss 1.1138
step 4500, loss: 1.149060606956482, dt: 46.36ms, tok/sec: 353409.81
Time taken: 323.1115171909332 seconds
Total parameters: 10921049
Trainable parameters: 10921049
* Thre-known variety is Mobile time forishing and a Piece**
Lerious la trend ammach focusion solus, the ways can date on how theselves presents, and cleaning ung time. While the Long-Anti Central to mitigate efforts, referred time
1. The Ritowastic Qrench Literian ass make-had a numberous unfounded women's explore a court of texture or the engage's earching try of from generating secrets. This benefit for a deterrmining movie excebrations, which its form its applyestriants. By regulating the
BF16 is not faster than TF32 here. Why is that?
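For reference, the BF16 run only wraps the forward pass and loss computation in autocast; a sketch of the relevant part of the training step (parameters and optimizer state stay in float32):

optimizer.zero_grad(set_to_none=True)
# run matmuls in bfloat16 while keeping master weights in float32
with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
    logits, loss = model(xb, yb)
loss.backward()
optimizer.step()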
+ torch.compile()
I got this error at runtime, so I downgraded the Python version to 3.11:
Traceback (most recent call last):
File "/GPUFS/nsccgz_qylin_1/zt/gpt-dev/persona_gpt.py", line 236, in <module>
model = torch.compile(model)
^^^^^^^^^^^^^^^^^^^^
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.12/site-packages/torch/__init__.py", line 1868, in compile
raise RuntimeError("Dynamo is not supported on Python 3.12+")
RuntimeError: Dynamo is not supported on Python 3.12+
Then I got another error:
/tmp/tmpd5t0oroc/main.c: In function ‘list_to_cuuint64_array’:
/tmp/tmpd5t0oroc/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
for (Py_ssize_t i = 0; i < len; i++) {
^
/tmp/tmpd5t0oroc/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code
/tmp/tmpd5t0oroc/main.c: In function ‘list_to_cuuint32_array’:
/tmp/tmpd5t0oroc/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
for (Py_ssize_t i = 0; i < len; i++) {
^
/tmp/tmp0td94o5b/main.c: In function ‘list_to_cuuint64_array’:
/tmp/tmp0td94o5b/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
for (Py_ssize_t i = 0; i < len; i++) {
^
I installed the latest GCC version (the old one was 4.8.5), which fixed the error above.
Output
step <built-in function iter>: train loss1.1037, val loss 1.1087
step 4500, loss: 1.151484727859497, dt: 23.18ms, tok/sec: 706728.75
Time taken: 285.365624666214 seconds
Total parameters: 10921049
Trainable parameters: 10921049
emphasizing the parming of a spirit on the development conouches tended from the factor, such as jailust power traffice, and ixo, these can serve the alternatively takes part in processed.
**4. Inform 4applications Movements**
1. **Mush Secrets:** Students in Parleers (airl Warrioring change: Bhern Keshedmn) world, as initial do other basebar from leagues, to employ create messaters and their immpact on our camido opportunities. Its gain political windows, proteins and resoling affian languag
+ Flash attention
Karpathy gives a brief introduction to flash attention in his video; check out this part to learn more: https://youtu.be/l8pRSuU81PU?t=7521
Code:
class Head(nn.Module):
    """One head of self-attention"""

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)   # (B, T, head_size)
        q = self.query(x) # (B, T, head_size)
        v = self.value(x) # (B, T, head_size)
        # manual attention, replaced by flash attention below:
        # wei = q @ k.transpose(-2, -1) * C ** -0.5
        # wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        # wei = F.softmax(wei, dim=-1) # (B, T, T)
        # wei = self.dropout(wei)
        # out = wei @ v # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return out
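One small difference from the commented-out manual path: dropout is no longer applied to the attention weights. If that matters, F.scaled_dot_product_attention also accepts a dropout_p argument (e.g. dropout_p=dropout if self.training else 0.0). With is_causal=True the causal masking, softmax, and matmuls are fused into a single flash-attention kernel, which is where the speedup comes from.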
Output
step <built-in function iter>: train loss1.3369, val loss 1.3376
step 4500, loss: 1.3740415573120117, dt: 22.38ms, tok/sec: 731945.94
Time taken: 288.25680232048035 seconds
Total parameters: 10921049
Trainable parameters: 10921049
facilitation slowing technique it.
In conclusion, is jilitary Ketins and scale, coreting the programphs of cocentries, many only soidhts:
1. **Ingluentalle performances**: Epchilitations and dinactions has functively cardolles experience witting, significantlyqtually.
3. **Pobitice, SStédio micrositiona Ristice**: The Presed full fame leintarians foreices, day exemisions and community to provide a more endaging of his play byautices. The New clear Ara Case gor as mains, Aheragement textures th
Throughput increases by about 4% compared to the previous version (≈707k → ≈732k tok/sec), which is not a lot. I guess this is because the model is not big enough to benefit from flash attention.
Use distributed data parallel (DDP)
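Here is roughly the DDP setup I use, condensed into a sketch (it follows the torchrun convention from the video; RANK / LOCAL_RANK / WORLD_SIZE are the environment variables torchrun sets, and the variable names match the ones that appear in the code and logs later in this post):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

ddp = int(os.environ.get('RANK', -1)) != -1   # are we launched by torchrun?
if ddp:
    dist.init_process_group(backend='nccl')
    ddp_rank = int(os.environ['RANK'])
    ddp_local_rank = int(os.environ['LOCAL_RANK'])
    ddp_world_size = int(os.environ['WORLD_SIZE'])
    device = f'cuda:{ddp_local_rank}'
    torch.cuda.set_device(device)
    master_process = ddp_rank == 0            # rank 0 handles logging
else:
    ddp_rank, ddp_local_rank, ddp_world_size = 0, 0, 1
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    master_process = True

model.to(device)                              # model is defined elsewhere
if ddp:
    model = DDP(model, device_ids=[ddp_local_rank])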
Issue: I can only run with 4 GPUs. When I try to use all 8 GPUs, I get this error:
ddp_world_size 8
[rank4]: Traceback (most recent call last):
[rank4]: File "/GPUFS/nsccgz_qylin_1/zt/gpt-dev/persona_gpt.py", line 48, in <module>
[rank4]: torch.cuda.set_device(device)
[rank4]: File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/cuda/__init__.py", line 399, in set_device
[rank4]: torch._C._cuda_setDevice(device)
[rank4]: RuntimeError: CUDA error: invalid device ordinal
[rank4]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank4]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank4]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
E0707 18:10:42.318000 47618031484288 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 4 (pid: 52464) of binary: /GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/bin/python
Traceback (most recent call last):
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.3.1', 'console_scripts', 'torchrun')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Failures:
[1]:
time : 2024-07-07_18:10:42
host : gpu72
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 52465)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-07-07_18:10:42
host : gpu72
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 52466)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-07-07_18:10:42
host : gpu72
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 52467)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-07-07_18:10:42
host : gpu72
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 52464)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Actually, my job is assigned only 4 GPUs even though there are 8 GPUs on the compute node. I ran the following code to get the number of CUDA devices available to me, and it outputs 4:
import torch
print('cuda device count', torch.cuda.device_count())
(nano-gpt) [nsccgz_qylin_1@ln102%tianhe2-K gpt-dev]$ yhrun -n 1 -N 1 -p GPU_A800 python test_gpt_count.py
cuda device count 4
I asked Bing and it pointed to a post saying I can set CUDA_VISIBLE_DEVICES to make all 8 GPUs visible. I then checked this environment variable and found it only contains 0,1,2,3, which is why I cannot use all 8 GPUs. Setting it fixes the problem:
(base) [nsccgz_qylin_1@gpu73%tianhe2-K gpt-dev]$ python test_gpt_count.py
cuda device count 4
(base) [nsccgz_qylin_1@gpu73%tianhe2-K gpt-dev]$ echo $CUDA_VISIBLE_DEVICES
0,1,2,3
(base) [nsccgz_qylin_1@gpu73%tianhe2-K gpt-dev]$ echo $CUDA_VISIBLE_DEVICES^C
(base) [nsccgz_qylin_1@gpu73%tianhe2-K gpt-dev]$ export CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES},4,5,6,7
Rewrite dataloader.
Now that multiple processes are used to drive multiple GPUs, I need to rewrite the dataloader so that it works with multiple processes.
Code: self.process_rank is the current process rank, and self.num_process is the total number of processes. self.current_idx is the index of the data the dataloader is currently reading. Note that each process trains on its own slice of the data, different from the other processes, so self.current_idx advances by B * T * self.num_process each time.
class DataLoader:
    def __init__(self, data, B, T, process_rank=0, num_process=1):
        self.B = B
        self.T = T
        self.process_rank = process_rank
        self.num_process = num_process
        # each process starts at its own offset in the token stream
        self.current_idx = self.B * self.T * self.process_rank
        self.data = data

    def __len__(self):
        return len(self.data)

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.data[self.current_idx:self.current_idx + B * T + 1]
        x = (buf[:-1]).view(B, T)  # inputs
        y = (buf[1:]).view(B, T)   # targets, shifted by one token
        # stride over the chunks consumed by all processes
        self.current_idx += B * T * self.num_process
        if self.current_idx + B * T * self.num_process + 1 > len(self.data):
            self.current_idx = self.B * self.T * self.process_rank
        return x, y
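Each rank then constructs its own loader with its rank and the world size; a usage sketch (train_data here stands for the tensor of training tokens, and batch_size / block_size for whatever values the config uses):

# every DDP process walks a disjoint, interleaved slice of the token stream
data_loader = DataLoader(train_data, B=batch_size, T=block_size,
                         process_rank=ddp_rank, num_process=ddp_world_size)
xb, yb = data_loader.next_batch()
xb, yb = xb.to(device), yb.to(device)   # move the batch to this rank's GPU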
It takes about twice as long to finish the job after switching to the new data loader, which is weird. I don't know why. Is it because of cache misses? I think it is. Token throughput after switching to the new dataloader:
vim slurm-2769516.out
m device cuda:0
step <built-in function iter>: train loss4.6247, val loss 4.6222
step 0, loss: 6.304970741271973, dt: 253517.79ms, tok/sec: 64.63
step <built-in function iter>: train loss2.7204, val loss 2.7159
step 500, loss: 2.7232675552368164, dt: 22.78ms, tok/sec: 719357.23
step <built-in function iter>: train loss2.6621, val loss 2.6616
Token throughput before switching to new dataloader
vim slurm-2769041.out
step <built-in function iter>: train loss4.6102, val loss 4.6073
step 0, loss: 6.303163528442383, dt: 105310.21ms, tok/sec: 155.58
step <built-in function iter>: train loss2.6170, val loss 2.6169
step 500, loss: 2.6428515911102295, dt: 22.64ms, tok/sec: 723576.18
It takes even longer after I move to(device) out of the dataloader:
step <built-in function iter>: train loss2.2721, val loss 2.2711
step 4000, loss: 2.2670536041259766, dt: 22.41ms, tok/sec: 730972.72
step <built-in function iter>: train loss1.9989, val loss 1.9980
step 4500, loss: 2.078113079071045, dt: 22.69ms, tok/sec: 722123.90
Time taken: 651.469042301178 seconds
Total parameters: 10921049
Trainable parameters: 10921049
the melorme slocites apctature to bayysic impessare to to obaytening ridence's, comuties the prayer'res of conngibuidiktle tooly soudiestinabo, creation, lov expertance entifuchitule, stluch pronuctingmat arous on vismok dailioout ateries tipl,, aphige sthatection of ecpppppivaclliency, powanct transps, and the owen, freper. Coulw fame inidtifinge for falstadry exertwing to socinigys the pecos, reame peound to its daman lawabyang ammbots'res coltual ra casergod asem, yucial gor mann textions mu
4 GPUs for DDP
Memory usage
(base) [nsccgz_qylin_1@gpu72%tianhe2-K ~]$ nvidia-smi
Wed Jul 10 17:18:59 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A800 80G... On | 00000000:4F:00.0 Off | 0 |
| N/A 45C P0 145W / 300W | 3975MiB / 81920MiB | 44% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A800 80G... On | 00000000:50:00.0 Off | 0 |
| N/A 48C P0 133W / 300W | 3979MiB / 81920MiB | 45% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A800 80G... On | 00000000:53:00.0 Off | 0 |
| N/A 47C P0 160W / 300W | 3979MiB / 81920MiB | 46% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A800 80G... On | 00000000:57:00.0 Off | 0 |
| N/A 49C P0 134W / 300W | 3959MiB / 81920MiB | 44% Default |
| | | Disabled |
Every 1.0s: nvidia-smi Wed Jul 10 17:37:07 2024
Wed Jul 10 17:37:07 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A800 80G... On | 00000000:4F:00.0 Off | 0 |
| N/A 32C P0 67W / 300W | 1117MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A800 80G... On | 00000000:50:00.0 Off | 0 |
| N/A 35C P0 69W / 300W | 1117MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A800 80G... On | 00000000:53:00.0 Off | 0 |
| N/A 34C P0 66W / 300W | 1117MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A800 80G... On | 00000000:57:00.0 Off | 0 |
| N/A 36C P0 69W / 300W | 1117MiB / 81920MiB | 0% Default |
| | | Disabled |
Memory usage gradually increases. Why is this?
Issue: getting an error where NCCL reports a heartbeat monitor timeout:
[rank0]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank0]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank0]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 0] [PG 0 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
E0710 20:29:48.177000 47209663704448 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 1882) of binary: /GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/bin/python
Traceback (most recent call last):
I thought this was because I did not add this line of code:
with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
But adding it does not fix the problem.
I am now trying to call destroy_process_group first and then run token generation, to fix the issue above. I should also decrease the number of training iterations so that I hit the error as early as possible.
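Roughly the order I am trying, as a sketch (call_generate_tokens is my helper; this is not the final fix):

# ... training loop finishes ...
if ddp:
    dist.destroy_process_group()      # tear down NCCL before sampling
if master_process:
    call_generate_tokens(model)       # then generate text on rank 0 only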
It takes 16 minutes to finish model training, which is not normal. I wonder why it takes so long after switching to the new dataloader. Could this be a research problem?
Why does it take so long to generate tokens after training is finished? Is it the decoding part or the model inference part?
Issue: I get this error saying that all tensors are expected to be on the same device:
File "/GPUFS/nsccgz_qylin_1/zt/gpt-dev/persona_gpt.py", line 292, in generate_tokens
logits, loss = model(idx_cond) # (B,T,vocab_size)
^^^^^^^^^^^^^^^
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/GPUFS/nsccgz_qylin_1/zt/gpt-dev/persona_gpt.py", line 255, in forward
tok_emb = self.token_embedding_table(idx) #(B,T,C)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 163, in forward
return F.embedding(
^^^^^^^^^^^^
File "/GPUFS/nsccgz_qylin_1/miniconda3/envs/nano-gpt/lib/python3.11/site-packages/torch/nn/functional.py", line 2264, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
I don't know which tensor is not on CUDA, so I print the device of idx to check. The issue above is solved after I update the code like this:
idx = torch.zeros((1, 1), dtype=torch.long, device=device)
cuda:0 still uses a lot of memory while the other GPUs have freed theirs.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A800 80G... On | 00000000:4F:00.0 Off | 0 |
| N/A 36C P0 73W / 300W | 4867MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A800 80G... On | 00000000:50:00.0 Off | 0 |
| N/A 32C P0 47W / 300W | 27MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A800 80G... On | 00000000:53:00.0 Off | 0 |
| N/A 32C P0 44W / 300W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A800 80G... On | 00000000:57:00.0 Off | 0 |
| N/A 34C P0 47W / 300W | 7MiB / 81920MiB | 0% Default |
| | | Disabled |
It still cannot generate tokens.
I think this is because I did not call the model forward pass on all processes.
The code runs successfully once I call the forward pass on all processes when generating text.
It does not work even if I set model.eval() only for the master process. Why is that?
Here's the code that finishes successfully without any process hanging:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
for step in range(max_iters):
    t0 = time.time()
    # xb, yb = get_batch('train')
    xb, yb = data_loader.next_batch()
    xb, yb = xb.to(device), yb.to(device)
    B, T = xb.shape
    model.train()
    optimizer.zero_grad(set_to_none=True)
    # with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
    logits, loss = model(xb, yb)
    # if ddp:
    #     model.require_backward_grad_sync = True
    loss.backward()
    if ddp:
        dist.all_reduce(loss, op=dist.ReduceOp.AVG)
    optimizer.step()
    if device_type == "cuda":
        torch.cuda.synchronize()  # wait for the GPU to finish work
    t1 = time.time()
    token_processed = B * T * ddp_world_size
    # if step % eval_interval == 0:
    #     losses = estimate_loss()
    #     print(f"step {iter}: train loss{losses['train']:.4f}, val loss {losses['val']:.4f}")
    #     ({iter} here is the builtin, which is why earlier logs print "<built-in function iter>")
    dt = (t1 - t0) * 1000  # milliseconds
    token_per_sec = token_processed / (t1 - t0)
    call_generate_tokens(model)  # every process must run generation, otherwise DDP hangs
    if master_process:
        print(f'step {step}, loss: {loss.item()}, dt: {dt:.2f}ms, tok/sec: {token_per_sec:.2f}')
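Note that the dist.all_reduce(loss, op=dist.ReduceOp.AVG) above only averages the reported loss across ranks so the log line is meaningful; the actual gradient averaging is done by DDP itself during loss.backward().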
It looks like, from this discussion post, that I have to call forward and backward on all processes: https://discuss.pytorch.org/t/multiple-forward-functions-in-dp-and-ddp/135029/5
No DDP, 1 GPU + new data loader + TF32, no torch.compile
It’s pretty fast though.
Output:
step 4998, loss: 1.5543745756149292, dt: 44.11ms, tok/sec: 371476.71
idx device: cuda:0
rank: 0, broving and mited rocal hand alchiqe, Ond Wastestern have alter numpre, and society.
**Conclusion**
When 103, hith htreate analyzing into talent of the nets encerlated indituted to degensive sports and projote About to trace producial stratege. Os a care a coved legelad interacting harswe her in the intricacies, with history and winnlowedge in 140s has bhe Bavill –as cit) Éanta.
**Upproach Inper To Adaptation**
The commmal tOxerium boodls to dimerstry of pinemanical the industry to interes
step 4999, loss: 1.5449110269546509, dt: 44.11ms, tok/sec: 371424.51
Time taken: 301.4059376716614 seconds
Total parameters: 10921049
Trainable parameters: 10921049
DDP with 4 GPUs, no torch.compile, TF32
Output:
step 4998, loss: 1.2392717599868774, dt: 85.30ms, tok/sec: 768330.46
idx device:idx device:idx device:idx device: cuda:3cuda:0cuda:1cuda:2
rank: 1, Choroes, and formats. In this a
rank: 3, Choroes, and formats. In this a
rank: 2, Choroes, and formats. In this a
rank: 0, Choroes, and formats. In this a
step 4999, loss: 1.2737213373184204, dt: 86.10ms, tok/sec: 761164.76
Time taken: 442.67850971221924 seconds
Total parameters: 10921049
Trainable parameters: 10921049
I don't know how FSDP (fully sharded data parallel) works yet.