llm.c

roadmap

  • Running llm.c
  • Running llm.c with CUDA
  • Multiple GPUs
  • Inference with FP16
  • Inference with vLLM
  • Try other inference acceleration techniques
  • Submit a PR to check the existence of OpenMPI via a user-specified path (can use conda as an example)
  • Submit a PR to check the curl result when the proxy server returns a 503 error

Running llm.c

https://github.com/karpathy/llm.c

GPU

Had an issue running on GPU: my machine only has CUDA 11.2, but torch 2.1.0 got installed, which requires CUDA 12.0.

Solution: manually pin torch==1.3.1.

Got this error:

yhrun -n 4 -p gpu_v100 python train_gpt2.py
Traceback (most recent call last):
  File "train_gpt2.py", line 24, in <module>
    from contextlib import nullcontext
ImportError: cannot import name 'nullcontext'
Traceback (most recent call last):
  File "train_gpt2.py", line 24, in <module>
    from contextlib import nullcontext
ImportError: cannot import name 'nullcontext'
Traceback (most recent call last):
  File "train_gpt2.py", line 24, in <module>
    from contextlib import nullcontext
ImportError: cannot import name 'nullcontext'
Traceback (most recent call last):
  File "train_gpt2.py", line 24, in <module>
    from contextlib import nullcontext
ImportError: cannot import name 'nullcontext'
yhrun: error: gpu55: tasks 0-3: Exited with exit code 1

The issue is that nullcontext was introduced in Python 3.7, so I need to upgrade my Python version.

Still can't solve the problem above, because I can't add a new module to the existing module list.

Currently Loaded Modulefiles:
 1) proxy/1.0   2) CUDA/10.0   3) cudnn/7.6.4-CUDA10.0   4) PyTorch/1.2.0-CUDA10.0-py3.6

 $ yhrun -n 4 -p gpu_v100 python train_gpt2.py
Traceback (most recent call last):
  File "train_gpt2.py", line 24, in <module>
    from contextlib import nullcontext
ImportError: cannot import name 'nullcontext'
Traceback (most recent call last):
  File "train_gpt2.py", line 24, in <module>
    from contextlib import nullcontext
ImportError: cannot import name 'nullcontext'
Traceback (most recent call last):
  File "train_gpt2.py", line 24, in <module>
    from contextlib import nullcontext
ImportError: cannot import name 'nullcontext'
Traceback (most recent call last):
  File "train_gpt2.py", line 24, in <module>
    from contextlib import nullcontext
ImportError: cannot import name 'nullcontext'

My friend told me that I can just use conda to create a new environment, ssh to the compute node, activate the conda environment there, and then run the training process.

This means the compute node shares the same file system as the login node, while each node runs its own operating system instance (each has its own hostname).

Learn something new every day.

Here are all the nodes available to me.

Karpathy has updated the GPT-2 parameter download script, so now I can download the parameters via a shell script.

Issue: can't connect to Hugging Face through the proxy to download the pretrained model.

(llmc) [nsccgz_qylin_1@ln102%tianhe2-K llm.c]$ curl -v https://huggingface.co
* About to connect() to proxy 10.20.18.21 port 3128 (#0)
*   Trying 10.20.18.21...
* Connected to 10.20.18.21 (10.20.18.21) port 3128 (#0)
* Establish HTTP proxy tunnel to huggingface.co:443
> CONNECT huggingface.co:443 HTTP/1.1
> Host: huggingface.co:443
> User-Agent: curl/7.29.0
> Proxy-Connection: Keep-Alive
>
< HTTP/1.1 503 Service Unavailable
< Proxy-Agent: gost/2.11.1
< Content-Length: 0
<
* Received HTTP code 503 from proxy after CONNECT
* Connection #0 to host 10.20.18.21 left intact
curl: (56) Received HTTP code 503 from proxy after CONNECT

Solution: I decided to download the files on my local laptop and then upload the model parameter files to the GPU nodes.

chmod u+x ./dev/download_starter_pack.sh
./dev/download_starter_pack.sh
make train_gpt2fp32cu
./train_gpt2fp32cu

CUDA env:

Currently Loaded Modulefiles:
 1) proxy/1.0   2) python/3.6.7_anaconda3   3) CUDA/11.2   4) gmp/4.2.4   5) mpfr/2.4.2   6) mpc/0.8.1   7) gcc/9.2.0

Output:

step   61/74: train loss 3.213066 (312.014672 ms, 13127 tok/s)
step   62/74: train loss 3.450736 (314.262273 ms, 13033 tok/s)
step   63/74: train loss 3.370245 (315.130342 ms, 12997 tok/s)
step   64/74: train loss 3.407992 (316.778140 ms, 12930 tok/s)
step   65/74: train loss 3.580323 (315.324538 ms, 12989 tok/s)
step   66/74: train loss 3.029552 (317.274858 ms, 12909 tok/s)
step   67/74: train loss 3.296448 (317.588671 ms, 12897 tok/s)
step   68/74: train loss 3.675703 (314.929981 ms, 13006 tok/s)
step   69/74: train loss 3.297087 (313.282229 ms, 13074 tok/s)
step   70/74: train loss 3.646337 (315.271277 ms, 12991 tok/s)
step   71/74: train loss 3.566427 (316.123225 ms, 12956 tok/s)
step   72/74: train loss 3.732521 (315.446478 ms, 12984 tok/s)
step   73/74: train loss 3.825229 (318.325142 ms, 12867 tok/s)
step   74/74: train loss 3.380326 (318.066751 ms, 12877 tok/s)
val loss 3.491223
generating:
---
BUCKINGHAM:
But of my penitent ambition
Rome Slicom against Reimy, justice about him!
In case the witness should speak with joy:
Shall now that by these dwelling House,
Suspicions are declaim'd of the Albanian king.
Go
---
total average iteration time: 312.354733 ms

Multiple GPUs

Run with MPI. I don't know how MPI works internally, but I will just start using it to train the model.

I will learn the internals later.

For now I just log in to a GPU node and run the following:

make train_gpt2cu
mpirun -np <number of GPUs> ./train_gpt2cu

Issue: failed to compile with OpenMPI. The HPC cluster I use has the OpenMPI library installed in a non-standard directory.

Here's the relevant part of the Makefile in llm.c:

ifeq ($(NO_MULTI_GPU), 1)
  $(info  Multi-GPU (OpenMPI + NCCL) is manually disabled)
else
  ifneq ($(OS), Windows_NT)
    # Detect if running on macOS or Linux
    ifeq ($(SHELL_UNAME), Darwin)
      $(info  Multi-GPU on CUDA on Darwin is not supported, skipping OpenMPI + NCCL support)
    else ifeq ($(shell [ -d /usr/lib/x86_64-linux-gnu/openmpi/lib/ ] && [ -d /usr/lib/x86_64-linux-gnu/openmpi/include/ ] && echo "exists"), exists)
      $(info  OpenMPI found, OK to train with multiple GPUs)
      NVCC_INCLUDES += -I/usr/lib/x86_64-linux-gnu/openmpi/include
      NVCC_LDFLAGS += -L/usr/lib/x86_64-linux-gnu/openmpi/lib/
      NVCC_LDLIBS += -lmpi -lnccl
      NVCC_FLAGS += -DMULTI_GPU
    else
      $(info  OpenMPI is not found, disabling multi-GPU support)
      $(info ---> On Linux you can try install OpenMPI with `sudo apt install openmpi-bin openmpi-doc libopenmpi-dev`)
    endif
  endif
endif

It checks for the OpenMPI library in /usr/lib/x86_64-linux-gnu/openmpi/lib/, but on my HPC cluster the OpenMPI library is at `~/local/lib/`. Should I raise a PR?
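A minimal sketch of what that PR could look like: an overridable OPENMPI_DIR make variable (a name I made up; it is not in the current Makefile), so that `make train_gpt2cu OPENMPI_DIR=$HOME/local` or `OPENMPI_DIR=$CONDA_PREFIX` would both work:

# Sketch only: let the user override where OpenMPI lives.
OPENMPI_DIR ?= /usr/lib/x86_64-linux-gnu/openmpi
ifeq ($(shell [ -d $(OPENMPI_DIR)/lib ] && [ -d $(OPENMPI_DIR)/include ] && echo "exists"), exists)
  $(info  OpenMPI found at $(OPENMPI_DIR), OK to train with multiple GPUs)
  NVCC_INCLUDES += -I$(OPENMPI_DIR)/include
  NVCC_LDFLAGS += -L$(OPENMPI_DIR)/lib
  NVCC_LDLIBS += -lmpi -lnccl
  NVCC_FLAGS += -DMULTI_GPU
else
  $(info  OpenMPI is not found, disabling multi-GPU support)
endif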

Issue: got compilation errors when linking NCCL.

/GPUFS/app_GPU/compiler/CUDA/11.2.0/bin/nvcc -O3 -t=0 --use_fast_math -std=c++17 -DMULTI_GPU -DENABLE_BF16 train_gpt2.cu -lcublas -lcublasLt -L~/local/lib/  -I~/local/include/   -lmpi -lnccl -o train_gpt2cu
llmc/zero.cuh(28): error: identifier "ncclBfloat16" is undefined

llmc/zero.cuh(209): error: identifier "ncclAvg" is undefined

llmc/zero.cuh(219): error: identifier "ncclAvg" is undefined

3 errors detected in the compilation of "train_gpt2.cu".
make: *** [train_gpt2cu] Error 255

I have loaded the nccl module but I still get this error, and I don't know how to fix it. Let me try to compile train_gpt2fp32cu instead:

module load  CUDA/11.2
module load  gcc/9.2.0
#module load  openmpi/1.10.2-pgi-17.1
module load openmpi/3.1.4-icc-18.0.1
module load  nccl/2.9.9-1-cuda-11.0
module list
which nvcc
pushd llm.c
#make train_gpt2cu
make train_gpt2fp32cu
mpirun -np 2 ./train_gpt2fp32cu
popd

Got an out-of-memory error when running:

| num_parameters        | 124475904                                          |
+-----------------------+----------------------------------------------------+
| max_sequence_length T | 1024                                               |
| vocab_size V          | 50257                                              |
| padded_vocab_size Vp  | 50304                                              |
| num_layers L          | 12                                                 |
| num_heads NH          | 12                                                 |
| channels C            | 768                                                |
| num_parameters        | 124475904                                          |
+-----------------------+----------------------------------------------------+
| train_num_batches     | 74                                                 |
| val_num_batches       | 8                                                  |
+-----------------------+----------------------------------------------------+
allocated 474 MiB for model parameters
| train_num_batches     | 74                                                 |
| val_num_batches       | 8                                                  |
+-----------------------+----------------------------------------------------+
allocated 474 MiB for model parameters
allocated 5706 MiB for activations
allocated 5706 MiB for activations
val loss 4.513921
val loss 4.513921
allocated 474 MiB for parameter gradients
allocated 252 MiB for activation gradients
allocated 474 MiB for AdamW optimizer state m
allocated 474 MiB for AdamW optimizer state v
allocated 474 MiB for parameter gradients
allocated 252 MiB for activation gradients
[CUDA ERROR] at file train_gpt2_fp32.cu:1443:
out of memory

I am not familiar with how CUDA works with multiple GPUs during training.

Should I learn a bit more about how to use multiple GPUs for computation when working with CUDA?

Why do we have to use MPI to run with multiple GPUs?

Let's check whether the single-GPU code actually uses a single GPU.

There is only one GPU busy when training with a single GPU, and it works pretty well. So I want to know how to use multiple GPUs to train the model.

(base) [nsccgz_qylin_1@gpu29%tianhe2-K zt]$ nvidia-smi
Fri Jun 21 12:17:35 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   62C    P0   276W / 300W |   8354MiB / 16160MiB |     98%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:8B:00.0 Off |                    0 |
| N/A   32C    P0    38W / 300W |      3MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:B3:00.0 Off |                    0 |
| N/A   31C    P0    37W / 300W |      2MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:B4:00.0 Off |                    0 |
| N/A   31C    P0    37W / 300W |      3MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    153170      C   ./train_gpt2fp32cu               8351MiB |
+-----------------------------------------------------------------------------+

Here's the code that initializes the multi-GPU training config in train_gpt2.cu:

MultiGpuConfig multi_gpu_config_init(int *argc, char ***argv) {
#ifdef MULTI_GPU
    // Initialize MPI.
    MultiGpuConfig result;
    mpiCheck(MPI_Init(argc, argv));
    mpiCheck(MPI_Comm_rank(MPI_COMM_WORLD, &result.process_rank));
    mpiCheck(MPI_Comm_size(MPI_COMM_WORLD, &result.num_processes));
    result.local_device_idx = multi_gpu_get_local_device_idx(result.process_rank, result.num_processes);
    cudaCheck(cudaSetDevice(result.local_device_idx));
    ncclUniqueId nccl_id;
    if (result.process_rank == 0) {
        ncclCheck(ncclGetUniqueId(&nccl_id));
    }
    mpiCheck(MPI_Bcast((void *)&nccl_id, sizeof(nccl_id), MPI_BYTE, 0, MPI_COMM_WORLD));
    ncclCheck(ncclCommInitRank(&result.nccl_comm, result.num_processes, nccl_id, result.process_rank));
    cudaCheck(cudaStreamCreate(&result.nccl_stream));
    // event without timing for maximum performance
    cudaCheck(cudaEventCreate(&result.compute_nccl_sync, cudaEventDisableTiming));
    nvtxNameCudaStreamA(result.nccl_stream, "nccl stream");
    nvtxNameCudaEventA(result.compute_nccl_sync, "nccl compute sync");
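    // ... (excerpt truncated; the function goes on to return the config,
    // and falls back to single-GPU defaults when MULTI_GPU is not defined)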

I asked Bing Chat to explain this code. This code initializes a multi-GPU configuration using MPI (Message Passing Interface), CUDA, and NCCL (NVIDIA Collective Communications Library). Here's a breakdown of what the code does:

  1. MultiGpuConfig result;: This line initializes a MultiGpuConfig object named result.

  2. mpiCheck(MPI_Init(argc, argv));: This line initializes the MPI environment.

  3. mpiCheck(MPI_Comm_rank(MPI_COMM_WORLD, &result.process_rank));: This line determines the rank of the calling process in the communicator.

  4. mpiCheck(MPI_Comm_size(MPI_COMM_WORLD, &result.num_processes));: This line determines the number of processes in the communicator.

  5. result.local_device_idx = multi_gpu_get_local_device_idx(result.process_rank, result.num_processes);: This line gets the local device index based on the process rank and the number of processes.

  6. cudaCheck(cudaSetDevice(result.local_device_idx));: This line sets the device to be used for GPU executions.

  7. ncclUniqueId nccl_id;: This line declares a unique NCCL ID.

  8. if (result.process_rank == 0) { ncclCheck(ncclGetUniqueId(&nccl_id)); }: This line generates a new NCCL unique ID and assigns it to nccl_id if the process rank is 0.

  9. mpiCheck(MPI_Bcast((void *)&nccl_id, sizeof(nccl_id), MPI_BYTE, 0, MPI_COMM_WORLD));: This line broadcasts the NCCL unique ID from the root process (process with rank 0) to all other processes in the communicator.

  10. ncclCheck(ncclCommInitRank(&result.nccl_comm, result.num_processes, nccl_id, result.process_rank));: This line initializes the NCCL communicator according to the rank, the number of processes, and the unique NCCL ID.

  11. cudaCheck(cudaStreamCreate(&result.nccl_stream));: This line creates a new asynchronous stream.

  12. cudaCheck(cudaEventCreate(&result.compute_nccl_sync, cudaEventDisableTiming));: This line creates an event with no timing data. This is used for synchronization purposes.

  13. nvtxNameCudaStreamA(result.nccl_stream, "nccl stream");: This line names the CUDA stream “nccl stream” for profiling and debugging purposes.

  14. nvtxNameCudaEventA(result.compute_nccl_sync, "nccl compute sync");: This line names the CUDA event “nccl compute sync” for profiling and debugging purposes.

This function is designed to be used in a multi-GPU environment where multiple processes are running in parallel, each possibly on a different GPU.

The use of MPI allows these processes to communicate with each other, and the use of NCCL allows them to collectively perform operations such as reductions and broadcasts.

The CUDA streams and events are used for managing and synchronizing the GPU computations.

The NVTX calls are used for profiling and debugging the GPU code.

Note that this code assumes that the MULTI_GPU macro is defined; if it is not, this function will not do anything.

Is there any way to run on multiple GPUs without invoking it like this? (See the sketch after the commands.)

make train_gpt2cu
mpirun -np <number of GPUs> ./train_gpt2cu
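From the NCCL docs, MPI is not strictly required: a single process can drive all local GPUs with ncclCommInitAll, and llm.c uses MPI mainly to get one process per GPU (and to span multiple nodes). Here is a hedged sketch of the single-process style, adapted from the NCCL examples; it is not how llm.c itself runs:

#include <stdio.h>
#include <cuda_runtime.h>
#include <nccl.h>

// One process drives every local GPU; no MPI involved.
int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 8) ndev = 8;

    ncclComm_t comms[8];
    cudaStream_t streams[8];
    float* bufs[8];
    int devs[8];
    for (int i = 0; i < ndev; i++) devs[i] = i;

    // One communicator per local device, created in a single call.
    ncclCommInitAll(comms, ndev, devs);

    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(devs[i]);
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&bufs[i], 1024 * sizeof(float));
        cudaMemset(bufs[i], 0, 1024 * sizeof(float));
    }

    // Per-device collective calls issued from one thread must be grouped.
    ncclGroupStart();
    for (int i = 0; i < ndev; i++) {
        ncclAllReduce(bufs[i], bufs[i], 1024, ncclFloat, ncclSum, comms[i], streams[i]);
    }
    ncclGroupEnd();

    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(devs[i]);
        cudaStreamSynchronize(streams[i]);
        cudaFree(bufs[i]);
        ncclCommDestroy(comms[i]);
    }
    printf("single-process all-reduce across %d GPUs done\n", ndev);
    return 0;
}

It should compile with just nvcc and -lnccl (no -lmpi), assuming the NCCL headers and libraries are on the search paths.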

So now I will run a very simple MPI + CUDA program to test the CUDA and MPI environment on my HPC cluster.

#include <stdio.h>
#include <cuda_runtime.h>
#include <mpi.h>

__global__ void helloFromGPU(void) {
    printf("Hello World from GPU %d!\n", blockIdx.x);
}

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(&argc, &argv);

    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Set the device to the rank of the current MPI process
    cudaSetDevice(world_rank);

    printf("Hello World from CPU of MPI process %d!\n", world_rank);

    // Launch the kernel on the GPU
    helloFromGPU<<<world_size, 1>>>();

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    // Finalize the MPI environment.
    MPI_Finalize();
}

(base) $ nvcc -I/GPUFS/nsccgz_qylin_1/local/include -L/GPUFS/nsccgz_qylin_1/local/lib  -lmpi hello_mpi_cuda.cu -o hello_mpi_cuda
(base) $ ls
hello.cu  hello_cuda  hello_mpi  hello_mpi.c  hello_mpi_cuda  hello_mpi_cuda.cu
(base) $ mpirun -np 2 ./hello_mpi_cuda
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            gpu29
  Device name:           mlx5_9
  Device vendor ID:      0x02c9
  Device vendor part ID: 4116

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   gpu29
  Local device: mlx5_9
--------------------------------------------------------------------------
Hello World from CPU of MPI process 0!
Hello World from CPU of MPI process 1!
Hello World from GPU 0!
Hello World from GPU 1!
Hello World from GPU 0!
Hello World from GPU 1!
[gpu29:229661] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[gpu29:229661] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[gpu29:229661] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init

nvcc is to CUDA code what gcc is to C code: the compiler.

What is NCCL? Basically, NCCL is just like MPI: it does collective communication, but between GPUs.
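To make the analogy concrete, here is a hedged comparison sketch of the same "sum across all workers" collective done both ways; comm and stream are assumed to come from an init like multi_gpu_config_init above:

#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

// MPI reduces buffers that live in host memory, across processes...
void sum_on_cpu(float* host_buf, int n) {
    MPI_Allreduce(MPI_IN_PLACE, host_buf, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
}

// ...while NCCL reduces buffers that live in GPU memory, across GPUs,
// moving the data over NVLink/PCIe/InfiniBand under the hood.
void sum_on_gpu(float* dev_buf, int n, ncclComm_t comm, cudaStream_t stream) {
    ncclAllReduce(dev_buf, dev_buf, (size_t)n, ncclFloat, ncclSum, comm, stream);
    cudaStreamSynchronize(stream);
}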

This struct is used as the config that holds information about each GPU process:

// Parameters specific to training on multiple GPUs.
typedef struct {
  int process_rank;      // Rank of this process among all MPI processes on all hosts. 0 if no multi-GPU.
  int num_processes;     // Total number of processes on all hosts. 1 if no multi-GPU.
  int local_device_idx;  // This process GPU index on current machine. 0 if no multi-GPU.
  ncclComm_t nccl_comm;  // NCCL communication primitive, used for collective mutli-GPU work.
} MultiGpuConfig;

// Determine which GPU this process should use.
// Processes on the same machines use different GPU indicies. Processes on other machines don't.
// Copied from NCCL examples: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/examples.html#example-2-one-device-per-process-or-thread
int multi_gpu_get_local_device_idx(int process_rank, int num_processes) {
  char hostname[1024];
  hostname[1023] = '\0';
  // All processes on the same machine will share the same hostname.
  gethostname(hostname, 1023);
  for (int i=0; i < 1024; i++) {
    if (hostname[i] == '.') {
        hostname[i] = '\0';
        break;
    }
  }
  uint64_t hostname_hash = 5381;
  for (int c = 0; hostname[c] != '\0'; c++){ hostname_hash = ((hostname_hash << 5) + hostname_hash) ^ hostname[c]; }

  // Distribute all hostname hashes to all processes.
  uint64_t* all_hostsname_hashes = (uint64_t*)malloc(num_processes * sizeof(uint64_t));
  all_hostsname_hashes[process_rank] = hostname_hash;
  mpiCheck(MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL, all_hostsname_hashes, sizeof(uint64_t), MPI_BYTE, MPI_COMM_WORLD));

  // Identify which GPU we need to use.
  int local_device_idx = 0;
  for (int current_process = 0; current_process < num_processes; ++current_process) {
     if (current_process == process_rank) {
      // Found my gpu, local_device_idx now has my target GPU index.
      break;
     }
     if (all_hostsname_hashes[current_process] == all_hostsname_hashes[process_rank]) {
      // This process ID runs on the same machine, but it's not me, skip this GPU
      local_device_idx++;
     }
  }

  free(all_hostsname_hashes);
  return local_device_idx;
}

Got this error:

error: identifier "ncclAvg" is undefined

I checked the NCCL commit history and found that ncclAvg was added in version 2.10.3, but I only have nccl/2.9.9-1-cuda-11.0.

What should I do?
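One workaround I could try: an average is just a sum divided by the number of ranks, and ncclSum exists in every NCCL version. A hedged sketch (my own idea, not a patch from the repo), assuming a float gradient buffer plus the comm and stream from the config above:

#include <cuda_runtime.h>
#include <nccl.h>

// Multiply every element by `factor` (here, 1/num_processes).
__global__ void scale_kernel(float* x, size_t n, float factor) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

// Emulate ncclAvg (added in NCCL 2.10.3) on older NCCL: sum, then scale.
void allreduce_avg(float* grads, size_t n, int num_processes,
                   ncclComm_t comm, cudaStream_t stream) {
    ncclAllReduce(grads, grads, n, ncclFloat, ncclSum, comm, stream);
    int block = 256;
    int grid = (int)((n + block - 1) / block);
    scale_kernel<<<grid, block, 0, stream>>>(grads, n, 1.0f / num_processes);
}

The ncclBfloat16 error looks like the same story (bf16 support also landed in the 2.10 series), which would explain why only the FP32 path has a chance of building against NCCL 2.9.9.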

Just learned that I can use conda to install CUDA and NCCL. Let's try it.

So what is conda, and how does it work? I think conda is just a package manager, like apt on Ubuntu; conda handles environment management as well as package installation.

apt handles package installation and updates.

Spack is another package manager, commonly used on HPC clusters.

https://stackoverflow.com/questions/77873047/what-are-the-key-differences-between-spack-and-conda-package-managers

 conda install nvidia::cuda-toolkit

Got another compilation error:

/GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/bin/nvcc -O3 -t=0 --use_fast_math -std=c++17 --generate-code arch=compute_70,code=[compute_70,sm_70] -DMULTI_GPU -DENABLE_FP32 train_gpt2.cu -lcublas -lcublasLt -L/GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/lib/  -I/GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/include/  -lmpi -lnccl -o train_gpt2cu
/GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/bin/../lib/gcc/x86_64-conda-linux-gnu/12.3.0/../../../../x86_64-conda-linux-gnu/bin/ld: /GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/lib//libcublas.so: undefined reference to `memcpy@GLIBC_2.14'
/GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/bin/../lib/gcc/x86_64-conda-linux-gnu/12.3.0/../../../../x86_64-conda-linux-gnu/bin/ld: /GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/bin/../lib/libpmix.so.2: undefined reference to `clock_gettime@GLIBC_2.17'
collect2: error: ld returned 1 exit status
make: *** [train_gpt2cu] Error 255
(llmc) [nsccgz_qylin_1@gpu29%tianhe2-K llm.c]$ gcc --version
gcc (conda-forge gcc 12.3.0-11) 12.3.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Check the glibc version:

(llmc) [nsccgz_qylin_1@gpu29%tianhe2-K llm.c]$ ldd --version
ldd (GNU libc) 2.17
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
(llmc) [nsccgz_qylin_1@gpu29%tianhe2-K llm.c]$ which ldd
/usr/bin/ldd

I think I need to check which glibc the conda env's toolchain actually uses. How can I do this? I cannot install a specific glibc version with conda.
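From what I understand, glibc never comes from conda: at runtime a binary uses the host's glibc, while the conda compiler links against its own bundled sysroot, which may be older than what libraries like libcublas need — that would explain the undefined memcpy@GLIBC_2.14 even though the host has 2.17. A hedged sketch to print which glibc a program actually runs against:

#include <stdio.h>
#include <gnu/libc-version.h>

// Print the glibc version this process is linked against at runtime.
int main(void) {
    printf("glibc: %s\n", gnu_get_libc_version());
    return 0;
}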

Removed the conda env and created a new one. Install nccl first:

    _libgcc_mutex-0.1          |      conda_forge           3 KB  conda-forge
    _openmp_mutex-4.5          |            2_gnu          23 KB  conda-forge
    cuda-version-12.5          |       hd4f0392_3          21 KB  conda-forge
    libgcc-ng-13.2.0           |      h77fa898_11         777 KB  conda-forge
    libgomp-13.2.0             |      h77fa898_11         434 KB  conda-forge
    libstdcxx-ng-13.2.0        |      hc0a3c3a_11         3.7 MB  conda-forge
    nccl-2.22.3.1              |       hbc370b7_0       107.3 MB  conda-forge
    ------------------------------------------------------------
                                           Total:       112.2 MB

The following NEW packages will be INSTALLED:

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
  _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-2_gnu
  cuda-version       conda-forge/noarch::cuda-version-12.5-hd4f0392_3
  libgcc-ng          conda-forge/linux-64::libgcc-ng-13.2.0-h77fa898_11
  libgomp            conda-forge/linux-64::libgomp-13.2.0-h77fa898_11
  libstdcxx-ng       conda-forge/linux-64::libstdcxx-ng-13.2.0-hc0a3c3a_11
  nccl               conda-forge/linux-64::nccl-2.22.3.1-hbc370b7_0
conda install nvidia/label/cuda-12.5.0::cuda-toolkit

Adjust the conda channel priority:

 conda config --describe channel_priority
conda config --set channel_priority flexible
conda config --prepend channels conda-forge
conda config --prepend channels nvidia
conda config --show channels

Error: no nvtx3, and no OpenMPI found:

→ cuDNN is manually disabled by default, run make with `USE_CUDNN=1` to try to enable
✓ OpenMP found
✗ OpenMPI is not found, disabling multi-GPU support
---> On Linux you can try install OpenMPI with `sudo apt install openmpi-bin openmpi-doc libopenmpi-dev`
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
/GPUFS/nsccgz_qylin_1/miniconda3/envs/llmcc/bin/nvcc -O3 -t=0 --use_fast_math -std=c++17 --generate-code arch=compute_70,code=[compute_70,sm_70] -DENABLE_FP32 train_gpt2.cu -lcublas -lcublasLt   -o train_gpt2cu
In file included from train_gpt2.cu:34:
llmc/cuda_common.h:12:10: fatal error: nvtx3/nvToolsExt.h: No such file or directory
   12 | #include <nvtx3/nvToolsExt.h>
      |          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.
In file included from train_gpt2.cu:34:
llmc/cuda_common.h:12:10: fatal error: nvtx3/nvToolsExt.h: No such file or directory
   12 | #include <nvtx3/nvToolsExt.h>
      |          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.
make: *** [train_gpt2cu] Error 255

I removed the conda-forge channel and tried to install CUDA from the main Anaconda channel.

The channel config is stored in ~/.condarc.

conda info
conda remove --name myenv --all

Fixed the OpenMPI issue after installing CUDA from the main channel:

(llmc) [nsccgz_qylin_1@ln101 llm.c]$ PRECISION=FP32 make train_gpt2cu
__nvcc_device_query failed to call cudaLoader::cuInit(0) with error 0x64 (CUDA_ERROR_NO_DEVICE)
---------------------------------------------
→ cuDNN is manually disabled by default, run make with `USE_CUDNN=1` to try to enable
✓ OpenMP found
✓ OpenMPI found, OK to train with multiple GPUs
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
/GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/bin/nvcc -O3 -t=0 --use_fast_math -std=c++17 -DMULTI_GPU -DENABLE_FP32 train_gpt2.cu -lcublas -lcublasLt -L/GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/lib/  -I/GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/include/  -lmpi -lnccl -o train_gpt2cu
nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
In file included from train_gpt2.cu:34:
llmc/cuda_common.h:12:10: fatal error: nvtx3/nvToolsExt.h: No such file or directory
   12 | #include <nvtx3/nvToolsExt.h>
      |          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.
In file included from train_gpt2.cu:34:
llmc/cuda_common.h:12:10: fatal error: nvtx3/nvToolsExt.h: No such file or directory
   12 | #include <nvtx3/nvToolsExt.h>
      |          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.
make: *** [train_gpt2cu] Error 255

I forgot that I should ssh to a compute node. Let's try again.

Still getting the missing-nvtx3 error:

(llmc) [nsccgz_qylin_1@gpu30%tianhe2-K llm.c]$ PRECISION=FP32 make train_gpt2cu
---------------------------------------------
→ cuDNN is manually disabled by default, run make with `USE_CUDNN=1` to try to enable
✓ OpenMP found
✓ OpenMPI found, OK to train with multiple GPUs
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
/GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/bin/nvcc -O3 -t=0 --use_fast_math -std=c++17 --generate-code arch=compute_70,code=[compute_70,sm_70] -DMULTI_GPU -DENABLE_FP32 train_gpt2.cu -lcublas -lcublasLt -L/GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/lib/  -I/GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/include/  -lmpi -lnccl -o train_gpt2cu
In file included from train_gpt2.cu:34:
llmc/cuda_common.h:12:10: fatal error: nvtx3/nvToolsExt.h: No such file or directory
   12 | #include <nvtx3/nvToolsExt.h>
      |          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.
In file included from train_gpt2.cu:34:
llmc/cuda_common.h:12:10: fatal error: nvtx3/nvToolsExt.h: No such file or directory
   12 | #include <nvtx3/nvToolsExt.h>
      |          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.
make: *** [train_gpt2cu] Error 255

Got an installation error again:

(llmc) [nsccgz_qylin_1@gpu30%tianhe2-K llm.c]$ conda install cuda-nvtx -c nvidia
Channels:
 - nvidia
 - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
 - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: failed

InvalidSpec: The package "nvidia/linux-64::cuda==12.5.0=0" is not available for the specified platform

(llmc) [nsccgz_qylin_1@gpu30%tianhe2-K llm.c]$ exit

https://github.com/NVIDIA/NVTX — I will manually pull the header files onto the node if I can't solve the issue this time by installing the 12.0.0 version of CUDA:

conda install nvidia/label/cuda-12.0.0::cuda-toolkit

Fixed the nvtx header issue after copying the headers to the system path. Next, nccl.h was missing, so I copied the NCCL headers and libraries into the conda env:

In file included from train_gpt2.cu:65:
llmc/zero.cuh:16:10: fatal error: nccl.h: No such file or directory
   16 | #include <nccl.h>
      |          ^~~~~~~~
compilation terminated.
In file included from train_gpt2.cu:65:
llmc/zero.cuh:16:10: fatal error: nccl.h: No such file or directory
   16 | #include <nccl.h>
      |          ^~~~~~~~
compilation terminated.
(llmc) [nsccgz_qylin_1@gpu30%tianhe2-K nccl_2.22.3-1+cuda12.4_x86_64]$ cp -r include/* /GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/include/
(llmc) [nsccgz_qylin_1@gpu30%tianhe2-K nccl_2.22.3-1+cuda12.4_x86_64]$ cp -r lib/* /GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/lib
nvcc -O3 -t=0 --use_fast_math -std=c++17 --generate-code arch=compute_70,code=[compute_70,sm_70] -DMULTI_GPU -DENABLE_FP32 train_gpt2.cu -lcublas -lcublasLt -L/GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/lib/  -I/GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/include/  -lmpi -lnccl -o train_gpt2cu
/GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/bin/../lib/gcc/x86_64-conda-linux-gnu/11.2.0/../../../../x86_64-conda-linux-gnu/bin/ld: /GPUFS/nsccgz_qylin_1/local/lib/libopen-pal.so.40: undefined reference to `pci_device_cfg_read'

/GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/lib/ was not on the library search path when the ld step ran, so I need to include it in LD_LIBRARY_PATH before building.

(llmc) [nsccgz_qylin_1@gpu30%tianhe2-K llm.c]$ echo $LD_LIBRARY_PATH
/GPUFS/nsccgz_qylin_1/local/lib:/GPUFS/nsccgz_qylin_1/local/lib:/GPUFS/nsccgz_qylin_1/t/spack/opt/spack/linux-centos7-haswell/gcc-4.8.5/libevent-2.1.12-oysfi7miuhw62ginwjrr2uy6yldr2oav/lib:/GPUFS/app_GPU/application/anaconda3/5.3.1/envs/python-3.6/lib:/GPUFS/nsccgz_qylin_1/ryz/MARL-test/util/lib:/GPUFS/nsccgz_qylin_1/ryz/icf_test/build_GPTL/gptl_gcc/lib::/GPUFS/nsccgz_qylin_1/software/spack/opt/spack/linux-centos7-skylake_avx512/

Should I switch to spack next time?

Finally, I am able to compile train_gpt2cu after several days of failing and retrying. I put the conda lib directory first in LD_LIBRARY_PATH and it works.

(llmc) [nsccgz_qylin_1@gpu30%tianhe2-K llm.c]$ export LD_LIBRARY_PATH=/GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/lib/:$LD_LIBRARY_PATH
(llmc) [nsccgz_qylin_1@gpu30%tianhe2-K llm.c]$ echo $LD_LIBRARY_PATH
/GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/lib/:/GPUFS/nsccgz_qylin_1/local/lib:/GPUFS/nsccgz_qylin_1/local/lib:/GPUFS/nsccgz_qylin_1/t/spack/opt/spack/linux-centos7-haswell/gcc-4.8.5/libevent-2.1.12-oysfi7miuhw62ginwjrr2uy6yldr2oav/lib:/GPUFS/app_GPU/application/anaconda3/5.3.1/envs/python-3.6/lib:/GPUFS/nsccgz_qylin_1/ryz/MARL-test/util/lib:/GPUFS/nsccgz_qylin_1/ryz/icf_test/build_GPTL/gptl_gcc/lib::/GPUFS/nsccgz_qylin_1/software/spack/opt/spack/linux-centos7-skylake_avx512/:/GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/lib/

Issue: the CUDA driver version is not compatible with the CUDA runtime version. What should I do?


On gpu30:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   39C    P0    38W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:8B:00.0 Off |                    0 |
| N/A   34C    P0    38W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:B3:00.0 Off |                    0 |
| N/A   33C    P0    37W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:B4:00.0 Off |                    0 |
| N/A   36C    P0    51W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
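The nvidia-smi header's "CUDA Version: 11.2" is the newest CUDA runtime this 460.73.01 driver supports, while the conda toolkit is CUDA 12.x, hence the incompatibility. A hedged sketch to print both versions from code:

#include <stdio.h>
#include <cuda_runtime.h>

// Compare the driver's supported CUDA version with the runtime this
// binary was built against; runtime > driver means trouble.
int main(void) {
    int driver_ver = 0, runtime_ver = 0;
    cudaDriverGetVersion(&driver_ver);    // e.g. 11020 for a CUDA 11.2 driver
    cudaRuntimeGetVersion(&runtime_ver);  // e.g. 12000 for a CUDA 12.0 runtime
    printf("driver supports up to CUDA %d.%d, runtime is CUDA %d.%d\n",
           driver_ver / 1000, (driver_ver % 1000) / 10,
           runtime_ver / 1000, (runtime_ver % 1000) / 10);
    return 0;
}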

Guess I have to install CUDA 11.2 to match the driver version, and download an NCCL build for CUDA 11.

However, I can't find a cuda-toolkit package compatible with CUDA 11.2 in conda.

I think one solution is to compile NCCL manually on this compute node against the old CUDA driver.

But for now I'll just switch to the A800 nodes, which come with a CUDA 12.0 driver. Let's go. I have two A800s available, so I can test cross-node training and multi-GPU training. This is great.

Issue: it seems the program is loading a .bin file with _bf16 in the name while the binary was built for FP32:

Precision is configured as FP32 but model at gpt2_124M_bf16.bin is not.
---> HINT: to turn on FP32 you have to compile like: `make train_gpt2cu PRECISION=FP32`
---> HINT: are you sure you're loading a .bin file without any _bf16 in the name?
Precision is configured as FP32 but model at gpt2_124M_bf16.bin is not.
---> HINT: to turn on FP32 you have to compile like: `make train_gpt2cu PRECISION=FP32`
---> HINT: are you sure you're loading a .bin file without any _bf16 in the name?
Precision is configured as FP32 but model at gpt2_124M_bf16.bin is not.
---> HINT: to turn on FP32 you have to compile like: `make train_gpt2cu PRECISION=FP32`
---> HINT: are you sure you're loading a .bin file without any _bf16 in the name?
Precision is configured as FP32 but model at gpt2_124M_bf16.bin is not.
---> HINT: to turn on FP32 you have to compile like: `make train_gpt2cu PRECISION=FP32`
---> HINT: are you sure you're loading a .bin file without any _bf16 in the name?

Finally, I think I have successfully run model training on the two A800 GPUs. Here are the compile and run scripts.

compile.sh

export LD_LIBRARY_PATH=/GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/lib/:$LD_LIBRARY_PATH
echo $LD_LIBRARY_PATH
yhrun -n1 -p GPU_A800 make train_gpt2cu

run.sh

#!/bin/bash
conda activate llmc
export LD_LIBRARY_PATH=/GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/lib/:$LD_LIBRARY_PATH
echo $LD_LIBRARY_PATH
yhrun -n2 -p GPU_A800 /GPUFS/nsccgz_qylin_1/miniconda3/envs/llmc/bin/mpirun -np 2 train_gpt2cu

Output:

| weight init method    | OpenAI's GPT-2 checkpoint                          |
| max_sequence_length T | 1024                                               |
| vocab_size V          | 50257                                              |
| padded_vocab_size Vp  | 50304                                              |
| num_layers L          | 12                                                 |
| num_heads NH          | 12                                                 |
| channels C            | 768                                                |
| num_parameters        | 124475904                                          |
+-----------------------+----------------------------------------------------+
| weight init method    | OpenAI's GPT-2 checkpoint                          |
| max_sequence_length T | 1024                                               |
| vocab_size V          | 50257                                              |
| padded_vocab_size Vp  | 50304                                              |
| num_layers L          | 12                                                 |
| num_heads NH          | 12                                                 |
| channels C            | 768                                                |
| num_parameters        | 124475904                                          |
+-----------------------+----------------------------------------------------+
| train_num_batches     | 37                                                 |
| val_num_batches       | 20                                                 |
+-----------------------+----------------------------------------------------+
| train_num_batches     | 37                                                 |
| val_num_batches       | 20                                                 |
+-----------------------+----------------------------------------------------+
| run hellaswag         | no                                                 |
+-----------------------+----------------------------------------------------+
| Zero Optimization is disabled                                              |
| num_processes         | 2                                                  |
| zero_stage            | 0                                                  |
+-----------------------+----------------------------------------------------+
HellaSwag eval not found at dev/data/hellaswag/hellaswag_val.bin, skipping its evaluation
You can run `python dev/data/hellaswag.py` to export and use it with `-h 1`.
num_parameters: 124475904 => bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=4 * seq_len T=1024 * num_processes=2 and total_batch_size=8192

step   15/37 | train loss 3.640566 | norm 2.2486 | lr 3.00e-04 | 268.99 ms | -100.0% bf16 MFU | 63701 tok/s
step   14/37 | train loss 3.381429 | norm 2.5808 | lr 3.00e-04 | 287.48 ms | -100.0% bf16 MFU | 57996 tok/s
step   16/37 | train loss 3.413064 | norm 2.0632 | lr 3.00e-04 | 174.35 ms | -100.0% bf16 MFU | 62144 tok/s
step   15/37 | train loss 3.640008 | norm 2.3061 | lr 3.00e-04 | 147.72 ms | -100.0% bf16 MFU | 57748 tok/s
step   17/37 | train loss 3.584605 | norm 2.0251 | lr 3.00e-04 | 158.54 ms | -100.0% bf16 MFU | 61209 tok/s
step   16/37 | train loss 3.412876 | norm 2.0334 | lr 3.00e-04 | 135.09 ms | -100.0% bf16 MFU | 58018 tok/s
step   18/37 | train loss 3.486408 | norm 1.6377 | lr 3.00e-04 | 115.45 ms | -100.0% bf16 MFU | 62047 tok/s
step   17/37 | train loss 3.584733 | norm 2.0324 | lr 3.00e-04 | 141.30 ms | -100.0% bf16 MFU | 58014 tok/s
step   19/37 | train loss 3.450470 | norm 2.1760 | lr 3.00e-04 | 120.89 ms | -100.0% bf16 MFU | 62521 tok/s
step   18/37 | train loss 3.487690 | norm 1.6469 | lr 3.00e-04 | 128.45 ms | -100.0% bf16 MFU | 58510 tok/s
step   20/37 | train loss 3.542054 | norm 2.2739 | lr 3.00e-04 | 66.91 ms | -100.0% bf16 MFU | 67331 tok/s
step   19/37 | train loss 3.451924 | norm 2.1979 | lr 3.00e-04 | 118.83 ms | -100.0% bf16 MFU | 59375 tok/s
step   20/37 | train loss 3.544027 | norm 2.2397 | lr 3.00e-04 | 108.95 ms | -100.0% bf16 MFU | 60644 tok/s

Roughly 6x the throughput compared to the V100 run.

However, I am not sure whether this truly accelerates training. How can I verify that multi-GPU training speeds up the training process? One sanity check would be to fix the total batch size and compare time per step (or time to a target validation loss) between the 1-GPU and 2-GPU runs; the interleaved duplicate step numbers in the log above are also worth double-checking, to make sure the two ranks really formed one job rather than two independent runs.

CPU

pip install -r requirements.txt
python dev/data/tinyshakespeare.py
python train_gpt2.py
make train_gpt2
OMP_NUM_THREADS=8 ./train_gpt2

Output

step 20: train loss 4.527330 (took 2636.617334 ms)
step 21: train loss 4.065797 (took 2701.692621 ms)
step 22: train loss 3.965316 (took 2681.297241 ms)
step 23: train loss 3.449409 (took 2650.111416 ms)
step 24: train loss 4.490954 (took 2637.116332 ms)
step 25: train loss 4.035361 (took 2659.843151 ms)
step 26: train loss 3.445302 (took 2652.557792 ms)
step 27: train loss 3.993789 (took 2649.868369 ms)
step 28: train loss 4.199468 (took 2638.095098 ms)
step 29: train loss 4.538460 (took 2669.385015 ms)
val loss 4.350866
step 30: train loss 4.306292 (took 2658.306411 ms)
step 31: train loss 4.851407 (took 2634.616368 ms)
step 32: train loss 4.577479 (took 2670.470130 ms)
step 33: train loss 4.124943 (took 2660.545565 ms)
step 34: train loss 4.330319 (took 2669.532886 ms)
step 35: train loss 3.399416 (took 2639.378693 ms)
step 36: train loss 3.661207 (took 2632.377219 ms)
step 37: train loss 3.330453 (took 2637.114896 ms)
step 38: train loss 3.567853 (took 2645.744510 ms)
step 39: train loss 3.902004 (took 2635.939546 ms)
val loss 4.319361
generating:
---
EditBOOK IX:
Under the boasted sute of Georges:
So lordly is the prize had sin is high;
Hell is the way to God: frankish friends from blessed daughters
To Bermuda have heard the saying,
Then how to place the artscape.
Strong should a bellow
---
step 40: train loss 3.952987 (took 2665.948189 ms)

Some questions: how many low-end GPUs are out there in the market? I am thinking about utilizing low-end GPUs to train models, large or small.



