Systems for machine learning papers

For each paper I will give a brief summary and my thoughts on it. Where possible, I will link each paper to related papers. I might also add code and related resources if I have time.

For most papers I will first read the title and abstract to decide whether I should read the rest of the paper. This helps me quickly filter for papers that are interesting to me.

So for the majority of papers I will keep the summary short. For the ones that are interesting to me, I will write a longer summary.

Characterization of Large Language Model Development in the Datacenter

https://www.usenix.org/system/files/nsdi24-hu.pdf

Summary: This paper studies LLM training job workloads in the datacenter. It releases job traces from LLM training in the datacenter. It mentions that GPUs take 65% of power usage.

I didn’t find any interesting contribution in this paper other than the job traces, so I won’t spend more time reading it.

Does this paper mention LLM serving?

Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances

https://www.usenix.org/conference/nsdi24/presentation/duan

Preemptive scheduling and checkpointing?



