Systems for machine learning papers
For each paper, I will give a brief summary and my thoughts on it. Where possible, I will link each paper to related papers. I may also add code and related resources if I have time.
For most papers, I will first read the title and abstract to decide whether to read the rest of the paper. This helps me quickly pick out the papers that interest me.
So for the majority of papers, I will keep the summary short. For the ones that interest me, I will write a longer summary.
Characterization of Large Language Model Development in the Datacenter
https://www.usenix.org/system/files/nsdi24-hu.pdf
Summary: This paper studies LLM training job workloads in the datacenter. It releases datacenter job traces for LLM training. It mentions that GPUs account for 65% of power usage.
I didn’t find any interesting contribution in this paper other than the job traces, so I won’t spend more time reading it.
Does this paper mention LLM serving?
Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances
Preemptive scheduling and checkpointing? https://www.usenix.org/conference/nsdi24/presentation/duan