ICML 2024 Orals (top 10%): Summaries of interesting papers
ICML, one of the top machine learning conferences, is happening this week. I am excited about a few tutorials, orals & posters, and, most importantly, the workshops towards the end of the week. I found the following oral papers interesting and wrote friendly summaries.
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision (From OpenAI)
Currently, LLMs are aligned with Reinforcement Learning from Human Feedback (RLHF), i.e., we label whether a response followed the human intent in the query and whether a response is safe. But when LLMs become superhuman, we will not be able to label their responses to make them better or aligned with human values. For instance, if an LLM generated a code repository of 1 million lines, we could not easily verify that the code is safe or that it followed the user intent. In this scenario, humans (weak agents) have to somehow align superhuman LLMs (strong agents).
This paper from OpenAI presents an empirical investigation of this question. As an analogy to using human labels to align/train superhuman LLMs, the authors set up experiments in which a strong model (GPT-4) is trained on labels produced by a weak model (GPT-2). If GPT-4 learns the task from GPT-2's labels, they call it weak-to-strong generalization.
The paper starts by defining three concepts:
1. Weak performance: GPT-2 trained on ground-truth labels for a task.
2. Strong ceiling performance: GPT-4 trained on ground-truth labels.
3. Weak-to-strong performance: GPT-4 trained on labels from GPT-2. (First, GPT-2 is trained on the task and then labels data points from a held-out dataset; this weakly labeled dataset is used to train GPT-4.)
Now, the below passage from the paper defines the performance gap recovered (PGR): the extent to which the strong student (GPT-4) recovers the strong ceiling performance using only labels from a weak supervisor (GPT-2). A high PGR means that weak supervisors can train strong students, i.e., humans could align superhuman LLMs in the future.
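In symbols, using the three quantities defined above:

$$\mathrm{PGR} = \frac{\text{weak-to-strong performance} - \text{weak performance}}{\text{strong ceiling performance} - \text{weak performance}}$$

A PGR of 1 means the strong student recovered the full ceiling performance despite the weak labels; a PGR of 0 means it did no better than its weak supervisor.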
Fortunately, weak-to-strong generalization is possible, as shown across diverse tasks: around 80% PGR on NLP tasks and over 60% on reward modeling tasks, as seen in the figure below. Chess puzzles saw the least weak-to-strong generalization, with PGR below 40%.
The paper also proposed approaches to improve weak-to-strong generalization:
Auxiliary loss based on the strong student’s confidence: While training with labels from a weak supervisor (GPT-2), the strong student (GPT-4) can overfit to the supervisor’s mistakes. The paper proposes adding a loss term based on the strong student’s confidence, helping it stick to its own prediction when it conflicts with the weak supervisor (see the sketch after this list).
Bootstrapping: Instead of the weak model (GPT-2) supervising the strong model (GPT-4) directly, this approach uses intermediate models between the weak and strong ones. For instance, if the weak model has 1B parameters and the strong model has 100B, a sequence of 2B, 4B, 8B, 16B, 32B, and 64B models would be used. The bootstrap process trains the 2B model with weak labels, the 4B model with the 2B model’s labels, the 8B model with the 4B model’s labels, and so on; the strong model is trained with labels from the model just before it in the sequence.
Unsupervised generative fine-tuning: Sometimes simply training with next-token prediction on the task’s data, without labels, helps the model learn the right representations or elicits the required behavior for that task. This approach adds an unsupervised fine-tuning stage on all data points, without labels, before the supervised fine-tuning.
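To make the first two techniques concrete, here is a minimal PyTorch-flavored sketch. The mixing weight `alpha`, the argmax hardening, and the `fit`/`predict_labels` helpers are illustrative stand-ins, not the paper's exact implementation (the paper, for example, ramps the auxiliary-loss weight up over training and hardens predictions with a threshold).

```python
import torch
import torch.nn.functional as F

def confidence_aux_loss(student_logits: torch.Tensor,
                        weak_labels: torch.Tensor,
                        alpha: float = 0.5) -> torch.Tensor:
    """Technique 1: mix cross-entropy against the weak supervisor's labels
    with cross-entropy against the student's own hardened (argmax)
    predictions, so a confident student can override weak-label mistakes
    instead of overfitting to them."""
    ce_weak = F.cross_entropy(student_logits, weak_labels)
    hard_self = student_logits.argmax(dim=-1).detach()  # student's own guess
    ce_self = F.cross_entropy(student_logits, hard_self)
    return (1.0 - alpha) * ce_weak + alpha * ce_self

def bootstrap_chain(models_small_to_large, inputs, weak_labels):
    """Technique 2: each model in the size ladder trains on labels produced
    by the previous, smaller model. `fit` and `predict_labels` are
    hypothetical helpers standing in for a full fine-tuning loop."""
    labels = weak_labels
    for model in models_small_to_large:  # e.g. 2B, 4B, ..., 64B, strong
        model.fit(inputs, labels)
        labels = model.predict_labels(inputs)
    return models_small_to_large[-1]  # the strong model
```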
The paper showed that these approaches improved weak-to-strong generalization compared to standard fine-tuning with weak labels (notice the difference between the dotted and solid lines in the figure).
Interpreting and Improving Large Language Models in Arithmetic Calculation
Although Large Language Models (LLMs) show impressive performance in solving math word problems and perform decent arithmetic calculations, we don’t yet understand how they do mathematical reasoning. This paper shines a light on the mechanisms that underlie it.
For the templates of arithmetic calculations shown in the above figure, the authors tested LLaMA2-7B and 13B models, where the input, for instance, would be ‘3 + 5 =’ and the expected output would be ‘8’. The first step in their investigation is to understand which attention heads affect the prediction of the result token ‘8’.
Through a technique called path patching, they observed that fewer than 5% of attention heads are important for performing arithmetic calculations. You can see in the figure below that very few heads or MLPs have darker colors.
They also observed that removing the identified “arithmetic” heads degraded performance catastrophically compared to removing the same number of random attention heads.
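For intuition, here is one way such a head “knockout” can be implemented on a HuggingFace LLaMA-style model: zero out a head’s slice of the input to the attention output projection. The module path (`model.model.layers[i].self_attn.o_proj`) and the choice of zero-ablation are assumptions for illustration; path patching itself patches in activations from a different run rather than zeros.

```python
def add_head_knockout_hooks(model, heads_to_ablate, head_dim=128):
    """Zero-ablate attention heads given as (layer_idx, head_idx) pairs.
    Assumes LLaMA-style modules: model.model.layers[i].self_attn.o_proj,
    whose input is the concatenation of all heads' outputs."""
    by_layer = {}
    for layer, head in heads_to_ablate:
        by_layer.setdefault(layer, []).append(head)

    handles = []
    for layer, heads in by_layer.items():
        o_proj = model.model.layers[layer].self_attn.o_proj

        def pre_hook(module, args, heads=heads):  # bind heads per layer
            (hidden,) = args  # shape: (batch, seq, n_heads * head_dim)
            hidden = hidden.clone()
            for h in heads:  # erase each ablated head's contribution
                hidden[..., h * head_dim:(h + 1) * head_dim] = 0.0
            return (hidden,)

        handles.append(o_proj.register_forward_pre_hook(pre_hook))
    return handles  # call handle.remove() on each to restore the model
```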
Interestingly, they observed specific attention heads that attend to operators (+, -, ...) and operands (numbers). Figure 5 below shows attention weights peaking at the positions of operators and operands, respectively.
The authors also investigated whether the same “arithmetic” heads play a crucial role in mathematical reasoning on other datasets. Figure 4 below shows that after removing (knocking out) the “arithmetic” heads, the LLM produces wrong answers for data points that it predicted correctly before.
Having identified the components that perform arithmetic reasoning, they propose fine-tuning only the “arithmetic” heads (about 10% of parameters) to improve performance on mathematical reasoning tasks like GSM8K and SVAMP. Dubbed Precise SFT in the figure below, it shows improvements over full SFT with 3x training speed. It is interesting to note that precise fine-tuning on mathematical tasks didn’t hurt performance on generic tasks as much as full fine-tuning did.
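One way to approximate this kind of head-level fine-tuning with standard tooling is to freeze the whole model and mask gradients so that only the q/k/v projection rows belonging to the identified heads receive updates. This is a sketch under the same LLaMA-style naming assumptions as above, not the authors’ code:

```python
import torch

def tune_only_heads(model, heads_to_tune, head_dim=128):
    """Freeze everything, then allow updates only to the rows of the
    q/k/v projections that belong to the selected (layer, head) pairs.
    Assumes one k/v head per query head (no grouped-query attention)."""
    for p in model.parameters():
        p.requires_grad_(False)

    by_layer = {}
    for layer, head in heads_to_tune:
        by_layer.setdefault(layer, []).append(head)

    for layer, heads in by_layer.items():
        attn = model.model.layers[layer].self_attn
        for proj in (attn.q_proj, attn.k_proj, attn.v_proj):
            mask = torch.zeros_like(proj.weight)
            for h in heads:  # rows h*head_dim:(h+1)*head_dim belong to head h
                mask[h * head_dim:(h + 1) * head_dim, :] = 1.0
            proj.weight.requires_grad_(True)
            # Zero out gradients everywhere except the selected heads' rows.
            proj.weight.register_hook(lambda g, m=mask: g * m)
```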
Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews
Motivated by a disproportionate increase in adjectives such as “intricate” & “meticulous” in peer reviews, the authors devised a statistical parameter-inference approach to estimate the ratio of AI-generated text at prominent deep learning conferences and in Nature portfolio journals. As suspected, they estimated that 10.6% of ICLR 2024 review sentences and 16.9% of EMNLP 2023 review sentences are substantially AI-generated.
The paper starts by explaining the extreme difficulty of identifying LLM-generated content instance by instance. Techniques like zero-shot LLM detection, fine-tuned detectors of LLM-generated content, or LLM watermarking classify each document as LLM-generated or not. However, such detectors have been shown to perform close to random predictors, and watermarking reduces the coherence of LLM generations.
Alternatively, the authors propose estimating the ratio of AI-generated sentences/documents in a corpus rather than classifying them at the instance level. They assume a corpus is generated from a mixture of the distributions P (human-generated) and Q (LLM-generated), as in the figure below, with alpha as the ratio of LLM-generated content. The log-likelihood of any corpus can then be written as a function of alpha, as below.
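In symbols, for documents $x_1, \dots, x_N$, the mixture log-likelihood is:

$$\log \mathcal{L}(\alpha) = \sum_{i=1}^{N} \log\big((1-\alpha)\,P(x_i) + \alpha\,Q(x_i)\big)$$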
The authors estimated the true P and Q distributions from training corpora with a known ratio (alpha) of human- vs AI-generated content. Knowing P and Q, we can then estimate alpha for any corpus by maximum likelihood estimation (MLE). (In simple terms, sweep alpha until you find the value that best fits the given corpus.)
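Here is a minimal sketch of that estimation, assuming we already have per-document log-likelihoods under P and Q; a simple grid search over alpha stands in for a proper MLE routine:

```python
import numpy as np

def estimate_alpha(log_p: np.ndarray, log_q: np.ndarray) -> float:
    """log_p[i], log_q[i]: log-likelihood of document i under the human (P)
    and AI (Q) distributions, estimated beforehand from training corpora.
    Returns the mixture weight alpha maximizing
    sum_i log((1 - alpha) * P(x_i) + alpha * Q(x_i))."""
    grid = np.linspace(0.001, 0.999, 999)  # candidate alphas
    lls = [np.logaddexp(np.log1p(-a) + log_p, np.log(a) + log_q).sum()
           for a in grid]
    return float(grid[int(np.argmax(lls))])
```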
The entire process is depicted in the following figure:
1. Create human- and LLM-generated corpora.
2. Temporally split them into a training corpus and a validation corpus.
3. Estimate the true P and Q distributions from the training corpus.
4. Validate the estimated distributions on the validation corpus, where alpha is known.
5. Finally, estimate alpha on the corpus of interest, here ICLR 2024, NeurIPS 2023, and so on. This gives us the ratio of LLM-generated sentences/documents.
The paper also reports interesting correlations between the ratio of LLM-generated content and different aspects of peer review. The estimated fraction of LLM-generated text is higher in reviews that don’t have scholarly citations (reference effect) and in reviews from reviewers who are less likely to respond to author rebuttals (lower reply rate effect). The LLM-generated text also tends to be homogeneous, which reduces the value-add of multiple reviews per paper (homogenization effect).
The paper also has a nice summary box with all the main findings.