Jackson Petty  research@jacksonpetty.org
Department of Linguistics, New York University

Sjoerd van Steenkiste  svansteenkiste@google.com
Google Research

Tal Linzen  linzen@google.com
Google Research

Work done while a student researcher at Google.
Abstract
Large language models are increasingly trained on corpora containing both natural language and non-linguistic data like source code. Aside from aiding programming-related tasks, anecdotal evidence suggests that including code in pretraining corpora may improve performance on other, unrelated tasks, yet to date no work has been able to establish a causal connection by controlling the mixture of language and code data. Here we do just this. We pretrain language models on datasets which interleave natural language and code in two different settings: competitive, in which the total volume of data seen during pretraining is held constant; and additive, in which the volume of language data is held constant. We study how the pretraining mixture affects performance on (a) a diverse collection of tasks included in the BigBench benchmark, and (b) compositionality, measured by generalization accuracy on semantic parsing and syntactic transformations. We find that pretraining on higher proportions of code improves performance on compositional tasks involving structured output (like semantic parsing) and on mathematics. Conversely, increased code mixture can harm performance on other tasks, including tasks that require sensitivity to linguistic structure such as syntax or morphology, and tasks measuring real-world knowledge.
1 Introduction
Large language models (LLMs) are increasingly used not only as natural-language assistants, but also for programming. LLMs trained on corpora containing code in various programming languages are used as programming assistants capable of generating code from natural-language descriptions (Chen et al., 2021), translating code between programming languages (Lachaux et al., 2020), decompiling machine code back into human-readable source code (Hosseini & Dolan-Gavitt, 2022), repairing vulnerabilities in existing code (Pearce et al., 2022), and even acting as programming agents when paired with tools (Yang et al., 2024a). These use cases have motivated adding code to pretraining corpora (see, inter alia, Gemini Team et al. 2024; OpenAI et al. 2024; Anthropic AI 2024; Groeneveld et al. 2024).
Concomitant with the inclusion of code in pretraining corpora, the performance of LLMs on many tasks has improved. Relevant for our purposes, many of the best-performing models include code in their pretraining corpus (see, inter alia, Fu & Khot 2022; Ye & Durrett 2022; Ye et al. 2023; Zhang et al. 2023; Zhou et al. 2023; Kim et al. 2024; Ma et al. 2024; Yang et al. 2024b; Razeghi et al. 2024; Coda-Forno et al. 2024; Longpre et al. 2024). That models trained in part on code perform well on several non-programming benchmarks raises intriguing questions: Does pretraining on code confer an advantage on non-programming tasks? If so, given a fixed compute budget, how much data should be allocated to code instead of natural-language data?
Establishing a causal relationship between code pretraining and downstream performance is difficult. Earlier studies have tackled these questions by comparing off-the-shelf code and no-code models (see, inter alia, Kim et al. 2024; Coda-Forno et al. 2024). Such observational studies are limited by the design choices of model creators and by the availability of information about hyperparameters and training data; many of the models typically surveyed are proprietary, and their creators do not disclose this information. While pairs of open-source models differing only in their pretraining corpora do exist, such as Llama 2 & Code Llama (Touvron et al., 2023; Roziere et al., 2023) or Gemma & CodeGemma (Gemma Team et al., 2024; Google, 2024), they often come with two important caveats: first, the code variants of the models are derived by taking the non-code variants and conducting additional pretraining on code data, meaning the comparisons cannot control for total data volume; second, each pair treats the inclusion of code data as a binary variable, either present or absent, frustrating attempts to explore how changes in the amount of code influence downstream behavior.
We address these issues directly. We construct datasets that mix natural-language and source-code data at varying ratios, treating code inclusion as a continuous variable. We then pretrain language models of equal size on these parameterized datasets in two different experimental setups: a competitive setting where we keep the total volume of training data constant and vary the percentage allocated between code and natural language; and an additive setting where we keep the volume of language data constant and add additional amounts of code on top.
Previous work has found that augmenting training data with synthetic formal languages instantiating compositional patterns can improve compositional generalization (Papadimitriou & Jurafsky, 2023; Yao & Koller, 2024; Lindemann et al., 2024). Like formal languages, source code has a number of qualities which may aid models on seemingly unrelated tasks: it is highly structured, by virtue of its conformance to the syntax of the programming language it is written in; it is generally high quality, owing to the linting and bug-checking tools and programming methodologies employed by its authors; it has interpretable semantics, grounded in the functionality it describes; and, notably for compositionality, it contains repeated instances of identical arguments and functions (e.g., variable names and method signatures). Informed by these observations, we evaluate our trained models for compositional generalization by finetuning them on three compositional generalization benchmarks (COGS, COGS-vf, and English Passivization). We also measure their performance on a broad array of tasks from BigBench to see how code helps or hurts performance on unrelated domains.
We find that including code in a model’s pretraining corpus has noticeable impacts on its performance on downstream tasks, in varying directions. Higher code mixtures improve performance in arithmetic and compositionality in domains whose output has formal structure (like semantic parsing). Conversely, increased exposure to code can harm language model performance on purely-linguistic tasks and tasks involving factual knowledge. We conduct permutation tests to study the impact of pretraining on downstream tasks and show that code pretraining increases the variance on task performance while raising the performance on the upper-quartile of tasks.
2 Related Work
Earlier work has studied whether pretraining on code is beneficial for non-programming tasks. Observational studies have looked at the impact of code on downstream performance post hoc. Fu & Khot (2022) speculated that code pretraining is at least partially responsible for the improvement in capabilities between the -001 and -002 series of GPT-3(.5) models, specifically highlighting chain-of-thought reasoning, long-term dependency sensitivity, and “complex reasoning” as likely resulting from code pretraining. Yang et al. (2024b) provide a broad study of how code impacts language model capabilities, arguing that code improves complex reasoning and structured data understanding. Mueller et al. (2024) show that code pretraining improves generalization on syntax-sensitive in-context learning tasks. By contrast, Coda-Forno et al. (2024), in an observational study, conclude that code pretraining does not improve model performance on a benchmark of behavioral tasks motivated by cognitive psychology. Kim et al. (2024) show that code pretraining improves models’ entity-tracking capabilities.
Several experimental studies on the impact of code pretraining have also been conducted. Ma et al. (2024) attempt to verify the impact of code experimentally, comparing CodePanGu2.6, a 2.6B-parameter model trained on a mixture of natural-language and code data, to Zeng et al. (2021)'s 2.6B- and 13B-parameter PanGu models of the same architecture trained only on natural-language data. They conclude that code exposure, both during pretraining and instruction finetuning, is beneficial for performance on logical, legal, analogical, and scientific reasoning, and for chain-of-thought capabilities. Their experimental design, however, does not control for data volume: the PanGu comparison models are trained on fewer tokens than CodePanGu2.6. (There is some ambiguity in the way Ma et al. (2024) describe their dataset: first, they state that PanGu13 is trained on 1TB of data, but Zeng et al. (2021) report that it is trained on 100GB of data and that their far larger model is the one trained on 1TB; second, Ma et al. (2024) detail the individual data sources in GB but report the total dataset size in tokens, making it unclear whether their sampling strategy yields the reported total or adds the code data on top of the text data. In either case, Table 4 in Zeng et al. (2021) shows that the natural-language dataset used for the PanGu comparison models contains fewer tokens than CodePanGu's dataset.) The design also does not quite control for model and training hyperparameters: the models differ in the number of attention heads and use slightly different optimizer settings, differences which are magnified by the large gap in the number of training steps that follows from the difference in dataset size. Ma et al. (2024) also show that exposing models to code early in training can be helpful for some tasks. Longpre et al. (2024) show experimentally that removing code from a model's pretraining corpus harms performance on question answering in a number of different domains, though their experimental setup likewise does not control for data volume and, consequently, for other training hyperparameters sensitive to it.
3 Dataset Construction
To study how the amount of code in a language model's pretraining corpus impacts downstream performance, we construct datasets which interleave natural-language and code sequences. The ingredients for our datasets are the English portion of the Colossal Clean Crawled Corpus (C4; Raffel et al. 2023) and cleaned code from GitHub.
Each dataset, which we refer to as a ‘code mixture,’ is parameterized by a single value $m$ representing the percentage of code in the training data, under the assumption that the C4 dataset has been fully cleaned of any code data. The mixture relates the total number of tokens in the dataset, $N_{\text{total}}$, to the number of code and language tokens via

$$N_{\text{code}} = m \cdot N_{\text{total}}, \qquad N_{\text{lang}} = (1 - m) \cdot N_{\text{total}}.$$
We construct families of training datasets in two different settings: competitive, in which the total amount of data is held constant while $m$ varies, reducing the number of language tokens as the number of code tokens increases; and additive, in which the number of language tokens is held constant while the number of code tokens increases with $m$ (see fig. 1).
Competitive:
Here, $N_{\text{total}}$ is held constant while the code mixture $m$ varies. This means that models trained on higher code mixtures see proportionally fewer tokens of language data and proportionally more tokens of code data.
This setting provides the clearest way to quantify the marginal utility of training on code instead of language, since we control for the total volume of data seen and consequently for the total compute cost. However, the interpretability of results on mixtures with high values of $m$ may be diminished, since removing nearly all natural-language training data from a model's training corpus will lessen its ability to interpret and generate language; this in turn may greatly reduce its utility, even on code-related tasks, since the model will have far less ability to understand prompts or follow instructions. Additionally, the applicability of any results here to established pretraining setups may be limited by the fact that it will always be better in an absolute sense (and may be better in a compute-optimal sense) to train a model on more data rather than less (see, for instance, the conclusions of Hoffmann et al. 2022). Given this incentive, artificially limiting the amount of either code or language data provided to a model may not accurately reflect the considerations of model developers who, if they want to improve the code performance of a model, will simply add additional code data to the training corpus. To mitigate these issues, we also consider a second setting:
Additive:
Here, $N_{\text{lang}}$ is held constant while $m$ varies. In order to keep $N_{\text{lang}}$ fixed while $m$ varies, we increase the total number of tokens proportionally:

$$N_{\text{total}} = \frac{N_{\text{lang}}}{1 - m}.$$
Since $N_{\text{total}}$ increases unboundedly in $m$, we limit our study to additive mixtures of at most 50% code, which have twice as many tokens as the 0% mixture (itself identical to the competitive 0% mixture). This setting guarantees that all models see the same amount of natural-language data, ameliorating the concern that any degradation in performance results from insufficient exposure to natural language, but at the cost of no longer controlling for total data volume or compute. To further ensure that we can adequately compare code and non-code models, we construct a language-only baseline dataset for each code mixture. These baseline datasets have the same total number of tokens as their corresponding code mixtures, but with all of those tokens drawn from natural language.
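To make the two settings concrete, the following minimal sketch (not the authors' code; the token volumes are hypothetical placeholders) shows how the language and code token counts follow from the mixture $m$ in each setting.

```python
# Illustrative sketch of the token accounting behind the two mixture families.
# The base volumes below are hypothetical placeholders, not the paper's values.

def competitive_mixture(n_total: int, m: float) -> tuple[int, int]:
    """Fixed total budget: code tokens displace language tokens."""
    n_code = round(n_total * m)
    return n_total - n_code, n_code          # (language tokens, code tokens)

def additive_mixture(n_lang: int, m: float) -> tuple[int, int]:
    """Fixed language budget: the total grows as n_lang / (1 - m)."""
    n_total = round(n_lang / (1.0 - m))
    return n_lang, n_total - n_lang          # (language tokens, code tokens)

# A 50% additive mixture has twice as many tokens as the 0% mixture:
print(additive_mixture(1_000_000, 0.5))      # (1000000, 1000000)
print(competitive_mixture(2_000_000, 0.5))   # (1000000, 1000000)
```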
4 Experimental Setup
4.1 Model Construction & Training
We use the datasets constructed in section 3 as pretraining corpora for causally-masked decoder-only transformer language models (Vaswani et al., 2017; Radford et al., 2019). We construct 12-layer decoder-only models with roughly parameters. Model hyperparameters were chosen following the methodology of Wang et al. (2022) to approximate decoder-only versions of T5-large. We pretrain these models with a base natural-language data volume of . This means that all models in the competitive setting were trained with a fixed $N_{\text{total}}$, while models in the additive setting were trained with a fixed $N_{\text{lang}}$ and hence with $N_{\text{total}}$ varying by mixture; we use a batch size of 128, so the number of training steps also varied with the mixture and setting. For each combination of code mixture and setting, we pretrain models from five different random seeds.
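As a rough illustration of how step counts follow from data volume, one can divide the total token budget by the tokens consumed per optimization step; the sequence length below is an assumed placeholder, since the paper's value is not reproduced here.

```python
# Hypothetical back-of-the-envelope relation between data volume and steps.
def num_training_steps(total_tokens: int, batch_size: int = 128,
                       seq_len: int = 1024) -> int:   # seq_len is an assumption
    return total_tokens // (batch_size * seq_len)     # tokens per step = batch * seq_len
```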
4.2 Evaluation
We measure performance on three compositional generalization benchmarks and, more generally, on BigBench tasks. For each evaluation domain, we quantify the impact that code pretraining has on performance by calculating lines of best fit between performance (e.g., generalization accuracy for the compositional generalization benchmarks or multiple-choice grade for BigBench multiple choice tasks) and code mixture.
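A minimal sketch of how such a per-task trend could be computed, assuming hypothetical (made-up) mixture and performance values and using SciPy's ordinary least-squares line fit:

```python
# Fit a line of task performance versus code mixture and keep its slope.
# The data below are invented for illustration only.
from scipy.stats import linregress

code_mixture = [0.0, 0.1, 0.25, 0.5, 0.75]      # hypothetical mixtures
performance  = [0.42, 0.45, 0.47, 0.52, 0.50]   # hypothetical task scores

fit = linregress(code_mixture, performance)
print(f"slope = {fit.slope:.3f}, intercept = {fit.intercept:.3f}")
```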
4.2.1 Compositional Generalization
Compositional generalization is a measure of how well a learner can generate and interpret novel, licit combinations of previously learned primitive pieces. Originally invoked to describe the human linguistic faculty, such as the ability of speakers to produce and understand an infinite number of novel, grammatical sentences, compositionality is also a relevant property of many formal systems, like mathematics or programming languages. We hypothesize that the presence of source code in pretraining data may aid models in making this kind of generalization, since source code often contains sequences in which a finite set of primitives (e.g., variable and method identifiers) are broadly combined.
To evaluate whether increased code mixture enables compositional generalization, we finetune our pretrained models on a suite of compositional generalization datasets: COGS (Kim & Linzen, 2020), a semantic parsing task in which natural-language sentences are transformed into a formal semantic representation; COGS-vf (Qiu etal., 2022), a variant of COGS which simplifies the output format; and English Passivization (Mueller etal., 2022), a natural-language transduction task in which synthetically generated active-voice sentences are transformed into passive variants. Each dataset contains training, validation, and generalization splits, where the generalization split is constructed to test licit-but-unattested combinations of familiar primitives. Table1 shows examples of the input and output sequences for each of the datasets.
| Dataset | Input | Output |
|---|---|---|
| COGS | A hedgehog ate the cake . | |
| COGS-vf | A hedgehog ate the cake on the bed . | |
| English Passivization | our vultures admired her walrus above some zebra . | her walrus above some zebra was admired by our vultures . |
COGS and COGS-vf both divide their generalization split into two parts based on generalization type: either lexical, in which a known primitive is used in a grammatical position it has not been seen in before (e.g., hedgehog in subject position, when it had only been seen during training as an object); or structural, in which a known grammatical structure is used in a novel position (e.g., a prepositional phrase such as on the mat modifying the subject, when in training such phrases only modified objects). Previous studies involving COGS and COGS-vf have found the structural generalization examples in COGS to be much harder than the lexical generalization examples. Reducing the complexity of the output form, as is done in COGS-vf, makes the structural tasks somewhat easier, though not easy. Petty etal. (2024) found that models of a comparable size could attain accuracies near 90% on the lexical generalization examples from COGS but near 0% on the structural examples; on COGS-vf, models were able to attain accuracies greater than 95% on lexical cases and 10% on structural cases.
For all compositional generalization datasets, we finetune models for a fixed number of steps and report the mean full-sequence accuracy (i.e., 1 if every autoregressively generated token is correct and 0 otherwise) over all examples in the generalization split for each random pretraining seed.
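A small sketch of this full-sequence accuracy metric (the token sequences here are stand-ins for a model's decoded outputs and the gold targets):

```python
# Full-sequence (exact-match) accuracy: an example counts as correct only if
# every generated token matches the reference.
def full_sequence_accuracy(predictions: list[list[str]],
                           references: list[list[str]]) -> float:
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return correct / len(references)

# Example: one of two sequences matches exactly, so accuracy is 0.5.
print(full_sequence_accuracy([["a", "b"], ["a", "c"]],
                             [["a", "b"], ["a", "d"]]))  # 0.5
```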
4.2.2 BigBench
We also evaluate models on BigBench (Srivastava et al., 2023), a benchmark of 204 diverse and challenging tasks presented in a common format. We evaluate models in a zero-shot setting, where a question is given in context (e.g., What is 697 times 205? from the 3-digit multiplication task) and the model must either generate the correct label (e.g., (a)) from a provided list of responses (for multiple-choice tasks) or generate the correct answer (for generative tasks). Since our focus is on the effect of code in pretraining on non-code tasks, we exclude from consideration tasks which are explicitly designed to test the capabilities of models at understanding or generating source code. Table 2 shows examples of the input and output sequences for the BigBench tasks we discuss in detail.
| Task | Input | Choices |
|---|---|---|
| bb-arithmetic | What is 68824 times 42716? | 9033448237, 3839424324, 18962582, 564059290599, banana, house, 2939885984 |
| bb-common-morpheme | What is the common morpheme among these words: pyre, empyrean, antipyretic, pyrotechnics? | fire, hot, oxygen, medicine |
| bb-fantasy-reasoning | Long ago you had sold your soul to the devil, but the postal service was so utterly bad that they had lost the package where your soul was. Since the transaction was completed before it, you have the benefits of the deal while the devil still has no control over you. Does the devil have any control over your soul now? | Yes, No |
| bb-general-knowledge | How many legs do horses have? | two, four, six, three, one, none |
| bb-implicatures | Does Speaker 2’s answer mean yes or no? Speaker 1: ‘But aren’t you afraid?’ Speaker 2: ‘Ma’am, sharks never attack anybody.’ | yes, no |
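One common way to score zero-shot multiple-choice items like those above is to pick the candidate answer the model scores most highly; the sketch below assumes a hypothetical `log_likelihood(question, choice)` helper standing in for a real model call, and is not the authors' evaluation code.

```python
# Hedged sketch of zero-shot multiple-choice scoring: choose the answer with
# the highest model log-likelihood given the question. `log_likelihood` is a
# hypothetical stand-in for an actual language-model scoring function.
def answer_multiple_choice(question: str, choices: list[str], log_likelihood) -> str:
    scores = [log_likelihood(question, choice) for choice in choices]
    best = max(range(len(choices)), key=lambda i: scores[i])
    return choices[best]
```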
5 Results
Code improves compositional generalization for structured outputs.
When we finetune on COGS and COGS-vf, where the output domain has a formal structure, we find that performance improves as the proportion of code increases in both the competitive and additive settings (see fig. 2 and table 3). The effect is most pronounced for the structural generalization examples from COGS-vf in the competitive and additive settings (see table 3 for the regression coefficients; the best-fit lines predict a sizable accuracy increase as the proportion of code grows), though all code-mixture models show a non-negative relationship between code mixture and generalization accuracy. Code helped the least on the structural generalization examples from COGS, where absolute performance remained near zero. In the additive setting, we find that code-mixture models perform as well as (on lexical generalization examples) or better than (on structural generalization examples) the equivalent language-only baseline models.
In order for models to generalize compositionally, two things must happen: first, models must correctly generalize the distribution of arguments and predicates to match the true-but-unseen patterns of composition (e.g., they must learn that syntactic objects become arguments to ‘theme’ for all primitives, even those previously seen only as subjects); and second, they must produce well-formed outputs. Kim & Linzen (2020, §G.2) note that Transformer models in particular often failed to produce syntactically well-formed logical expressions for the generalization examples in COGS. Since code has syntactic requirements similar to those of COGS logical expressions (e.g., well-balanced parentheses), the improvement we observe in generalization accuracy may be due to improvements in the well-formedness of outputs rather than to better compositional generalization. To test this hypothesis, we compute a very high-level measure of syntactic well-formedness for model outputs, namely whether the decoded logical forms have well-balanced parentheses, and examine how well-formedness varies with code mixture.
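A sketch of this coarse well-formedness check (balanced parentheses only; it does not attempt to validate the full COGS logical-form grammar):

```python
# Count a decoded logical form as well-formed only if its parentheses balance.
def has_balanced_parens(output: str) -> bool:
    depth = 0
    for ch in output:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # a closing parenthesis with no opener
                return False
    return depth == 0
```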
Figure 3 shows that exposure to code does not, in general, improve the well-formedness of generalization outputs. Only for structural generalization examples from COGS-vf in the additive setting is the regression coefficient appreciably positive; for all other code-mixture models, increased code mixture has a near-zero or negative impact on syntactic well-formedness (table 4). This indicates that the observed relationship between higher code mixture and generalization accuracy is attributable to models learning better generalizations for argument distribution rather than merely producing more well-formed outputs.
Code improves performance on arithmetic, up to a point.
On multiple-choice multi-digit arithmetic tasks from BigBench, increased code mixture has a generally positive impact on performance. In both competitive and additive settings, higher code mixture results in greater multiple-choice accuracy, with the impact growing more pronounced as the number of digits increases (see fig.4 and table6). In the competitive setting, performance peaks at a code mixture between 40% and 50% and thereafter tends to decrease, though the overall trend remains positive; this inverted-U shaped performance curve also grows more pronounced as the number of digits increases.
Code distracts from linguistic and world knowledge.
We also identify cases where increased exposure to code harms performance by looking for tasks whose performance is negatively correlated with code mixture. These tasks include ones which involve purely linguistic knowledge (such as the English Passivization compositional generalization task as well as the Implicatures and Common Morpheme BigBench tasks) as well as those which involve reasoning or world-knowledge (such as the General Knowledge and Fantasy Reasoning BigBench tasks).
Figure5 shows this negative trend on the English Passivization compositional generalization benchmark, where performance (as measured by mean full-sequence accuracy on the generalization split) decreases as code mixture increases in both the competitive and additive settings. Furthermore, in the additive setting the language-only baseline models outperform the code-mixture models. See table5 for exact regression coefficients.
These negative trends show that increased exposure to code during pretraining does not uniformly improve the ability of language models to generalize compositionally, independent of the output domain; whereas COGS and COGS-vf, whose output domains are formal logic expressions, benefit from increased code exposure, generalization tasks with natural-language output domains appear to lose any compositionality benefit conferred through code exposure. This may make intuitive sense, as decreased exposure to natural-language data (in either an absolute or a relative sense) may weaken the linguistically relevant inductive biases models need, in partial conflict with Mueller et al. (2024)'s finding that code pretraining aids syntax-sensitive generalization in in-context learning tasks.
We also find instances of BigBench tasks where code mixture is negatively correlated with performance; Figure6 highlights four such tasks where increased exposure to code during pretraining harms performance in both competitive and additive settings. See table7 for exact regression coefficients.
5.1 The impact of code in aggregate
The results presented above highlight particular cases where code mixture has a noticeable impact on performance, but how does code pretraining affect the remaining BigBench tasks? We want to know how code pretraining impacts performance in aggregate for two reasons. First, we want to know whether adding code helps in general: is it helpful or harmful for most tasks? Second, since any intervention is likely to leave models better at some tasks and worse at others than before, we want to confirm that the effects of code we observe are statistically significant rather than attributable to chance.
To answer this, we perform a permutation test on the slopes derived above from the best linear fits of task performance versus code mixture. We take the underlying performance-by-mixture data, shuffle the independent variable (code mixture) within each task, and recompute the slopes of the lines of best fit. Figure 7 shows the distribution of slopes for the observed (treatment) and counterfactual, permuted (control) data for both settings and metrics. For multiple-choice tasks in both settings, and for generative tasks in the competitive setting, the distribution of treatment slopes (i.e., those observed) is less concentrated around zero than the control distribution.
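A sketch of how the permuted (control) slopes could be produced, with hypothetical per-task data structures; the within-task shuffle breaks any real association between mixture and performance:

```python
# Control condition: shuffle code mixtures within each task, then refit slopes.
import random
from scipy.stats import linregress

def control_slopes(per_task_data: dict[str, tuple[list[float], list[float]]],
                   seed: int = 0) -> list[float]:
    rng = random.Random(seed)
    slopes = []
    for mixtures, scores in per_task_data.values():
        permuted = list(mixtures)
        rng.shuffle(permuted)       # destroy the mixture-performance pairing
        slopes.append(linregress(permuted, scores).slope)
    return slopes
```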
To quantify the difference between these distributions, we compute several test statistics: the difference of means, as a measure of whether training on code improves task performance on average; the difference of variances, as a measure of whether training on code increases the variance of task performance; the difference of skews, as a measure of whether training on code shifts the distribution of task performance asymmetrically; and the differences in upper and lower quartiles, as a measure of whether training on code increases the model's performance on its best- and worst-performing tasks.
We then perform two-sided permutation tests against the null hypothesis that the treatment and control slopes are drawn from the same underlying distribution, by combining and randomly repartitioning the samples many times and recomputing each test statistic. We run this test independently for each setting (competitive and additive) and each BigBench question type: multiple choice (where performance is measured by multiple-choice grade) and generative (where performance is measured by BLEU).
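A sketch of one such permutation test, here for the difference-of-variances statistic; the number of resamples is an arbitrary placeholder, since the paper's value is not reproduced above:

```python
# Two-sided permutation test: pool treatment and control slopes, repeatedly
# repartition them at random, and compare the resampled statistic to the
# observed one. The resample count below is an arbitrary placeholder.
import random
import statistics

def variance_diff(a: list[float], b: list[float]) -> float:
    return statistics.pvariance(a) - statistics.pvariance(b)

def permutation_p_value(treatment: list[float], control: list[float],
                        n_resamples: int = 10_000, seed: int = 0) -> float:
    observed = variance_diff(treatment, control)
    pooled = treatment + control
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        a, b = pooled[:len(treatment)], pooled[len(treatment):]
        if abs(variance_diff(a, b)) >= abs(observed):
            extreme += 1
    return extreme / n_resamples
```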
Figure 8 shows the null distributions for each of the test statistics and the observed values for the multiple-choice questions in the competitive setting, along with the significance scores (p-values) for each statistic. We find statistically significant differences in variance and in the upper quartile, indicating that increased code exposure in pretraining does have strong benefits for some tasks, while it increases the variance in downstream task performance in general. The other statistics we measured were not statistically significant. Results are similar, in general, for the other conditions.
6 Discussion
We find that including code in a model's pretraining corpus influences its performance on downstream, non-code tasks. Adding code improves performance on compositional generalization tasks whose output domain is highly structured, akin to the syntactic constraints of source code. Exposure to code during pretraining also improves performance on arithmetic tasks, a trend which grows more pronounced as the number of digits in those arithmetic tasks increases. Conversely, we also find tasks where increased exposure to code harms model performance, such as compositional generalization tasks involving natural-language output or tasks involving linguistic or real-world knowledge. These trends appear both in a competitive setting, where increases in code data come at the expense of language data, and in an additive setting, where all models see a fixed amount of language data.
Despite the fact that code improves compositional generalization only in cases where the output domain is ‘code-like,’ we find that increased code exposure does not meaningfully improve the syntactic well-formedness of outputs in these cases; rather, the benefit conferred by code is to allow models to better learn the correct generalization for the distribution of arguments. We hypothesize that the deleterious impact of code on tasks involving linguistic or real-world knowledge comes from a reduction in linguistically-relevant inductive biases as models see less natural language data (either in an absolute sense in the competitive setting or a relative sense in the additive setting).
We conduct permutation tests on the distributions of per-task trend lines of performance by code mixture to quantify the impact that code has on performance. We find that, in aggregate, training on code significantly increases the variance of performance across BigBench tasks while raising performance on the upper quartile of tasks.
6.1 Limitations and Future Work
Scale
We survey relatively small models ( parameters), which limits our ability to establish how code pretraining affects capabilities which require models at the multi-billion parameter scale, like instruction following and advanced in-context learning. We also only consider pretraining corpora of between and tokens.
Data Sources
We treat ‘code’ and ‘language’ as monolithic and disjoint data sources, but in reality source code contains linguistic data in the form of comments, while natural-language datasets may contain code-like structures even after cleaning and curation. It is possible that effect sizes would be larger with a more thorough separation of code and language data.
Task Limitations
We study a small set of tasks and evaluation modalities (fine-tuning on compositional generalization benchmarks and zero-shot performance on assorted BigBench tasks). Code pretraining may have impacts on other tasks, and those impacts may differ between fine-tuning, zero-shot, and multi-shot in-context learning.
References
- Anthropic AI (2024)Anthropic AI.The Claude 3 Model Family: Opus, Sonnet, Haiku, 2024.
- Chen etal. (2021)Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, HenriquePondedeOliveiraPinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph,Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, HeidyKhlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder,Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, ClemensWinter, Philippe Tillet, FelipePetroski Such, Dave Cummings, MatthiasPlappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss,WilliamHebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, IgorBabuschkin, Suchir Balaji, Shantanu Jain, William Saunders, ChristopherHesse, AndrewN. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa,Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, PeterWelinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, andWojciech Zaremba.Evaluating large language models trained on code, 2021.
- Coda-Forno etal. (2024)Julian Coda-Forno, Marcel Binz, JaneX. Wang, and Eric Schulz.Cogbench: a large language model walks into a psychology lab, 2024.URL https://arxiv.org/abs/2402.18225.
- Fu & Khot (2022) Yao Fu, Hao Peng, and Tushar Khot. How does GPT obtain its ability? Tracing emergent abilities of language models to their sources. Yao Fu’s Notion, Dec 2022. URL https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1.
- Gemini Team et al. (2024) Gemini Team, Machel Reid, Nikolay Savinov, Denis Teplyashin, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024.
- Gemma Team etal. (2024)Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, SuryaBhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, MihirSanjayKale, Juliette Love, Pouya Tafti, Léonard Hussenot, PierGiuseppe Sessa,Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros,Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, AntoniaPaterson, Beth Tsai, Bobak Shahriari, CharlineLe Lan, ChristopherA.Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid,Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker,George-Christian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, IanTenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski,Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy Chen, JohanFerret, Justin Chiu, Justin Mao-Jones, Katherine Lee, Kathy Yu, KatieMillican, LarsLowe Sjoesund, Lisa Lee, Lucas Dixon, Machel Reid, MaciejMikuła, Mateo Wirth, Michael Sharman, Nikolai Chinaev, Nithum Thain, OlivierBachem, Oscar Chang, Oscar Wahltinez, Paige Bailey, Paul Michel, Petko Yotov,Rahma Chaabouni, Ramona Comanescu, Reena Jana, Rohan Anil, Ross McIlroy,Ruibo Liu, Ryan Mullins, SamuelL Smith, Sebastian Borgeaud, Sertan Girgin,Sholto Douglas, Shree Pandya, Siamak Shakeri, Soham De, Ted Klimenko, TomHennigan, Vlad Feinberg, Wojciech Stokowiec, Yuhui Chen, Zafarali Ahmed,Zhitao Gong, Tris Warkentin, Ludovic Peran, Minh Giang, Clément Farabet,Oriol Vinyals, Jeff Dean, Koray Kavukcuoglu, Demis Hassabis, ZoubinGhahramani, Douglas Eck, Joelle Barral, Fernando Pereira, Eli Collins, ArmandJoulin, Noah Fiedel, Evan Senter, Alek Andreev, and Kathleen Kenealy.Gemma: Open models based on gemini research and technology, 2024.
- Google (2024)Google.Codegemma: Open code models based on gemma.https://storage.googleapis.com/deepmind-media/gemma/codegemma_report.pdf,2024.
- Groeneveld etal. (2024)Dirk Groeneveld, IzBeltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, OyvindTafjord, AnanyaHarsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, ShaneArora, David Atkinson, Russell Authur, KhyathiRaghavi Chandu, Arman Cohan,Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, WilliamMerrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam,MatthewE. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk,Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, MitchellWortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer,Jesse Dodge, Kyle Lo, Luca Soldaini, NoahA. Smith, and Hannaneh Hajishirzi.Olmo: Accelerating the science of language models, 2024.
- Hoffmann etal. (2022)Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, TrevorCai, Eliza Rutherford, Diego deLasCasas, LisaAnne Hendricks, JohannesWelbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George vandenDriessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, ErichElsen, JackW. Rae, Oriol Vinyals, and Laurent Sifre.Training compute-optimal large language models, 2022.URL https://arxiv.org/abs/2203.15556.
- Hosseini & Dolan-Gavitt (2022) Iman Hosseini and Brendan Dolan-Gavitt. Beyond the C: Retargetable decompilation using neural machine translation. In Proceedings 2022 Workshop on Binary Analysis Research, BAR 2022. Internet Society, 2022. doi: 10.14722/bar.2022.23009. URL http://dx.doi.org/10.14722/bar.2022.23009.
- Kim & Linzen (2020)Najoung Kim and Tal Linzen.COGS: A compositional generalization challenge based on semanticinterpretation.In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.),Proceedings of the 2020 Conference on Empirical Methods in NaturalLanguage Processing (EMNLP), pp. 9087–9105, Online, November 2020.Association for Computational Linguistics.doi: 10.18653/v1/2020.emnlp-main.731.URL https://aclanthology.org/2020.emnlp-main.731.
- Kim etal. (2024)Najoung Kim, Sebastian Schuster, and Shubham Toshniwal.Code pretraining improves entity tracking abilities of languagemodels, 2024.
- Lachaux etal. (2020)Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, and Guillaume Lample.Unsupervised translation of programming languages, 2020.
- Lindemann etal. (2024)Matthias Lindemann, Alexander Koller, and Ivan Titov.Strengthening structural inductive biases by pre-training to performsyntactic transformations, 2024.URL https://arxiv.org/abs/2407.04543.
- Longpre etal. (2024)Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, BarretZoph, Denny Zhou, Jason Wei, Kevin Robinson, David Mimno, and DaphneIppolito.A pretrainer’s guide to training data: Measuring the effects ofdata age, domain coverage, quality, & toxicity.In Kevin Duh, Helena Gomez, and Steven Bethard (eds.),Proceedings of the 2024 Conference of the North American Chapter of theAssociation for Computational Linguistics: Human Language Technologies(Volume 1: Long Papers), pp. 3245–3276, Mexico City, Mexico, June 2024.Association for Computational Linguistics.URL https://aclanthology.org/2024.naacl-long.179.
- Ma etal. (2024)Yingwei Ma, Yue Liu, Yue Yu, Yuanliang Zhang, YuJiang, Changjian Wang, andShanshan Li.At which training stage does code data help LLMs reasoning?In The Twelfth International Conference on LearningRepresentations, 2024.URL https://openreview.net/forum?id=KIPJKST4gw.
- Mueller etal. (2022)Aaron Mueller, Robert Frank, Tal Linzen, Luheng Wang, and Sebastian Schuster.Coloring the blank slate: Pre-training imparts a hierarchicalinductive bias to sequence-to-sequence models.In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.),Findings of the Association for Computational Linguistics: ACL 2022,pp. 1352–1368, Dublin, Ireland, May 2022. Association for ComputationalLinguistics.doi: 10.18653/v1/2022.findings-acl.106.URL https://aclanthology.org/2022.findings-acl.106.
- Mueller etal. (2024)Aaron Mueller, Albert Webson, Jackson Petty, and Tal Linzen.In-context learning generalizes, but not always robustly: The case ofsyntax.In Kevin Duh, Helena Gomez, and Steven Bethard (eds.),Proceedings of the 2024 Conference of the North American Chapter of theAssociation for Computational Linguistics: Human Language Technologies(Volume 1: Long Papers), pp. 4761–4779, Mexico City, Mexico, June 2024.Association for Computational Linguistics.URL https://aclanthology.org/2024.naacl-long.267.
- OpenAI et al. (2024) OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, et al. GPT-4 technical report, 2024.
- Papadimitriou & Jurafsky (2023)Isabel Papadimitriou and Dan Jurafsky.Injecting structural hints: Using language models to study inductivebiases in language learning.In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findingsof the Association for Computational Linguistics: EMNLP 2023, pp.8402–8413, Singapore, December 2023. Association for ComputationalLinguistics.doi: 10.18653/v1/2023.findings-emnlp.563.URL https://aclanthology.org/2023.findings-emnlp.563.
- Pearce etal. (2022)Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and BrendanDolan-Gavitt.Examining zero-shot vulnerability repair with large language models,2022.
- Petty etal. (2024)Jackson Petty, Sjoerd van Steenkiste, Ishita Dasgupta, Fei Sha, Dan Garrette,and Tal Linzen.The impact of depth on compositional generalization in transformerlanguage models, 2024.
- Qiu etal. (2022)Linlu Qiu, Peter Shaw, Panupong Pasupat, Pawel Nowak, Tal Linzen, Fei Sha, andKristina Toutanova.Improving compositional generalization with latent structure and dataaugmentation.In Marine Carpuat, Marie-Catherine deMarneffe, and IvanVladimirMezaRuiz (eds.), Proceedings of the 2022 Conference of the NorthAmerican Chapter of the Association for Computational Linguistics: HumanLanguage Technologies, pp. 4341–4362, Seattle, United States, July 2022.Association for Computational Linguistics.doi: 10.18653/v1/2022.naacl-main.323.URL https://aclanthology.org/2022.naacl-main.323.
- Radford etal. (2019)Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.Improving language understanding by generative pre-training, 2019.
- Raffel etal. (2023)Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, MichaelMatena, Yanqi Zhou, Wei Li, and PeterJ. Liu.Exploring the limits of transfer learning with a unified text-to-texttransformer, 2023.
- Razeghi et al. (2024) Yasaman Razeghi, Hamish Ivison, Sameer Singh, and Yanai Elazar. Backtracking mathematical reasoning of language models to the pretraining data. In The Second Tiny Papers Track at ICLR 2024, 2024. URL https://openreview.net/forum?id=otHhLO7GZj.
- Roziere etal. (2023)Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat,XiaoqingEllen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin,etal.Code llama: Open foundation models for code.arXiv preprint arXiv:2308.12950, 2023.
- Srivastava et al. (2023) Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models, 2023.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
- Wang et al. (2022) Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, and Colin Raffel. What language model architecture and pretraining objective work best for zero-shot generalization?, 2022.
- Yang et al. (2024a) John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering, 2024a.
- Yang et al. (2024b) Ke Yang, Jiateng Liu, John Wu, Chaoqi Yang, Yi R. Fung, Sha Li, Zixuan Huang, Xu Cao, Xingyao Wang, Yiquan Wang, Heng Ji, and Chengxiang Zhai. If LLM is the wizard, then code is the wand: A survey on how code empowers large language models to serve as intelligent agents, 2024b.
- Yao & Koller (2024) Yuekun Yao and Alexander Koller. Simple and effective data augmentation for compositional generalization, 2024. URL https://arxiv.org/abs/2401.09815.
- Ye & Durrett (2022) Xi Ye and Greg Durrett. The unreliability of explanations in few-shot prompting for textual reasoning, 2022.
- Ye et al. (2023) Xi Ye, Srinivasan Iyer, Asli Celikyilmaz, Ves Stoyanov, Greg Durrett, and Ramakanth Pasunuru. Complementary explanations for effective in-context learning, 2023.
- Zeng et al. (2021) Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang, Yi Liao, Zhiwei Wang, Xin Jiang, et al. PanGu-α: Large-scale autoregressive pretrained Chinese language models with auto-parallel computation, 2021. URL https://arxiv.org/abs/2104.12369.
- Zhang et al. (2023) Li Zhang, Liam Dugan, Hainiu Xu, and Chris Callison-Burch. Exploring the curious case of code prompts, 2023.
- Zhou et al. (2023) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models, 2023.
Appendix A Regression Coefficients
Table: Regression coefficients for COGS and COGS-vf generalization accuracy, broken down by generalization type (lexical vs. structural), setting (competitive vs. additive), and, for the additive setting, whether the baseline run is included.
Table: Regression coefficients for English Passivization, broken down by setting (competitive vs. additive) and, for the additive setting, whether the baseline run is included.
Table: Regression coefficients for BB Arithmetic JSON, broken down by number of digits (1–5) and setting (competitive vs. additive).
Table: Regression coefficients for BB Common Morpheme JSON, BB Fantasy Reasoning JSON, BB General Knowledge JSON, and BB Implicatures JSON, broken down by setting (competitive vs. additive).
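For illustration only, the sketch below shows one way per-condition coefficients of this kind could be computed, assuming an ordinary least-squares regression of task accuracy on the fraction of code in the pretraining mixture; the `fit_condition` helper, column names, and all data values are hypothetical and are not taken from the paper's analysis code.

```python
# Minimal sketch (not the paper's actual analysis code): ordinary least-squares
# fit of task accuracy on the fraction of code in the pretraining mixture.
# All data values below are hypothetical placeholders.
from scipy.stats import linregress

def fit_condition(code_fractions, accuracies):
    """Regress accuracy on code fraction for one dataset/setting condition."""
    result = linregress(code_fractions, accuracies)
    return {
        "slope": result.slope,          # change in accuracy per unit code fraction
        "intercept": result.intercept,  # predicted accuracy at 0% code
        "r_squared": result.rvalue ** 2,
        "p_value": result.pvalue,
    }

# Hypothetical example: five pretraining mixtures with an increasing code share.
code_fractions = [0.0, 0.1, 0.25, 0.5, 0.75]
accuracies = [0.62, 0.64, 0.66, 0.70, 0.71]
print(fit_condition(code_fractions, accuracies))
```

Fitting each dataset, generalization type, and setting separately in this way yields one row per condition, matching the layout of the tables above.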