Existing full text datasets of U.S. public domain newspapers do not recognize the often complex layouts of newspaper scans, and as a result the digitized content scrambles texts from articles, headlines, captions, advertisements, and other layout regions. OCR quality can also be low. This study develops a novel deep learning pipeline for extracting full article texts from newspaper images and applies it to the nearly 20 million scans in the Library of Congress’s public domain Chronicling America collection. The pipeline includes layout detection, legibility classification, custom OCR, and association of article texts spanning multiple bounding boxes. To achieve high scalability, it is built with efficient architectures designed for mobile phones. The resulting American Stories dataset provides high-quality data that could be used for pre-training a large language model to achieve better understanding of historical English and historical world knowledge. The dataset could also be added to the external database of a retrieval-augmented language model to make historical information - ranging from interpretations of political events to minutiae about the lives of people’s ancestors - more widely accessible. Furthermore, structured article texts facilitate using transformer-based methods for popular social science applications like topic classification, detection of reproduced content, and news story clustering. Finally, American Stories provides a massive silver-quality dataset for innovating multimodal layout analysis models and other multimodal applications.
This paper studies the impact of a well-functioning bureaucracy on the effectiveness of repression, in the context of Germany's Nazi regime. I compare former Prussian to non-Prussian municipalities within unified Germany in a regression discontinuity framework. When the Nazis persecuted the German Jews, Prussian areas implemented deportations of Jews more efficiently. During the Weimar Republic, when Jews were legally protected, violence against Jews was lower in former Prussian areas. In both periods, Prussian local governments had greater 'capacity': they were more effective at raising taxes and collecting trash. Capacity derived from greater specialization and better information processing rather than from effort. Specialization may have created the moral wiggle room to implement repugnant directives.
We use a dataset of the entire population of English Parliamentary enclosure acts between 1750 and 1830 to provide the first evidence of their impact. Parliamentary enclosure led to the systematic rationalization of traditional property rights. Exploiting a feature of the Parliamentary process that produced such legislation as a source of exogenous variation, we show that such enclosures were associated with significantly higher crop yields, but also higher land inequality. Our results are in line with a literature going back to Arthur Young and Karl Marx on the effects of Parliamentary enclosure on productivity and inequality. They do not support the argument that informal systems of governance, even in small, cohesive, and stable communities, were able to efficiently allocate commonly used and governed resources.
American Economic Review, Volume 113, Issue 10, Pages 2507-45, lead article
We test between cooperative and extractive theories of the origins of government. We use river shifts in southern Iraq as a natural experiment, in a new archeological panel dataset. A shift away creates a local demand for a government to coordinate because private river irrigation needs to be replaced with public canals. It disincentivizes local extraction as land is no longer productive without irrigation. Consistent with a cooperative theory of government, a river shift away led to state formation, canal construction, and the payment of tribute. We argue that the first governments coordinated between extended households which implemented public good provision.
Quarterly Journal of Economics, Volume 136, Issue 4, Pages 2093–2145
We examine the long-run economic impact of the Dissolution of the English monasteries in 1535, which is plausibly linked to the commercialization of agriculture and the location of the Industrial Revolution. Using monastic income at the parish level as our explanatory variable, we show that parishes which the Dissolution impacted more had more textile mills and employed a greater share of population outside agriculture, had more gentry and agricultural patent holders, and were more likely to be enclosed. Our results extend Tawney’s famous ‘rise of the gentry’ thesis by linking social change to the Industrial Revolution.
The Review of Economic Studies, Volume 88, Issue 2, Pages 730–763
This paper shows that the intensity of violence in Rwanda's recent past can be traced back to the initial establishment of its precolonial state. Villages that were brought under centralized rule one century earlier experience a doubling of violence during the state-organized 1994 genocide. Instrumental variable estimates exploiting differences in proximity to Nyanza, an early capital, suggest these effects are causal. In other periods, when the state faced rebel attacks, violence is lower where the state had been present longer. Using data from several sources, including a lab-in-the-field experiment across an abandoned historical boundary, I show that the effect of the historical state is primarily sustained by culturally transmitted norms of obedience. The persistent effect of the precolonial state interacts with government policy: where the state developed earlier, there is more violence when the Rwandan government mobilized for mass killing and less violence when the government pursued peace.
Short papers and invited submissions
Economica, Volume 89, Issue S1, Supplement: Centenary Issue: 1921 – 2021, Pages S137-S159
What is the impact of warfare on inequality and the social contract? Using local data on bombing, the evolution of wealth inequality, and vote shares for the Labour Party in Britain around World War II, we establish two results. First, on average, we find no impact of bombing on inequality. However, there is considerable heterogeneity, and this result is driven by southern Britain. In northern Britain, bombing led to significant falls in inequality. Second, heavier bombing led to a significant increase in the vote share for Labour after the War everywhere, but this effect is transitory in the south while it is permanent in the north. Our results obtain both in a simple difference-in-differences framework and in a panel-regression discontinuity framework in which we exploit the limited range of German fighter escort planes. Our results provide novel causal evidence for the inequality-reducing impact of warfare, and we interpret them as consistent with the notion that the impact of the War also led to a reconfiguration of the social contract in Britain.
Cliometrica, Volume 16, Issue 2, Pages 369-404
In the late 9th century, rural settlement, agriculture, and urbanization all collapsed in southern Mesopotamia. We first document this collapse using newly digitized archaeological data. We then present a model of hydraulic society that highlights the collapse of state capacity as a proximate cause of the collapse of the economy, and a shortened horizon of the ruler as a potential driver of the timing of the collapse. Using cross sections of tax collection data for 27 districts in southern Mesopotamia in 812, 846, and 918, we verify that the proximate cause of the crisis was the collapse in state capacity, which meant that the state no longer maintained the irrigation system. A particularly destructive succession struggle, shortening the investment horizon of rulers, determined the timing of the crisis.
In Carol Lancaster and Nicolas Van de Walle eds. Handbook on the Politics of Development, Oxford University Press.
In this paper we evaluate the impact of colonialism on development in Sub-Saharan Africa. In the world context, colonialism had very heterogeneous effects, operating through many mechanisms, sometimes encouraging development, sometimes retarding it. In the African case, however, this heterogeneity is muted, making an assessment of the average effect more interesting. We emphasize that to draw conclusions it is necessary not just to know what actually happened to development during the colonial period, but also to take a view on what might have happened without colonialism and to take into account the legacy of colonialism. We argue that in the light of plausible counterfactuals, colonialism probably had a uniformly negative effect on development in Africa. To develop this claim we distinguish between three sorts of colonies: (1) those which coincided with a pre-colonial centralized state, (2) those of white settlement, (3) the rest. Each has a distinct performance within the colonial period, different counterfactuals, and varied legacies.
We present the asymptotic variance-covariance matrix for M-estimators, and show how it can be used to compute spatial standard errors for a large number of commonly used (non-linear) estimators. We consider OLS, Logit, and Probit estimators, Poisson and Negative Binomial regressions, and the special Stata estimators areg and reghdfe. We provide Stata and Python software to implement our findings.
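To illustrate the kind of computation involved, the sketch below computes OLS coefficients with spatial standard errors using the usual sandwich form, with a kernel-weighted sum of score cross-products in the middle. It is not the software provided with the paper: the function name `conley_ols_se`, the uniform kernel, and Euclidean distance on planar coordinates are all assumptions made for this example.

```python
import numpy as np

def conley_ols_se(X, y, coords, cutoff):
    """OLS coefficients with spatial (Conley-type) standard errors.

    Assumptions of this sketch: planar coordinates, Euclidean distance,
    and a uniform kernel that weights every pair within `cutoff` equally.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    coords = np.asarray(coords, dtype=float)

    XtX_inv = np.linalg.inv(X.T @ X)      # "bread" of the sandwich
    beta = XtX_inv @ X.T @ y              # OLS point estimates
    resid = y - X @ beta

    scores = X * resid[:, None]           # per-observation score x_i * e_i

    # Pairwise distances between all observations.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    K = (dist <= cutoff).astype(float)    # uniform kernel weights

    meat = scores.T @ K @ scores          # sum_ij K_ij * s_i s_j'
    V = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(V))
```

With a cutoff of zero and distinct locations, only each observation's own score enters the middle term, so the result reduces to heteroskedasticity-robust (HC0) standard errors. A uniform kernel does not guarantee a positive semi-definite variance matrix, so a tapering kernel such as Bartlett may be preferable in practice.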
OCR History is a Python wrapper around Google Vision and Amazon Textract that allows for simple prototyping of document digitization. It supports preprocessing such as cropping, grayscale conversion, contrast/brightness adjustment, and splitting into subimages. It returns dataframes for tabular inputs.
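The preprocessing steps listed above can be sketched with the Pillow imaging library. This is only an illustration of the operations and does not show OCR History's own interface: the helper names `preprocess` and `split_into_subimages`, the default enhancement factors, and the synthetic blank page are assumptions made for the example.

```python
from PIL import Image, ImageEnhance, ImageOps

def preprocess(img, box, contrast=1.5, brightness=1.1):
    """Crop, convert to grayscale, then adjust contrast and brightness."""
    out = img.crop(box)                    # box = (left, upper, right, lower)
    out = ImageOps.grayscale(out)          # single-channel "L" image
    out = ImageEnhance.Contrast(out).enhance(contrast)
    out = ImageEnhance.Brightness(out).enhance(brightness)
    return out

def split_into_subimages(img, rows, cols):
    """Split an image into a rows x cols grid of tiles, in row-major order.

    Pixels left over when the size is not evenly divisible are dropped.
    """
    width, height = img.size
    tile_w, tile_h = width // cols, height // rows
    tiles = []
    for r in range(rows):
        for c in range(cols):
            left, upper = c * tile_w, r * tile_h
            tiles.append(img.crop((left, upper, left + tile_w, upper + tile_h)))
    return tiles

# Synthetic white page standing in for a scanned document.
page = Image.new("RGB", (400, 300), "white")
clean = preprocess(page, box=(20, 20, 380, 280))
tiles = split_into_subimages(clean, rows=2, cols=3)
```

Each tile could then be sent to an OCR service separately, which can help when a full page exceeds size limits or mixes several tables.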