
Dataset Card for Chronicling America: Historic American Newspapers 1770–1810

Dataset Summary

A dataset drawn from the Library of Congress Chronicling America digital collection, part of the National Digital Newspaper Program (NDNP). This dataset includes page-level records with images, OCR text, and publication metadata for newspapers published between 1770 and 1810. It provides a foundation for research, machine learning, and public history projects in honor of the 250th anniversary of the United States.

Dataset Description

For the initiative Revolution Crossroads, the Smithsonian Institution prepared this dataset using data, metadata, and digital objects publicly available from the Chronicling America digital collection.

The dataset consists of text and images from newspapers published in the United States between 1770 and 1810. The underlying digital collection was created by the Library of Congress as part of the National Digital Newspaper Program (NDNP), a partnership between the Library and the National Endowment for the Humanities (NEH).

The dataset includes:

  • Approximately 340,000 page-level records representing U.S. newspapers published between 1770 and 1810
  • Metadata fields exported from the Chronicling America API and converted to Parquet format for analysis
  • Links to digital surrogates (page images, thumbnails, PDFs, and OCR XML files) with persistent identifiers
  • Extracted text (OCR) for all records

This dataset was prepared to support research and experimentation at the intersection of cultural heritage and artificial intelligence. It provides a structured corpus for examining Revolutionary-era newspapers, testing OCR performance on historical typography, and developing tools for large-scale text analysis, visualization, and discovery.
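As a minimal sketch of getting started, the snippet below streams records with the Hugging Face `datasets` library and filters them by publication year. The repository ID is taken from the citation URL on this card; the `train` split name and the helper names `stream_pages` and `pages_in_year` are assumptions made for illustration, not part of the dataset.

```python
from typing import Any, Dict, Iterable, Iterator

# Repository ID as given in the citation URL on this card.
REPO_ID = "RevolutionCrossroads/loc_chronicling_america_1770-1810"


def stream_pages(split: str = "train"):
    """Stream page records without downloading the whole dataset.

    Assumption: the data is exposed under a split named "train".
    The import is local so the rest of this module works without
    the third-party `datasets` package installed.
    """
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=True)


def pages_in_year(rows: Iterable[Dict[str, Any]], year: int) -> Iterator[Dict[str, Any]]:
    """Yield rows whose issue_date (a YYYY-MM-DD string) falls in `year`."""
    prefix = f"{year:04d}-"
    for row in rows:
        issue_date = row.get("issue_date") or ""
        if issue_date.startswith(prefix):
            yield row


# Example usage (requires network access and the `datasets` package):
#     for row in pages_in_year(stream_pages(), 1809):
#         print(row["newspaper_title"], row["issue_date"])
#         break
```

Streaming avoids materializing all of the roughly 340,000 records locally, which may be preferable for a first look at the data.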

Dataset Details

  • Prepared by: Smithsonian Institution, Office of Digital & Innovation staff
  • Shared by: Revolution Crossroads
  • Language(s): English
  • License: Public domain

Dataset Sources

  • Repository: Chronicling America, Library of Congress
  • Program: National Digital Newspaper Program (NDNP), a partnership between the Library of Congress and the National Endowment for the Humanities

Curation Rationale

This dataset was prepared as part of the Revolution Crossroads project in recognition of the 250th anniversary of the United States. It makes a large-scale collection of Revolutionary-era newspapers available in a structured format to enable a wide range of applications, including:

  • Studying the circulation of news and ideas in the early United States
  • Exploring themes, language, and public opinion during the Revolutionary era and early republic
  • Evaluating OCR quality on historical typefaces and 18th-century print conventions (e.g., “long s” characters)
  • Developing retrieval and discovery tools for large-scale historical text corpora
  • Supporting genealogy, public history, and educational projects that use newspapers as sources

Dataset Creation

This dataset was assembled for Hugging Face by the Smithsonian Institution’s Office of Digital & Innovation staff as part of the Revolution Crossroads project. Data was accessed from the Chronicling America “data” backend (https://chroniclingamerica.loc.gov/data/) to avoid API rate limits and reduce load on Library of Congress servers.

Batch information containing issue lists was retrieved and filtered to the target date range (1770–1810). Issue metadata was then downloaded in MODSXML format, along with OCR text in ALTOXML format. These sources were combined and flattened so that each record corresponds to a single page of a newspaper issue, then stored in Parquet format for efficient analysis and hosting on Hugging Face.

Data Collection and Processing

Processing Steps

  • Accessed Chronicling America “data” backend (https://chroniclingamerica.loc.gov/data/) to retrieve batch information and issue lists
  • Filtered issues to the Revolutionary-era date range (1770–1810)
  • Downloaded issue metadata in MODSXML (bibliographic metadata) format and OCR text in ALTOXML (OCR text + structural information) format
  • Pulled additional newspaper title metadata (newspaper_title and place_of_publication) from the Library of Congress API, based on the lccn list from the filtered issues
  • Combined metadata and text into a single dataset keyed to issue/page identifiers
  • Flattened into one record per newspaper page
  • Stored in Parquet format for efficient analysis and hosting on Hugging Face
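The step of turning ALTO XML into plain page text can be sketched as follows. This is not the pipeline used to build the dataset; it is an illustrative, namespace-agnostic reader that assumes the common ALTO layout of `TextLine` elements containing `String` elements with a `CONTENT` attribute. The function name `alto_to_text`, the placeholder namespace, and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(alto_xml: str) -> str:
    """Join the CONTENT of String elements, one output line per TextLine."""
    root = ET.fromstring(alto_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        words = [
            child.attrib.get("CONTENT", "")
            for child in elem
            if local_name(child.tag) == "String"
        ]
        line = " ".join(word for word in words if word)
        if line:
            lines.append(line)
    return "\n".join(lines)


# Hypothetical miniature ALTO document; the namespace is a placeholder.
SAMPLE = """<alto xmlns="http://example.org/alto-placeholder">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="THE"/><SP/><String CONTENT="DELAWARE"/><SP/><String CONTENT="GAZETTE."/></TextLine>
    <TextLine><String CONTENT="VOL."/><SP/><String CONTENT="I."/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_to_text(SAMPLE))
```

Matching on local tag names rather than a fixed namespace keeps the sketch usable across ALTO schema versions, at the cost of ignoring layout details such as hyphenation markers.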

Supporting Files Available

XML-formatted data from the data backend are available in the Files tab:

  • "mods_xml_records" - contains the MODS XML source files for the records in the dataset, stored in one directory per newspaper LCCN (title series number). The file naming convention is "[lccn][date][edition_order].xml".
  • "ocr_alto_xml" - contains the ALTO XML source files for the records in the dataset, stored in one directory per newspaper LCCN (title series number). The file naming convention is "[date]_[edition_order].xml".

PDF versions of each page image are available in the Files tab:

  • "media" - contains the PDF source files for the records in the dataset, stored in one directory per newspaper LCCN (title series number). The file naming convention is "[issue_date][edition_order][page].pdf". Note that the JPEG 2000 files, available via the links in the dataset, are of a higher resolution than the PDFs.

Quality Considerations

  • OCR accuracy varies depending on print quality, typography, and page condition
  • Historical typesetting, e.g., “long s” characters resembling “f”, or "ligatured ct" resembling a "d" or an "f", may result in misread text
  • Metadata reflects the Library of Congress catalog at the time of export; because the Library updates its catalog on an ongoing basis, current records may differ
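One way to cope with the "long s" misreadings noted above is to expand a search term into its likely OCR variants before querying `ocr_text`. The sketch below is a rough heuristic, not a method used by the dataset authors: it assumes the long s appeared only in non-final positions and that OCR renders it as "f". The function name `long_s_variants` is hypothetical.

```python
from itertools import product
from typing import Set


def long_s_variants(word: str) -> Set[str]:
    """Return spellings of `word` with each non-final 's' read as 's' or 'f'."""
    last = len(word) - 1
    options = []
    for index, char in enumerate(word):
        if char == "s" and index != last:
            # A non-final lowercase s may have been printed as a long s
            # and misread by OCR as f.
            options.append(("s", "f"))
        else:
            options.append((char,))
    return {"".join(combo) for combo in product(*options)}


print(sorted(long_s_variants("Congress")))
```

The number of variants doubles with each non-final "s", so this is best applied to individual query terms rather than to whole passages.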

Dataset Structure

Each record in the dataset corresponds to a single newspaper page. Issue-level metadata (title, place, date, edition) is repeated across all pages of that issue.
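Because issue-level metadata is repeated on every page record, pages can be regrouped into issues by keying on that metadata. The following is a small sketch under stated assumptions: rows are plain dictionaries, an issue is identified by (lccn, issue_date, edition_order), and the page column is named `Page` as in the published Parquet schema (pass a different `page_key` if your copy uses `page`). The function name is hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple

IssueKey = Tuple[str, str, str]


def group_pages_by_issue(
    rows: Iterable[Dict[str, Any]], page_key: str = "Page"
) -> Dict[IssueKey, List[Dict[str, Any]]]:
    """Group page rows by issue and sort each issue's pages numerically."""
    issues: Dict[IssueKey, List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        key = (row["lccn"], row["issue_date"], row["edition_order"])
        issues[key].append(row)
    for pages in issues.values():
        # Page numbers are stored as strings, so convert before sorting.
        pages.sort(key=lambda row: int(row[page_key]))
    return dict(issues)


rows = [
    {"lccn": "sn82014385", "issue_date": "1809-07-08", "edition_order": "1", "Page": "2"},
    {"lccn": "sn82014385", "issue_date": "1809-07-08", "edition_order": "1", "Page": "1"},
    {"lccn": "sn82014385", "issue_date": "1809-07-12", "edition_order": "1", "Page": "1"},
]
issues = group_pages_by_issue(rows)
print({key: len(pages) for key, pages in issues.items()})
```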

Data Fields

Core Identifiers

  • Web_URL (string)
    URL to the newspaper issue or page in Chronicling America.
    Example: https://www.loc.gov/item/sn82016139/1776-07-15/ed-1/

  • lccn (string)
    Library of Congress Control Number for the newspaper title.
    Example: sn83045110

  • newspaper_title (string)
    Title of the newspaper.
    Example: The Pennsylvania Packet, and Daily Advertiser

Publication Metadata

  • place_of_publication (string)
    City and state of publication.
    Example: Philadelphia, Pennsylvania

  • issue_date (string)
    Date the issue was published.
    Example: 1777-05-14

  • edition_order (string)
    Edition number/order within the day. The default is 1. Higher numbers indicate additional versions of the same issue published or digitized on that date. These may result from different scans of the same microfilm, rescans from original print, or intentional duplicates to improve OCR quality.
    Example: 1
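Since higher edition numbers can represent rescans or duplicates of the same issue, analyses that count pages or words may want to keep only one edition per issue. The sketch below keeps the lowest edition number available for each (lccn, issue_date) pair; that selection rule is an assumption made for illustration, and the function names are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple


def preferred_editions(rows: Iterable[Dict[str, Any]]) -> Dict[Tuple[str, str], int]:
    """Map each (lccn, issue_date) to its lowest edition_order."""
    best: Dict[Tuple[str, str], int] = {}
    for row in rows:
        key = (row["lccn"], row["issue_date"])
        edition = int(row["edition_order"])  # stored as a string in the dataset
        if key not in best or edition < best[key]:
            best[key] = edition
    return best


def drop_extra_editions(rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the rows that belong to the preferred edition of each issue."""
    materialized = list(rows)
    best = preferred_editions(materialized)
    return [
        row
        for row in materialized
        if int(row["edition_order"]) == best[(row["lccn"], row["issue_date"])]
    ]


sample = [
    {"lccn": "sn82014385", "issue_date": "1809-07-08", "edition_order": "1"},
    {"lccn": "sn82014385", "issue_date": "1809-07-08", "edition_order": "2"},
    {"lccn": "sn82014385", "issue_date": "1809-07-12", "edition_order": "2"},
]
print(drop_extra_editions(sample))
```

An issue that exists only as edition 2 is retained, because the rule picks the lowest edition present rather than requiring edition 1.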

Page-Level Metadata

  • Page (string)
    Page number within the issue.
    Example: 1

Media Fields

  • thumbnail_url (string)
    URL to a thumbnail image of the page.
    Example:
    https://tile.loc.gov/image-services/iiif/service:sgp:sgpbatches:misctop:batch_dlc_misctopsn82016139_ver01:data:sn82016139:print:1776071501:0004/full/pct:6.25/0/default.jpg#h=318&w=202

  • jpeg2000_url (string)
    URL to the JPEG 2000 image of the page. These files are of a higher resolution than the PDFs.
    Example:
    https://tile.loc.gov/storage-services/service/ndnp/deu/batch_deu_kedavra_ver01/data/sn82014385/00271740232/1809071201/0079.jp2

  • pdf_url (string)
    URL to the PDF of the page.
    Example:
    https://tile.loc.gov/storage-services/service/ndnp/deu/batch_deu_kedavra_ver01/data/sn82014385/00271740232/1809071201/0079.pdf

  • ocr_url (string)
    URL to the ALTOXML OCR file for the page.
    Example:
    https://tile.loc.gov/storage-services/service/ndnp/deu/batch_deu_kedavra_ver01/data/sn82014385/00271740232/1809071201/0079.xml
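The thumbnail URL follows the IIIF Image API path layout, in which the last four path segments are region, size, rotation, and quality.format. Assuming the image server accepts other size values (not verified here), a larger derivative can be requested by rewriting the size segment. The helper name `with_iiif_size` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def with_iiif_size(url: str, size: str) -> str:
    """Return `url` with its IIIF size segment replaced and any fragment dropped."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    if len(segments) < 5:
        raise ValueError("URL does not look like an IIIF image request")
    # Path ends with .../{region}/{size}/{rotation}/{quality}.{format}
    segments[-3] = size
    return urlunsplit((parts.scheme, parts.netloc, "/".join(segments), parts.query, ""))


THUMBNAIL = (
    "https://tile.loc.gov/image-services/iiif/"
    "service:sgp:sgpbatches:misctop:batch_dlc_misctopsn82016139_ver01:"
    "data:sn82016139:print:1776071501:0004/full/pct:6.25/0/default.jpg#h=318&w=202"
)

print(with_iiif_size(THUMBNAIL, "pct:25"))
```

Dropping the `#h=...&w=...` fragment is deliberate: it describes the dimensions of the original thumbnail and would no longer be accurate after resizing.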

Extracted Text (OCR)

  • ocr_text (string)
    OCR text extracted from the page image. Present for all records.
    Example excerpt:
    "In CONGRESS, July 4, 1776. A DECLARATION by the Representatives of the UNITED STATES OF AMERICA..."

Record Creation

  • source_record_create_date (date32)
    Date of catalog record creation.
    Example:
    2017-05-24
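The identifier fields can be cross-checked against the page URL. The sketch below parses URLs of the form `https://www.loc.gov/resource/<lccn>/<date>/ed-<n>/?sp=<page>` (the shape seen in the dataset's sample rows) as well as the issue-level `/item/` form from the example above. The pattern is inferred from those examples rather than from any published URL specification, and `parse_web_url` is a hypothetical helper.

```python
import re
from typing import Dict, Optional

# Pattern inferred from example URLs; other URL shapes may exist.
PAGE_URL = re.compile(
    r"^https://www\.loc\.gov/(?:resource|item)/(?P<lccn>[^/]+)/"
    r"(?P<issue_date>\d{4}-\d{2}-\d{2})/ed-(?P<edition_order>\d+)/"
    r"(?:\?sp=(?P<page>\d+))?$"
)


def parse_web_url(url: str) -> Optional[Dict[str, Optional[str]]]:
    """Split a loc.gov issue or page URL into its identifier parts.

    Returns None when the URL does not match the assumed pattern.
    The "page" value is None for issue-level URLs without ?sp=.
    """
    match = PAGE_URL.match(url)
    if match is None:
        return None
    return match.groupdict()


print(parse_web_url("https://www.loc.gov/resource/sn82014385/1809-07-08/ed-1/?sp=1"))
```

Comparing the parsed values with the `lccn`, `issue_date`, `edition_order`, and page columns of the same row is a cheap sanity check after any join or reshaping step.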

Source Data

Collection

The dataset is drawn from the Chronicling America Historic American Newspapers digital collection from the Library of Congress, which contains images, full text (OCR), and metadata for over 23 million digitized public domain newspaper pages originally published in the United States from the 1700s through 1963. Chronicling America is a product of the National Digital Newspaper Program (NDNP), a partnership between the National Endowment for the Humanities (NEH) and the Library of Congress.

Program

The Chronicling America Historic American Newspapers collection provides access to select digitized newspaper pages produced by the National Digital Newspaper Program (NDNP), a partnership between the National Endowment for the Humanities (NEH) and the Library of Congress. As part of the program, cultural heritage institutions apply for and receive awards from an NEH award program to select and digitize newspaper pages representing the history, geographic coverage, and events of note for their state or territory. Visit the Library of Congress' NDNP website for more information on program guidelines. As part of its role in the NDNP, the Library of Congress also contributes digitized newspaper pages from its own collections to Chronicling America.

Historical Background

Newspapers are frequently cited as being the “first draft” of U.S. history, capturing the day-to-day life of a community better than any other published record. With the goal of preserving this important format, from 1982 through 2011, the Library of Congress and the National Endowment for the Humanities collaborated to fund and manage the United States Newspaper Program (USNP), a highly successful effort to locate, catalog, and preserve newspapers published throughout the United States. USNP projects were established and funded in each state and territory to survey every possible repository in an attempt to locate extant issues of every newspaper, inventory and catalog those titles in a national database, and preserve endangered files on microfilm following national and international preservation standards.

In 2003, the NEH and the Library embarked on another partnership, the National Digital Newspaper Program (NDNP). The NDNP is a long-term effort that builds on the legacy of the USNP by increasing access through digitization to the valuable information generated by the USNP. The NDNP provides an opportunity for institutions to select and contribute digitized newspaper content to a freely accessible, national newspaper resource, Chronicling America.

Original Format

The newspapers in Chronicling America were originally published on newsprint; however, NDNP participants primarily scan from the 35mm negative microfilm on which the newsprint was captured. Microfilm standards vary, but many historical newspapers in this dataset were microfilmed under the auspices of the United States Newspaper Program (USNP), 1982–2011. While the technical approach for NDNP digitization is primarily based on images scanned from negative microfilm, some portion of NDNP batches may consist of images scanned from the original newsprint, positive microfilm, microfiche, or preservation facsimile copies, provided that technical specifications are followed. The original format type is specified in the metadata.

Source Data Producers

The source data (historical newspapers) was produced by people affiliated with the newspapers (reporters, editors, contributors, etc.).

Rights

The Library of Congress believes that the newspapers in Chronicling America are in the public domain or have no known copyright restrictions. Newspapers published in the United States more than 95 years ago are in the public domain in their entirety. Responsibility for making an independent legal assessment of an item and securing any necessary permissions ultimately rests with persons desiring to use the item.

For more information on Library of Congress policies and disclaimers regarding rights and reproductions, see https://www.loc.gov/legal/

Personal and Sensitive Information

No known personal or sensitive information is included beyond what appeared in publicly circulated newspapers of the time. Users should be aware that newspapers may contain outdated or offensive terminology reflecting the period in which they were created.

Considerations for Using the Data

Risks and Limitations

The dataset contains historical materials with language that does not always match the language preferred by members of the communities depicted. It may include negative stereotypes or words that offend. These materials reflect the views of their creators, not the Library of Congress or the United States government. Historical materials may also contain factual errors and should be understood in the context of their particular time and place.

Machine-readable text in the dataset was created using Optical Character Recognition (OCR). OCR is an automated process that converts the visual image of text into machine-readable characters, allowing computer software to search for words, phrases, numbers, or other elements. While OCR is a powerful tool, errors in the process are unavoidable and are present throughout the data, particularly when source images are degraded or typography is unusual.

For content published before 1810, historical typesetting, e.g., “long s” characters resembling “f”, or "ligatured ct" resembling a "d" or an "f", may result in misread text.

Metadata and digitized content are updated by the Library of Congress on an ongoing basis.

Recommendations

Researchers should validate extracted information against images when accuracy is critical, and be cautious about treating OCR text as complete or authoritative. Consultation of the Chronicling America collection is recommended for the most up-to-date records.
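To decide which pages most need checking against the images, a crude quality score can help rank them. The sketch below computes the share of whitespace-separated tokens that look like ordinary words. It is a rough heuristic offered for illustration only: it penalizes legitimate tokens such as numbers, single-letter words, and hyphenated or accented words, and the function name and pattern are hypothetical.

```python
import re

# A token counts as "word-like" if it is two or more ASCII letters,
# optionally followed by one common punctuation mark.
WORDLIKE = re.compile(r"^[A-Za-z]{2,}[.,;:!?]?$")


def wordlike_ratio(text: str) -> float:
    """Return the fraction of tokens in `text` that look like plain words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if WORDLIKE.match(token))
    return hits / len(tokens)


clean = "The price shall be four dollars per annum"
noisy = "V» £ v* * ,v. ÇNO, 2,"
print(wordlike_ratio(clean), wordlike_ratio(noisy))
```

Pages with a low score are candidates for manual review against the JPEG 2000 or PDF images; a high score does not guarantee that the text is correct.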

Additional Information

Citation Information

BibTeX

@dataset{RevolutionCrossroads_LOC_2025,
  author = {Revolution Crossroads Project Team},
  title = {Chronicling America: Historic American Newspapers 1770–1810},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/RevolutionCrossroads/loc_chronicling_america_1770-1810},
  doi = {10.57967/hf/6526}
}

APA

Revolution Crossroads Project Team. (2025). Chronicling America: Historic American Newspapers 1770–1810 [Data set]. Hugging Face. https://doi.org/10.57967/hf/6526

Glossary

  • LCCN: Library of Congress Control Number, a unique identifier for newspaper titles
  • OCR (Optical Character Recognition): Machine-generated text created from scanned page images. Present for all records in this dataset
  • NDNP (National Digital Newspaper Program): A partnership between the Library of Congress and NEH to digitize historic newspapers
  • USNP (U.S. Newspaper Program): A precursor to NDNP, conducted 1982–2011 to locate, catalog, and preserve newspapers on microfilm nationwide
  • MODSXML: Metadata Object Description Schema, a Library of Congress XML format for bibliographic metadata (used for issue-level information in this dataset)
  • ALTOXML: An XML schema for storing OCR text and layout information (used for page-level OCR in this dataset)
  • Issue: A single publication of a newspaper on a given date, often containing multiple pages. In this dataset, issue-level metadata (title, place, date, edition) is repeated for each page of that issue.

Dataset Card Contact

revolutioncrossroads@si.edu
