Research Document Archive

Computational analysis of 234,630 declassified U.S. government documents

ML Pipeline Results

Documents:         234.6K
Pages OCR'd:       3.2M
Named Entities:    31.0M
Entity Links:      2.9M
Redactions Found:  59,830
Topic Clusters:    288

Classification Stamps Detected

UNCLASSIFIED:  16,501
SECRET:        13,736
CLASSIFIED:    10,730
EXEMPT:         6,739
CONFIDENTIAL:   5,554
RESTRICTED:     4,722

Document Collections

Collection                    Docs        Pages       Size
House Resolutions          181,092    2,719,832    34.2 GB
JFK Assassination Records   35,979      241,860    22.5 GB
CIA Stargate Program        13,937      100,056     5.4 GB
CIA MKUltra                  1,936       64,244     3.4 GB
CIA Declassified             1,605       29,744     2.4 GB
DOJ Disclosures                 60           60     3.9 MB
Lincoln Archives                21        9,330   962.9 MB
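
The per-collection figures can be cross-checked against the headline totals. The following is a minimal sketch, not part of the archive's own pipeline: the `collections` dictionary and variable names are illustrative assumptions, and the counts are copied by hand from the listing above.

```python
# Per-collection (documents, pages) counts, transcribed from the
# Document Collections listing. The structure and names here are
# illustrative, not taken from the archive's code.
collections = {
    "House Resolutions": (181_092, 2_719_832),
    "JFK Assassination Records": (35_979, 241_860),
    "CIA Stargate Program": (13_937, 100_056),
    "CIA MKUltra": (1_936, 64_244),
    "CIA Declassified": (1_605, 29_744),
    "DOJ Disclosures": (60, 60),
    "Lincoln Archives": (21, 9_330),
}

# Sum each column across all collections.
total_docs = sum(docs for docs, _ in collections.values())
total_pages = sum(pages for _, pages in collections.values())

print(f"documents: {total_docs:,}")  # documents: 234,630
print(f"pages:     {total_pages:,}")  # pages:     3,165,126
```

The document counts sum to 234,630, matching the headline figure exactly. The page counts sum to 3,165,126, which is consistent with the rounded 3.2M shown in the pipeline results.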

GitHub | HuggingFace | Kaggle
234K docs / 3.1M pages / 30.9M entities / 13-step ML pipeline