Turn decades of morgue files, clipping envelopes, and back-issue archives into a searchable investigative research tool — find historical precedents, track story threads, and surface context that changes today's coverage.
Newsroom morgue files — the envelopes of physical clippings, contact sheets, and annotated tearsheets that newspapers kept for decades before digital — are among the most research-dense archives in existence. Each envelope is a curated collection of everything the newsroom knew about a beat, source, or story thread. Today they sit in filing cabinets, too valuable to discard but too labor-intensive to access. When an investigative team needs historical context on a story, they either call the morgue librarian (if one still exists) or write without the institutional memory that used to be standard.
Historical precedent transforms good journalism into great journalism. Knowing that a company was cited for safety violations in 1987, that a politician's position shifted after a 1994 campaign donation, or that a neighborhood's flooding problem was documented in 1963: this context comes from morgue files. Losing access to institutional memory because it's in paper envelopes doesn't just inconvenience reporters; it produces coverage that misses the long view that makes investigative journalism consequential.
Batch processing via Google Drive — scan clipping envelopes into shared folders and have them processed automatically overnight without individual uploads
Content classification automatically separates wire service clips, staff-written articles, syndicated columns, and display advertisements within the same envelope scan
Semantic search finds historical precedents by concept — 'waterfront development disputes' surfaces all relevant clippings even if they use different terminology
RAG chat (AI Librarian) answers research questions across the full morgue archive with source citations — like having a veteran researcher on call
Cross-publication search when multiple papers' archives are combined — find how the same story was covered regionally or nationally across different mastheads
Searchable PDF export preserves original clipping images with OCR text layer, maintaining visual evidence of original publication context
Regional newspaper digitizing 60 years of morgue clipping envelopes covering local government, business, and crime beats for investigative team access
Journalism school processing the donated archive of a defunct city paper to create a research resource for faculty and graduate students
Documentary production company digitizing historical footage logs, print archives, and research files for a long-form investigative series
National news organization making its pre-digital back-issue archive searchable for its own reporters before a major anniversary project
News organizations with active investigative teams typically operate on the Professional tier ($149/month) for ongoing research access; organizations with large legacy archives undergoing bulk digitization programs typically use the Institution tier ($499/month).
ArchiveLM is in private beta. We review each request and typically respond within 1–3 business days.
Request Beta Access
Approved accounts receive hands-on onboarding support to validate results on your own documents.
Each page is processed as a separate document. Full newspaper pages go through the layout-aware multi-column pipeline. Individual clippings (smaller scans) are processed through the document pipeline. Handwritten annotations are harder: the OCR model will attempt to transcribe them, but accuracy on handwriting is lower than on typeset text. Annotations are included in the extraction but may require manual review.
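To make the page-versus-clipping split concrete, here is a minimal sketch of how a scan might be routed to one pipeline or the other. The actual routing criteria are not stated above; this sketch assumes a simple physical-size threshold, and the `Scan` type, the `choose_pipeline` function, the pipeline labels, and the threshold values are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Assumed threshold (hypothetical): a scan whose short side is at least 10 in
# and whose long side is at least 15 in is treated as a full newspaper page.
FULL_PAGE_MIN_SHORT_IN = 10.0
FULL_PAGE_MIN_LONG_IN = 15.0


@dataclass
class Scan:
    name: str
    width_px: int
    height_px: int
    dpi: int


def choose_pipeline(scan: Scan) -> str:
    """Return a pipeline label based on the scan's physical size in inches."""
    width_in = scan.width_px / scan.dpi
    height_in = scan.height_px / scan.dpi
    # Sort so that orientation (portrait or landscape) does not matter.
    short_side, long_side = sorted((width_in, height_in))
    if short_side >= FULL_PAGE_MIN_SHORT_IN and long_side >= FULL_PAGE_MIN_LONG_IN:
        return "layout_aware_multi_column"
    return "document"


full_page = Scan("front_page.tif", width_px=3600, height_px=6600, dpi=300)
clipping = Scan("envelope_clip.tif", width_px=900, height_px=1500, dpi=300)
print(choose_pipeline(full_page))
print(choose_pipeline(clipping))
```

A size threshold is only one plausible heuristic; a production system could just as well inspect column structure or let the operator tag scans at upload time.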
Yes. By default, all uploaded archives are private to your account. Public portal access is an opt-in feature per document. For organizations that want multiple team members to access the same archive, Institution tier accounts support team access under a single subscription.
Keyword search finds exact words or phrases. Semantic search finds documents by meaning — it uses AI vector embeddings to understand that 'police misconduct' and 'use of force' and 'officer-involved shooting' are related concepts, and retrieves relevant articles even when they don't share your exact search terms. For investigative research, this is the difference between finding 20 results and finding 200 contextually relevant ones.
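The contrast between the two search modes can be sketched in a few lines of Python. This is an illustration of the general technique (cosine similarity over vectors), not ArchiveLM's implementation. The three-dimensional vectors are hand-assigned toy values standing in for what an embedding model would produce, and the function names and the 0.9 threshold are assumptions.

```python
import math

# Toy "embeddings", hand-assigned for illustration only (hypothetical values).
# A real system would obtain these vectors from an embedding model.
TOY_VECTORS = {
    "police misconduct": [0.9, 0.1, 0.0],
    "use of force": [0.8, 0.2, 0.1],
    "officer-involved shooting": [0.85, 0.15, 0.05],
    "waterfront development dispute": [0.05, 0.1, 0.95],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def keyword_search(query, docs):
    """Return documents that contain the query as a literal substring."""
    return [d for d in docs if query.lower() in d.lower()]


def semantic_search(query, docs, threshold=0.9):
    """Return documents whose vector is close to the query's, best match first."""
    q = TOY_VECTORS[query]
    scored = [(d, cosine(q, TOY_VECTORS[d])) for d in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [d for d, score in scored if score >= threshold]


docs = ["use of force", "officer-involved shooting", "waterfront development dispute"]
print(keyword_search("police misconduct", docs))
print(semantic_search("police misconduct", docs))
```

With these toy vectors, the keyword search finds nothing because no document contains the literal phrase, while the semantic search returns the two policing-related entries and leaves out the waterfront one.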
Layout-aware OCR that reads historical broadsheets as they were typeset — column by column, ad by ad — and makes every article semantically searchable.
Purpose-built pipeline for Hansard and legislative records — extracts speaker-attributed debates, committee proceedings, and legislative journals into a fully searchable, citable corpus.
OCR and semantic search platform built on Spanish-language Latin American primary sources — colonial-era typography, 19th-century broadsheets, and archaic orthography handled natively.