Website Archiving API — Scheduled Captures You Control

Updated Jun 27, 2026

Websites disappear. Pages get edited, redirected, or taken offline without warning. A competitor rewrites their pricing page and you have no record of what it said yesterday. A regulatory body updates their guidance and the previous version vanishes. A news article gets retracted, a product listing gets pulled, a terms-of-service page loses a liability clause — and if you didn't save it, it's gone. The Wayback Machine captures some of the internet some of the time, but you can't control what it crawls, when it crawls, or whether it crawls your target at all.

If you need reliable, scheduled snapshots of specific web pages — not a random crawl that might cover your URL next month — you need a website archiving API that runs on your schedule and captures exactly what you point it at. One API call per page, a cron job for the schedule, and a storage bucket for the results. That's the entire architecture.

Why the Wayback Machine isn't enough for serious archiving

The Internet Archive does incredible work preserving the public web. But it was built for broad historical preservation, not for targeted, reliable archiving of specific URLs. Its crawl schedule is unpredictable — some pages get captured weekly, others go months or years between snapshots. Pages behind robots.txt directives are excluded entirely. And since early 2026, several major news publishers have restricted Wayback Machine access over AI scraping concerns, shrinking coverage further.

For legal evidence, compliance records, or competitive intelligence, "maybe it got crawled last month" isn't good enough. You need captures at known intervals — every hour, every day, every week — with consistent rendering parameters so you can compare screenshots over time. You need to control the viewport, the format, and what gets blocked (cookie banners, chat widgets, ad rotations). The Wayback Machine gives you none of that.

A website archiving API flips the model. It works as a Wayback Machine alternative that you actually control: you tell it exactly which URL to capture, exactly when, and exactly how. The result is a pixel-perfect screenshot or PDF that you store wherever you want: S3, Google Cloud Storage, or your own server. You own the archive.

Capturing a page for archiving

An archiving capture needs to be thorough and consistent. Full-page mode catches everything below the fold. PNG format preserves pixel-perfect fidelity. Cookie and ad blocking removes elements that change on every load and aren't part of the actual page content. Here's a capture request configured for archiving:

curl -G "https://api.screenshotrun.com/v1/screenshots/capture" \
  --data-urlencode "url=https://example.com/terms-of-service" \
  -d "full_page=true" \
  -d "format=png" \
  -d "width=1280" \
  -d "height=800" \
  -d "block_cookies=true" \
  -d "block_ads=true" \
  -d "delay=3" \
  -d "cache_ttl=0" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output "archive-2026-06-27.png"

The cache_ttl=0 parameter is critical for archiving — it forces a fresh render every time. Without it, you might get a cached version from a previous request, which defeats the purpose of capturing the page's current state. The delay=3 gives JavaScript-heavy pages time to finish rendering. If the page loads content dynamically, swap delay for wait_for_selector with a CSS selector that appears only after the real content loads.

API parameters for reliable archiving

Parameter	Recommended value	Why it matters for archiving
`full_page`	`true`	Captures the entire page, not just the visible viewport. Legal clauses, pricing footnotes, and policy details often sit below the fold. Full-page capture ensures nothing gets cut off.
`format`	`png` or `pdf`	PNG for pixel-perfect visual records. PDF for text-searchable documents with metadata. Choose based on whether you need to compare images over time (PNG) or produce printable evidence (PDF).
`width`	`1280`	Standard desktop viewport. Keep it consistent across all captures so archives are comparable.
`block_cookies`	`true`	Removes GDPR/cookie consent banners that obscure page content. Cookie blocking ensures the archive shows the actual page, not a modal overlay.
`block_ads`	`true`	Ads rotate on every load. They're not part of the page content you're archiving — they're noise.
`cache_ttl`	`0`	Forces a fresh capture. Cached results would archive stale content.
`delay`	`2-3`	Gives client-side JavaScript time to render. Pages with lazy loading or API-driven content need this breathing room.
`extract_metadata`	`true`	Returns page title, description, and other meta tags alongside the screenshot. Useful for cataloging archives.

For pages that load content from APIs or render as single-page apps, a fixed delay isn't always reliable. The wait_for_selector parameter lets you wait until a specific element appears in the DOM — a data table, a pricing grid, a terms section. That guarantees the content you're archiving has actually loaded before the capture fires.
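The swap between a fixed delay and wait_for_selector can be sketched as a small parameter builder. This is a minimal sketch that uses only the parameter names mentioned in this article; the function name build_archive_params and the selector #pricing-grid are hypothetical, and whether the API accepts delay and wait_for_selector together is not covered here, so the sketch sends only one of them.

```python
from urllib.parse import urlencode

API_URL = "https://api.screenshotrun.com/v1/screenshots/capture"


def build_archive_params(url, selector=None, delay=3):
    """Return query parameters for an archiving capture.

    If a CSS selector is given, wait for that element instead of
    using a fixed delay (hypothetical helper, not part of the API).
    """
    params = {
        "url": url,
        "full_page": "true",
        "format": "png",
        "width": "1280",
        "block_cookies": "true",
        "block_ads": "true",
        "cache_ttl": "0",
    }
    if selector:
        params["wait_for_selector"] = selector
    else:
        params["delay"] = str(delay)
    return params


# "#pricing-grid" is an illustrative selector; use one that only
# appears once the real content has rendered on your target page.
params = build_archive_params("https://example.com/pricing", selector="#pricing-grid")
print(f"{API_URL}?{urlencode(params)}")
```

Keeping the choice in one place means every target in a pipeline gets the same baseline parameters, and only the wait strategy varies per page.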

Who archives websites and why

Legal teams capture web pages as evidence for litigation. Trademark infringement, defamation, contract disputes — if the opposing party can edit or delete the page before trial, the screenshot is all you have. Courts accept properly timestamped screenshots as evidence, but the burden of proof falls on demonstrating that the capture is authentic and unaltered. Regular, automated captures with consistent parameters strengthen that case significantly compared to a manual screenshot taken once in a browser.

Compliance departments in regulated industries need records of what their own websites displayed at specific points in time. Financial services firms under SEC and FINRA rules must retain copies of public-facing communications, including web pages. Healthcare organizations subject to HIPAA need to document what patient-facing information was published. Insurance companies need records of policy terms as they appeared to customers. An automated website archiving pipeline running on a daily schedule builds these records without manual intervention.

Competitive intelligence teams track competitor websites for pricing changes, feature announcements, and messaging shifts. If you're monitoring 20 competitor pages and a pricing page changes overnight, you want last night's version on file. This overlaps with website change monitoring — the difference is that monitoring focuses on detecting that something changed, while archiving focuses on preserving what the page looked like at a specific moment.

Researchers and journalists archive web content that might be edited or removed. News articles get retracted, social media posts get deleted, government pages get revised without changelogs. A systematic archive preserves the record even if the original source changes or disappears.

Product teams archive their own sites before and after major releases. If a deploy breaks the marketing site or introduces unintended copy changes, having a pre-deploy archive lets you see exactly what changed. This pairs naturally with visual regression testing — the testing catches bugs in CI, while the archive preserves the production state for auditing.

Choosing the right format: PNG vs PDF vs WebP

Format	Best for	Trade-off
PNG	Visual comparison, pixel-level diffing, monitoring pipelines	Larger files, but lossless — every pixel is preserved exactly. Essential if you plan to run automated comparisons between archive snapshots.
PDF	Legal evidence, compliance records, printable documents	Text remains searchable and selectable. Widely accepted in legal proceedings. Use `pdf_page_format=a4` for standard document sizing. PDF capture produces print-ready output.
WebP	Long-term storage of large archives where file size matters	70-80% smaller than PNG with near-lossless quality. Good for archives where you need visual reference but don't plan pixel-level comparison. Not ideal for legal evidence where lossless fidelity matters.

For most archiving workflows, PDF is the strongest choice. It preserves the visual layout, keeps text searchable, embeds metadata, and is universally accepted in legal and compliance contexts. If you also need to run visual comparisons between archive snapshots, capture in both PNG and PDF — one for machine comparison, one for human review and recordkeeping. I covered the full breakdown of format options and quality settings on the feature page — the short version is that PNG and PDF are the only formats that make sense for archiving where fidelity matters.
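Capturing the same page in both formats comes down to two requests that differ only in format. The sketch below plans those two captures and their output paths without calling the API; plan_dual_capture is a hypothetical helper, and the parameter names are the ones quoted in this article.

```python
import os


def plan_dual_capture(url, label, date_str, archive_dir="./archive"):
    """Return one (params, file_path) pair per format for a single page."""
    base = {
        "url": url,
        "full_page": "true",
        "width": "1280",
        "block_cookies": "true",
        "block_ads": "true",
        "cache_ttl": "0",
    }
    # PNG for machine comparison, PDF (A4) for human review and records.
    png = dict(base, format="png")
    pdf = dict(base, format="pdf", pdf_page_format="a4")
    folder = os.path.join(archive_dir, label)
    return [
        (png, os.path.join(folder, f"{date_str}.png")),
        (pdf, os.path.join(folder, f"{date_str}.pdf")),
    ]


for params, path in plan_dual_capture(
    "https://example.com/terms", "terms-of-service", "2026-06-27"
):
    print(params["format"], "->", path)
```

Each pair can then be passed to whatever function sends the request and writes the response to disk. Note that this uses two captures per page per run.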

Automating an archive pipeline in Node.js

This script captures a list of URLs and saves each screenshot with a timestamped filename. It relies on the global fetch, which is built into Node.js 18 and later. Run it on a cron schedule: daily, weekly, or whatever frequency your archiving requirements demand.

const fs = require('fs');
const path = require('path');

const API_KEY = process.env.SCREENSHOTRUN_API_KEY;
const API_URL = 'https://api.screenshotrun.com/v1/screenshots/capture';
const ARCHIVE_DIR = './archive';

async function archivePage(url, label) {
  const params = new URLSearchParams({
    url,
    full_page: 'true',
    format: 'png',
    width: '1280',
    height: '800',
    block_cookies: 'true',
    block_ads: 'true',
    delay: '3',
    cache_ttl: '0',
    extract_metadata: 'true',
  });

  const res = await fetch(`${API_URL}?${params}`, {
    headers: { 'Authorization': `Bearer ${API_KEY}` },
  });

  if (!res.ok) {
    console.error(`Failed to capture ${label}: ${res.status}`);
    return;
  }

  const date = new Date().toISOString().split('T')[0];
  const dir = path.join(ARCHIVE_DIR, label);
  fs.mkdirSync(dir, { recursive: true });

  const buffer = Buffer.from(await res.arrayBuffer());
  const filePath = path.join(dir, `${date}.png`);
  fs.writeFileSync(filePath, buffer);

  console.log(`Archived: ${label} -> ${filePath}`);
}

const targets = [
  { url: 'https://competitor.com/pricing', label: 'competitor-pricing' },
  { url: 'https://example.com/terms', label: 'terms-of-service' },
  { url: 'https://news-site.com/article/12345', label: 'news-article' },
  { url: 'https://yoursite.com', label: 'own-homepage' },
];

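(async () => {
  for (const { url, label } of targets) {
    try {
      await archivePage(url, label);
    } catch (err) {
      // fetch rejects on network errors; one bad URL should not abort the run.
      console.error(`Error capturing ${label}: ${err.message}`);
    }
  }
  console.log(`Archive run complete: ${new Date().toISOString()}`);
})();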

Schedule it with cron: 0 6 * * * node archive.js runs daily at 6 AM. Each URL gets its own folder with date-stamped files: archive/competitor-pricing/2026-06-27.png. Because the filename is the date alone, a second run on the same day overwrites the first; for hourly captures, include the time in the filename. Over weeks and months, you build a visual timeline of every page. For larger URL lists, add a short delay between captures to stay within rate limits. The free tier allows 5 requests per minute, which works out to one capture every 12 seconds; paid plans go higher. I wrote about handling errors in production pipelines separately. Retries and graceful failures matter when your archive script runs unattended in the early morning.
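The spacing and retry logic can be sketched independently of the capture call itself. This is a minimal sketch in Python, assuming a capture function that returns True on success; run_throttled and its arguments are hypothetical names, and the only figure taken from this article is the 5 requests per minute limit.

```python
import time


def run_throttled(targets, capture, requests_per_minute=5,
                  max_attempts=3, sleep=time.sleep):
    """Call capture(url, label) for each target without exceeding the rate limit.

    Every call after the first, including retries, waits one interval.
    Returns the labels that still failed after max_attempts tries.
    """
    interval = 60.0 / requests_per_minute
    failed = []
    calls = 0
    for url, label in targets:
        for _ in range(max_attempts):
            if calls:
                sleep(interval)
            calls += 1
            if capture(url, label):
                break
        else:
            # The loop finished without a successful capture.
            failed.append(label)
    return failed


# Demo with a stand-in capture function and a recorded (not real) sleep.
def fake_capture(url, label):
    return label != "unreachable"


pauses = []
failed = run_throttled(
    [("https://example.com/a", "a"), ("https://example.com/b", "unreachable")],
    fake_capture,
    sleep=pauses.append,
)
print("failed:", failed, "pauses:", pauses)
```

Passing sleep as an argument keeps the function testable without waiting; in a real run the default time.sleep applies. A production version would likely add backoff and distinguish retryable status codes from permanent ones.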

Python archiving script with metadata logging

If Python is your stack, the same pattern works with the requests library. This version adds metadata logging (capture timestamp, file size, and a SHA-256 file hash) to a JSON file alongside each screenshot. Useful for compliance workflows where an audit trail matters as much as the image itself.

import os, json, hashlib, requests
from datetime import datetime

API_KEY = os.environ['SCREENSHOTRUN_API_KEY']
API_URL = 'https://api.screenshotrun.com/v1/screenshots/capture'
ARCHIVE_DIR = './archive'

def archive_page(url, label):
    params = {
        'url': url, 'full_page': 'true', 'format': 'png',
        'width': '1280', 'height': '800',
        'block_cookies': 'true', 'block_ads': 'true',
        'delay': '3', 'cache_ttl': '0',
        'extract_metadata': 'true',
    }

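    # The timeout value is an assumption; without one, a stalled request
    # would hang an unattended run indefinitely.
    res = requests.get(API_URL, params=params,
                       headers={'Authorization': f'Bearer {API_KEY}'},
                       timeout=120)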

    if res.status_code != 200:
        print(f'Failed: {label} ({res.status_code})')
        return

    # Take one timestamp so the filename date and the logged time agree;
    # astimezone() attaches the local UTC offset to the ISO string.
    now = datetime.now().astimezone()
    date = now.strftime('%Y-%m-%d')
    folder = os.path.join(ARCHIVE_DIR, label)
    os.makedirs(folder, exist_ok=True)

    file_path = os.path.join(folder, f'{date}.png')
    with open(file_path, 'wb') as f:
        f.write(res.content)

    file_hash = hashlib.sha256(res.content).hexdigest()

    # Append this capture to the per-target audit log.
    log_path = os.path.join(folder, 'log.json')
    log = []
    if os.path.exists(log_path):
        with open(log_path) as f:
            log = json.load(f)
    log.append({
        'date': date,
        'url': url,
        'file': f'{date}.png',
        'sha256': file_hash,
        'timestamp': now.isoformat(),
        'size_bytes': len(res.content),
    })
    with open(log_path, 'w') as f:
        json.dump(log, f, indent=2)

    print(f'Archived: {label} -> {file_path} ({file_hash[:16]}...)')

targets = [
    ('https://competitor.com/pricing', 'competitor-pricing'),
    ('https://example.com/terms', 'terms-of-service'),
    ('https://regulator.gov/guidance', 'regulatory-guidance'),
]

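for url, label in targets:
    try:
        archive_page(url, label)
    except requests.RequestException as exc:
        # One unreachable URL should not stop the remaining captures.
        print(f'Error: {label} ({exc})')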

The SHA-256 hash logged alongside each capture creates a verifiable chain of custody. If someone questions whether an archive was tampered with, you can hash the file again and compare. Combined with the timestamp, this is the foundation of admissible digital evidence — not bulletproof on its own, but far stronger than a manual screenshot with no metadata trail.
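Re-checking a capture against the log can be sketched as follows. This assumes the log.json layout written by the script above (a list of entries with file and sha256 keys); verify_capture is a hypothetical helper, and the demo uses a temporary folder with stand-in bytes rather than a real screenshot.

```python
import hashlib
import json
import os
import tempfile


def verify_capture(folder, filename):
    """Re-hash an archived file and compare it with the hash in log.json.

    Returns True if the hashes match, False if the file has changed.
    Raises KeyError if the log has no entry for the file.
    """
    with open(os.path.join(folder, "log.json")) as f:
        log = json.load(f)
    entries = [e for e in log if e["file"] == filename]
    if not entries:
        raise KeyError(f"no log entry for {filename}")
    with open(os.path.join(folder, filename), "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    # Use the most recent entry in case the file was re-captured.
    return actual == entries[-1]["sha256"]


# Demo: write a stand-in file and log entry, verify, then modify the file.
with tempfile.TemporaryDirectory() as folder:
    data = b"stand-in image bytes"
    image_path = os.path.join(folder, "2026-06-27.png")
    with open(image_path, "wb") as f:
        f.write(data)
    with open(os.path.join(folder, "log.json"), "w") as f:
        json.dump([{"file": "2026-06-27.png",
                    "sha256": hashlib.sha256(data).hexdigest()}], f)
    untouched = verify_capture(folder, "2026-06-27.png")
    with open(image_path, "ab") as f:
        f.write(b"extra bytes")
    modified = verify_capture(folder, "2026-06-27.png")
    print(untouched, modified)
```

A hash only proves something if the log itself is protected. A log that sits in the same writable folder as the image can be edited along with it, so for evidentiary use it is worth copying the log to a separate location with its own access controls.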

Pricing: building an archive with a screenshot API

Archiving costs depend on how many URLs you capture and how often. Here's what different archiving workloads look like on screenshotrun versus alternatives:

Workload	screenshotrun	PageFreezer	Stillio	Self-hosted Puppeteer
10 URLs, daily	$9/mo (about 300 captures; the free 200/mo covers ~6 URLs)	~$99/mo (enterprise plans)	$29/mo	$10-20/mo server + your time
50 URLs, daily	$9/mo (3,000 included)	~$199/mo	$49/mo	$20-40/mo server + your time
200 URLs, daily	$29/mo (10,000 included)	Custom pricing	$99/mo	$40-80/mo server + your time
500 URLs, daily	$49/mo (25,000 included)	Custom pricing	Not available	Dedicated server + ops overhead

PageFreezer and enterprise archiving tools charge hundreds to thousands per month because they bundle storage, dashboards, compliance reporting, and support contracts. If you need all of that, they might be worth it. But if you're a developer who just needs the screenshots and can handle storage yourself, a website archiving API at $9-29/month does the capture job at a fraction of the cost. The free tier at 200 captures per month is enough to archive 6 URLs daily — plenty to validate your pipeline before committing to a paid plan.
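The arithmetic behind these tiers is simple enough to check in a few lines. The sketch below uses only the screenshotrun plan sizes quoted in this article (200, 3,000, 10,000, and 25,000 captures per month) and assumes a 30-day month; the function names are hypothetical.

```python
# (monthly price in USD, captures included), as quoted in this article.
PLANS = [(0, 200), (9, 3000), (29, 10000), (49, 25000)]


def monthly_captures(urls, captures_per_day=1, days=30):
    """Number of captures a schedule consumes in one month."""
    return urls * captures_per_day * days


def cheapest_plan(captures):
    """Return the price of the smallest plan that covers the volume, or None."""
    for price, included in PLANS:
        if captures <= included:
            return price
    return None


for urls in (6, 10, 50, 200, 500):
    needed = monthly_captures(urls)
    print(f"{urls} URLs daily -> {needed} captures/month -> ${cheapest_plan(needed)}/mo")
```

Capturing each page in two formats, or more than once a day, multiplies the count: set captures_per_day accordingly before picking a plan.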

Practical considerations for long-term archives

Archives grow. If you capture 50 full-page PNGs daily, each averaging 2-4 MB, that's 3-6 GB per month. Over a year, you're looking at 36-72 GB. That's cheap on S3 or Google Cloud Storage (under $2/month for standard storage), but you'll want a retention policy. Do you need daily captures from two years ago, or would monthly snapshots suffice for older records? A simple script that consolidates older captures, keeping the first of each month and deleting the rest, cuts storage by roughly 97% (one file kept out of about thirty) for anything older than 90 days.
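The consolidation rule can be sketched as a pure function that decides which capture dates to drop. This is a minimal sketch, not a finished cleanup script: files_to_delete is a hypothetical name, it keeps the earliest capture in each month (which is the 1st when no day was missed), and it leaves the actual file deletion to the caller.

```python
from datetime import date, timedelta


def files_to_delete(capture_dates, today, keep_days=90):
    """Return the capture dates that a monthly-consolidation policy would drop.

    Captures newer than keep_days are always kept. For older captures,
    only the earliest one in each calendar month is kept.
    """
    cutoff = today - timedelta(days=keep_days)
    old = sorted(d for d in capture_dates if d < cutoff)
    first_of_month = {}
    for d in old:
        # setdefault keeps the first (earliest) date seen for each month.
        first_of_month.setdefault((d.year, d.month), d)
    kept = set(first_of_month.values())
    return [d for d in old if d not in kept]


# Demo: daily captures from 1 January to 27 June 2026.
start, today = date(2026, 1, 1), date(2026, 6, 27)
daily = [start + timedelta(days=i) for i in range((today - start).days + 1)]
doomed = files_to_delete(daily, today)
print(len(daily), "captures,", len(doomed), "eligible for deletion")
```

With the naming scheme used in this article, each returned date maps to a file such as archive/{label}/{date.isoformat()}.png. Running the function in a dry-run mode first, and only then deleting, is a sensible precaution for anything that serves as a record.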

File naming matters more than you'd think. A flat folder with 10,000 files named screenshot-1.png through screenshot-10000.png is useless. Organize by target and date: archive/{label}/{YYYY-MM-DD}.png. The scripts above already follow this pattern. When someone asks "what did their pricing page look like in March?", you open a folder and grab the file.

For compliance use cases where archives may need to be presented as evidence, consider capturing in both PNG and PDF. The PNG gives you a pixel-perfect visual record for comparison. The PDF version preserves searchable text and is the format most commonly accepted in legal and regulatory proceedings. The cost is one extra API call per page, which doubles the captures counted against your monthly plan.

Start archiving websites automatically

Get your API key — 200 free screenshots/month

A website archiving API replaces the uncertainty of hoping the Wayback Machine crawls your target with the reliability of a scheduled capture you control. Whether you're building a compliance archive for regulatory audits, preserving evidence for legal proceedings, tracking competitor changes, or documenting your own site's history — the architecture is the same: an API call, a timestamp, a stored file. The viewport settings keep every capture consistent. Cookie blocking removes consent overlays. Full-page capture ensures nothing hides below the fold. And if the archive also needs to double as website thumbnails for a directory or dashboard, adding resize_width to the request gives you a display-ready version from the same capture. The pages you're watching today might not exist tomorrow. Archive them while they're still there.

Frequently asked questions

How do I archive a website automatically with an API?

Send a GET request to the screenshot API with the target URL, set full_page to true and format to PNG or PDF, and save the response with a timestamped filename. Run the script on a cron schedule — daily, hourly, or whatever frequency you need. Each capture is independent: you send a URL, you get a screenshot. Storage is yours to manage — S3, Google Cloud Storage, or a local disk. The API handles the browser rendering; your script handles the scheduling and file management.

Which format is best for website archiving: PNG, PDF, or WebP?

PDF for legal and compliance use cases — it preserves searchable text, embeds metadata, and is the standard format accepted in legal proceedings. PNG for visual comparison workflows where you need pixel-perfect fidelity to run automated diffs between captures. WebP for large-scale archives where storage cost matters more than pixel-level accuracy. For maximum coverage, capture in both PNG and PDF — one for machine comparison, one for human review.

How much does website archiving with an API cost?

It depends on how many URLs you archive and how often. Archiving 10 URLs daily uses about 300 captures per month, which is more than the 200 in screenshotrun's free tier, so it needs the $9/mo plan; the free tier covers about 6 URLs daily. Archiving 50 URLs daily uses about 1,500 captures per month, which also fits the $9/mo plan (3,000 included). Compare that to enterprise archiving tools like PageFreezer that start at $99/month or Stillio at $29/month. The API approach is cheaper because you bring your own storage and scheduling, so you're paying only for the screenshots.

Are archived screenshots admissible as legal evidence?

Screenshots are increasingly accepted as evidence in court, but their admissibility depends on demonstrating authenticity. Automated captures with consistent parameters, timestamped filenames, and SHA-256 file hashes create a verifiable chain of custody. This is stronger than a manual browser screenshot with no metadata. For maximum evidentiary value, capture in PDF format, log the hash and timestamp to a separate audit file, and store archives in a system with access controls. Some jurisdictions may require additional authentication — consult with legal counsel for your specific needs.

Is a screenshot API a better choice than the Wayback Machine?

For targeted archiving of specific URLs, yes. The Wayback Machine crawls the web broadly but unpredictably — your target page might get captured weekly or might go months between snapshots. You can't control the schedule, the viewport, or the rendering parameters. Pages blocked by robots.txt are excluded entirely. A screenshot API captures exactly the URLs you specify, on your schedule, with consistent parameters. The Wayback Machine is excellent for broad historical preservation of the public web. A screenshot API is better when you need reliable, scheduled captures of specific pages for compliance, legal, or competitive intelligence.