Website Archiving API — Scheduled Captures You Control
Websites disappear. Pages get edited, redirected, or taken offline without warning. A competitor rewrites their pricing page and you have no record of what it said yesterday. A regulatory body updates their guidance and the previous version vanishes. A news article gets retracted, a product listing gets pulled, a terms-of-service page loses a liability clause — and if you didn't save it, it's gone. The Wayback Machine captures some of the internet some of the time, but you can't control what it crawls, when it crawls, or whether it crawls your target at all.
If you need reliable, scheduled snapshots of specific web pages — not a random crawl that might cover your URL next month — you need a website archiving API that runs on your schedule and captures exactly what you point it at. One API call per page, a cron job for the schedule, and a storage bucket for the results. That's the entire architecture.
Why the Wayback Machine isn't enough for serious archiving
The Internet Archive does incredible work preserving the public web. But it was built for broad historical preservation, not for targeted, reliable archiving of specific URLs. Its crawl schedule is unpredictable — some pages get captured weekly, others go months or years between snapshots. Pages behind robots.txt directives are excluded entirely. And since early 2026, several major news publishers have restricted Wayback Machine access over AI scraping concerns, shrinking coverage further.
For legal evidence, compliance records, or competitive intelligence, "maybe it got crawled last month" isn't good enough. You need captures at known intervals — every hour, every day, every week — with consistent rendering parameters so you can compare screenshots over time. You need to control the viewport, the format, and what gets blocked (cookie banners, chat widgets, ad rotations). The Wayback Machine gives you none of that.
A website archiving API flips the model. It works as a Wayback Machine alternative that you actually control: you specify exactly which URL to capture, exactly when, and exactly how. The result is a pixel-perfect screenshot or PDF that you store wherever you want: S3, Google Cloud Storage, or your own server. You own the archive.
Capturing a page for archiving
An archiving capture needs to be thorough and consistent. Full-page mode catches everything below the fold. PNG format preserves pixel-perfect fidelity. Cookie and ad blocking removes elements that change on every load and aren't part of the actual page content. Here's a capture request configured for archiving:
curl -G "https://api.screenshotrun.com/v1/screenshots/capture" \
  --data-urlencode "url=https://example.com/terms-of-service" \
  -d "full_page=true" \
  -d "format=png" \
  -d "width=1280" \
  -d "height=800" \
  -d "block_cookies=true" \
  -d "block_ads=true" \
  -d "delay=3" \
  -d "cache_ttl=0" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output "archive-2026-06-27.png"
The cache_ttl=0 parameter is critical for archiving — it forces a fresh render every time. Without it, you might get a cached version from a previous request, which defeats the purpose of capturing the page's current state. The delay=3 gives JavaScript-heavy pages time to finish rendering. If the page loads content dynamically, swap delay for wait_for_selector with a CSS selector that appears only after the real content loads.
API parameters for reliable archiving
| Parameter | Recommended value | Why it matters for archiving |
|---|---|---|
| full_page | true | Captures the entire page, not just the visible viewport. Legal clauses, pricing footnotes, and policy details often sit below the fold. Full-page capture ensures nothing gets cut off. |
| format | png or pdf | PNG for pixel-perfect visual records. PDF for text-searchable documents with metadata. Choose based on whether you need to compare images over time (PNG) or produce printable evidence (PDF). |
| width | 1280 | Standard desktop viewport. Keep it consistent across all captures so archives are comparable. |
| block_cookies | true | Removes GDPR/cookie consent banners that obscure page content. Cookie blocking ensures the archive shows the actual page, not a modal overlay. |
| block_ads | true | Ads rotate on every load. They're not part of the page content you're archiving; they're noise. |
| cache_ttl | 0 | Forces a fresh capture. Cached results would archive stale content. |
| delay | 2-3 | Gives client-side JavaScript time to render. Pages with lazy loading or API-driven content need this breathing room. |
| extract_metadata | true | Returns page title, description, and other meta tags alongside the screenshot. Useful for cataloging archives. |
For pages that load content from APIs or render as single-page apps, a fixed delay isn't always reliable. The wait_for_selector parameter lets you wait until a specific element appears in the DOM — a data table, a pricing grid, a terms section. That guarantees the content you're archiving has actually loaded before the capture fires.
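Here is a minimal sketch of that swap. It assumes wait_for_selector is passed as a plain query parameter like the others and is used instead of the fixed delay. The helper name build_capture_params and the .pricing-grid selector are illustrative, not part of the API.

```python
def build_capture_params(url, wait_for_selector=None, delay_seconds=3):
    """Build query parameters for an archiving capture.

    Assumption: wait_for_selector is sent as a query parameter and replaces
    the fixed delay rather than being combined with it.
    """
    params = {
        "url": url,
        "full_page": "true",
        "format": "png",
        "width": "1280",
        "block_cookies": "true",
        "block_ads": "true",
        "cache_ttl": "0",
    }
    if wait_for_selector:
        # Wait for a specific element instead of a fixed pause.
        params["wait_for_selector"] = wait_for_selector
    else:
        params["delay"] = str(delay_seconds)
    return params


# A single-page app waits for its pricing grid; a static page keeps the delay.
spa_params = build_capture_params(
    "https://example.com/pricing", wait_for_selector=".pricing-grid"
)
static_params = build_capture_params("https://example.com/terms")
```

Pick a selector that only exists once the real content is in the DOM. A wrapper element that renders immediately with a loading spinner inside would fire the capture too early.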
Who archives websites and why
Legal teams capture web pages as evidence for litigation. In trademark infringement, defamation, and contract disputes, the opposing party can edit or delete a page before trial, and then the screenshot is all you have. Courts can accept properly timestamped screenshots as evidence, but the party offering them carries the burden of showing that the capture is authentic and unaltered. Regular, automated captures with consistent parameters strengthen that case significantly compared to a manual screenshot taken once in a browser.
Compliance departments in regulated industries need records of what their own websites displayed at specific points in time. Financial services firms under SEC and FINRA rules must retain copies of public-facing communications, including web pages. Healthcare organizations subject to HIPAA need to document what patient-facing information was published. Insurance companies need records of policy terms as they appeared to customers. An automated website archiving pipeline running on a daily schedule builds these records without manual intervention.
Competitive intelligence teams track competitor websites for pricing changes, feature announcements, and messaging shifts. If you're monitoring 20 competitor pages and a pricing page changes overnight, you want last night's version on file. This overlaps with website change monitoring — the difference is that monitoring focuses on detecting that something changed, while archiving focuses on preserving what the page looked like at a specific moment.
Researchers and journalists archive web content that might be edited or removed. News articles get retracted, social media posts get deleted, government pages get revised without changelogs. A systematic archive preserves the record even if the original source changes or disappears.
Product teams archive their own sites before and after major releases. If a deploy breaks the marketing site or introduces unintended copy changes, having a pre-deploy archive lets you see exactly what changed. This pairs naturally with visual regression testing — the testing catches bugs in CI, while the archive preserves the production state for auditing.
Choosing the right format: PNG vs PDF vs WebP
| Format | Best for | Trade-off |
|---|---|---|
| PNG | Visual comparison, pixel-level diffing, monitoring pipelines | Larger files, but lossless — every pixel is preserved exactly. Essential if you plan to run automated comparisons between archive snapshots. |
| PDF | Legal evidence, compliance records, printable documents | Text remains searchable and selectable. Widely accepted in legal proceedings. Use pdf_page_format=a4 for standard document sizing. PDF capture produces print-ready output. |
| WebP | Long-term storage of large archives where file size matters | 70-80% smaller than PNG with near-lossless quality. Good for archives where you need visual reference but don't plan pixel-level comparison. Not ideal for legal evidence where lossless fidelity matters. |
For most archiving workflows, PDF is the strongest choice. It preserves the visual layout, keeps text searchable, embeds metadata, and is universally accepted in legal and compliance contexts. If you also need to run visual comparisons between archive snapshots, capture in both PNG and PDF — one for machine comparison, one for human review and recordkeeping. I covered the full breakdown of format options and quality settings on the feature page — the short version is that PNG and PDF are the only formats that make sense for archiving where fidelity matters.
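Capturing in both formats amounts to two requests that differ only in format. The sketch below builds the two parameter sets and matching filenames without calling the API. The helper name dual_format_jobs is hypothetical, and it is an assumption that the other archiving parameters apply unchanged to PDF captures.

```python
from datetime import date

# Shared settings, copied from the archiving request shown earlier.
BASE_PARAMS = {
    "full_page": "true",
    "width": "1280",
    "block_cookies": "true",
    "block_ads": "true",
    "cache_ttl": "0",
}


def dual_format_jobs(url, label, capture_date):
    """Return one (params, relative_path) pair per format for a single page."""
    jobs = []
    for fmt in ("png", "pdf"):
        params = dict(BASE_PARAMS, url=url, format=fmt)
        if fmt == "pdf":
            # pdf_page_format=a4 is the sizing option named in the table above.
            params["pdf_page_format"] = "a4"
        jobs.append((params, f"{label}/{capture_date.isoformat()}.{fmt}"))
    return jobs


jobs = dual_format_jobs(
    "https://example.com/terms", "terms-of-service", date(2026, 6, 27)
)
for params, relative_path in jobs:
    print(params["format"], "->", relative_path)
```

Both files share a date stamp and differ only in extension, so the PNG used for machine comparison and the PDF kept for recordkeeping sit side by side in the same folder.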
Automating an archive pipeline in Node.js
This script captures a list of URLs and saves each screenshot with a timestamped filename. Run it on a cron schedule — daily, weekly, or whatever frequency your archiving requirements demand.
// Requires Node.js 18+ for the built-in fetch API.
const fs = require('fs');
const path = require('path');

const API_KEY = process.env.SCREENSHOTRUN_API_KEY;
const API_URL = 'https://api.screenshotrun.com/v1/screenshots/capture';
const ARCHIVE_DIR = './archive';

// Capture one URL and save it as archive/<label>/<YYYY-MM-DD>.png.
async function archivePage(url, label) {
  const params = new URLSearchParams({
    url,
    full_page: 'true',
    format: 'png',
    width: '1280',
    height: '800',
    block_cookies: 'true',
    block_ads: 'true',
    delay: '3',
    cache_ttl: '0',
    extract_metadata: 'true',
  });

  const res = await fetch(`${API_URL}?${params}`, {
    headers: { 'Authorization': `Bearer ${API_KEY}` },
  });

  // Log and skip failed captures so one bad URL doesn't stop the run.
  if (!res.ok) {
    console.error(`Failed to capture ${label}: ${res.status}`);
    return;
  }

  // Date stamp in UTC (toISOString), e.g. 2026-06-27.
  const date = new Date().toISOString().split('T')[0];
  const dir = path.join(ARCHIVE_DIR, label);
  fs.mkdirSync(dir, { recursive: true });

  const buffer = Buffer.from(await res.arrayBuffer());
  const filePath = path.join(dir, `${date}.png`);
  fs.writeFileSync(filePath, buffer);
  console.log(`Archived: ${label} -> ${filePath}`);
}

const targets = [
  { url: 'https://competitor.com/pricing', label: 'competitor-pricing' },
  { url: 'https://example.com/terms', label: 'terms-of-service' },
  { url: 'https://news-site.com/article/12345', label: 'news-article' },
  { url: 'https://yoursite.com', label: 'own-homepage' },
];

// Capture targets one at a time, in order.
(async () => {
  for (const { url, label } of targets) {
    await archivePage(url, label);
  }
  console.log(`Archive run complete: ${new Date().toISOString()}`);
})();
Schedule it with cron — 0 6 * * * node archive.js runs daily at 6 AM. Each URL gets its own folder with date-stamped files: archive/competitor-pricing/2026-06-27.png. Over weeks and months, you build a visual timeline of every page. For larger URL lists, add a short delay between captures to stay within rate limits — the free tier allows 5 requests per minute, paid plans go higher. I wrote about handling errors in production pipelines separately — retries and graceful failures matter when your archive script runs unattended at 3 AM.
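The pacing and retry logic is small enough to sketch. The version below is in Python and takes the capture function as an argument, so it is independent of any particular HTTP client. The names run_archive and capture_with_retry are hypothetical, and it is an assumption that a failed capture is signalled by returning None. At 5 requests per minute the minimum spacing is 60 / 5 = 12 seconds, and retries wait at least that long so they don't break the limit themselves.

```python
import time


def min_interval_seconds(requests_per_minute):
    """Smallest gap between requests that stays within the rate limit."""
    return 60.0 / requests_per_minute


def capture_with_retry(capture, url, attempts=3, base_backoff=12.0, sleep=time.sleep):
    """Call capture(url) until it returns something other than None.

    Waits base_backoff, then 2x base_backoff, and so on between attempts.
    """
    for attempt in range(attempts):
        result = capture(url)
        if result is not None:
            return result
        if attempt < attempts - 1:
            sleep(base_backoff * 2 ** attempt)
    return None  # Give up gracefully; the caller decides what to log.


def run_archive(urls, capture, requests_per_minute=5, sleep=time.sleep):
    """Capture every URL in order, pausing between requests."""
    interval = min_interval_seconds(requests_per_minute)
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(interval)
        results[url] = capture_with_retry(
            capture, url, base_backoff=interval, sleep=sleep
        )
    return results
```

Passing sleep in as a parameter keeps the loop testable without real waiting. In production you would leave it at the default time.sleep.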
Python archiving script with metadata logging
If Python is your stack, the same pattern works with requests. This version adds metadata logging (capture timestamp, file size, and a SHA-256 file hash) to a JSON file alongside each screenshot. Useful for compliance workflows where an audit trail matters as much as the image itself.
import hashlib
import json
import os
from datetime import datetime

import requests

API_KEY = os.environ['SCREENSHOTRUN_API_KEY']
API_URL = 'https://api.screenshotrun.com/v1/screenshots/capture'
ARCHIVE_DIR = './archive'


def archive_page(url, label):
    """Capture one URL, save the PNG, and append an entry to the folder's log.json."""
    params = {
        'url': url, 'full_page': 'true', 'format': 'png',
        'width': '1280', 'height': '800',
        'block_cookies': 'true', 'block_ads': 'true',
        'delay': '3', 'cache_ttl': '0',
        'extract_metadata': 'true',
    }
    # The 60-second timeout is an arbitrary choice so an unattended run can't hang.
    res = requests.get(API_URL, params=params,
                       headers={'Authorization': f'Bearer {API_KEY}'},
                       timeout=60)
    if res.status_code != 200:
        print(f'Failed: {label} ({res.status_code})')
        return

    # Take one timestamp so the filename date and the logged time agree.
    now = datetime.now()
    date = now.strftime('%Y-%m-%d')
    folder = os.path.join(ARCHIVE_DIR, label)
    os.makedirs(folder, exist_ok=True)

    file_path = os.path.join(folder, f'{date}.png')
    with open(file_path, 'wb') as f:
        f.write(res.content)

    # Hash the exact bytes written to disk.
    file_hash = hashlib.sha256(res.content).hexdigest()

    # Load the existing log, if any, and append this capture.
    log_path = os.path.join(folder, 'log.json')
    if os.path.exists(log_path):
        with open(log_path) as f:
            log = json.load(f)
    else:
        log = []
    log.append({
        'date': date,
        'url': url,
        'file': f'{date}.png',
        'sha256': file_hash,
        'timestamp': now.isoformat(),
        'size_bytes': len(res.content),
    })
    with open(log_path, 'w') as f:
        json.dump(log, f, indent=2)
    print(f'Archived: {label} -> {file_path} ({file_hash[:16]}...)')


targets = [
    ('https://competitor.com/pricing', 'competitor-pricing'),
    ('https://example.com/terms', 'terms-of-service'),
    ('https://regulator.gov/guidance', 'regulatory-guidance'),
]

for url, label in targets:
    archive_page(url, label)
The SHA-256 hash logged alongside each capture creates a verifiable chain of custody. If someone questions whether an archive was tampered with, you can hash the file again and compare. Combined with the timestamp, this is the foundation of admissible digital evidence — not bulletproof on its own, but far stronger than a manual screenshot with no metadata trail.
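The re-hash-and-compare step looks roughly like this. It assumes the log.json layout written by the script above (a list of entries with file and sha256 keys). The function names are my own.

```python
import hashlib
import json
import os


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large captures don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(folder):
    """Return a (file, status) pair for every entry in folder/log.json.

    Status is "ok", "MISMATCH" (contents changed since capture), or "missing".
    """
    with open(os.path.join(folder, "log.json")) as f:
        log = json.load(f)
    report = []
    for entry in log:
        path = os.path.join(folder, entry["file"])
        if not os.path.exists(path):
            status = "missing"
        elif sha256_of_file(path) == entry["sha256"]:
            status = "ok"
        else:
            status = "MISMATCH"
        report.append((entry["file"], status))
    return report
```

Calling verify_archive('./archive/terms-of-service') returns one status per logged capture. The check is only as trustworthy as the log itself, since anyone who can edit the image can usually edit a log stored in the same folder.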
Pricing: building an archive with a screenshot API
Archiving costs depend on how many URLs you capture and how often. Here's what different archiving workloads look like on screenshotrun versus alternatives:
| Workload | screenshotrun | PageFreezer | Stillio | Self-hosted Puppeteer |
|---|---|---|---|---|
| 10 URLs, daily | $9/mo (300/mo exceeds the 200 free captures) | ~$99/mo (enterprise plans) | $29/mo | $10-20/mo server + your time |
| 50 URLs, daily | $9/mo (3,000 included) | ~$199/mo | $49/mo | $20-40/mo server + your time |
| 200 URLs, daily | $29/mo (10,000 included) | Custom pricing | $99/mo | $40-80/mo server + your time |
| 500 URLs, daily | $49/mo (25,000 included) | Custom pricing | Not available | Dedicated server + ops overhead |
PageFreezer and enterprise archiving tools charge hundreds to thousands per month because they bundle storage, dashboards, compliance reporting, and support contracts. If you need all of that, they might be worth it. But if you're a developer who just needs the screenshots and can handle storage yourself, a website archiving API at $9-29/month does the capture job at a fraction of the cost. The free tier at 200 captures per month is enough to archive 6 URLs daily — plenty to validate your pipeline before committing to a paid plan.
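The sizing arithmetic is simple enough to script. This sketch uses the quotas and prices quoted in this section and assumes a 30-day month. The function names are illustrative.

```python
# (monthly capture quota, price in USD per month), as quoted in this section.
PLANS = [(200, 0), (3000, 9), (10000, 29), (25000, 49)]


def monthly_captures(url_count, captures_per_day=1, days=30):
    """Total captures needed per month."""
    return url_count * captures_per_day * days


def cheapest_plan(url_count, captures_per_day=1):
    """Return the first (quota, price) that covers the workload, else None."""
    needed = monthly_captures(url_count, captures_per_day)
    for quota, price in PLANS:
        if needed <= quota:
            return quota, price
    return None


def max_daily_urls(quota, days=30):
    """How many URLs a quota supports at one capture per day."""
    return quota // days


print(monthly_captures(50))   # 50 URLs daily -> 1500 captures per month
print(cheapest_plan(50))      # fits the 3,000-capture plan
print(max_daily_urls(200))    # free tier at one capture per day
```

Hourly captures change the picture quickly: cheapest_plan(10, captures_per_day=24) needs 7,200 captures a month, which lands on the 10,000-capture plan.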
Practical considerations for long-term archives
Archives grow. If you capture 50 full-page PNGs daily, each averaging 2-4 MB, that's 3-6 GB per month. Over a year, you're looking at 36-72 GB. That's cheap on S3 or Google Cloud Storage (under $2/month for standard storage), but you'll want a retention policy. Do you need daily captures from two years ago, or would monthly snapshots suffice for older records? A simple script that consolidates older captures, keeping the first of each month and deleting the rest, cuts storage by about 97% (one file kept out of roughly thirty) for anything older than 90 days.
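The retention rule can be sketched as a pure function that decides which capture dates to drop, leaving the actual file deletion to the caller. The name captures_to_delete is hypothetical, and "first of each month" is read here as the earliest capture that exists in that month.

```python
from datetime import date, timedelta


def captures_to_delete(capture_dates, today, keep_daily_for_days=90):
    """Pick the capture dates that a monthly-consolidation policy would remove.

    Captures newer than the cutoff are all kept. For older captures, the
    earliest one in each calendar month is kept and the rest are returned.
    """
    cutoff = today - timedelta(days=keep_daily_for_days)
    kept_months = set()
    to_delete = []
    for captured in sorted(capture_dates):
        if captured >= cutoff:
            continue  # Still inside the daily-retention window.
        month = (captured.year, captured.month)
        if month in kept_months:
            to_delete.append(captured)
        else:
            kept_months.add(month)  # First capture seen for this month.
    return to_delete


january = [date(2026, 1, day) for day in range(1, 32)]
recent = [date(2026, 6, day) for day in range(1, 6)]
doomed = captures_to_delete(january + recent, today=date(2026, 6, 27))
print(len(doomed), "of", len(january), "January captures would be deleted")
```

Each returned date maps back to a file as archive/{label}/{date.isoformat()}.png. Printing the list before deleting anything is a cheap dry run.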
File naming matters more than you'd think. A flat folder with 10,000 files named screenshot-1.png through screenshot-10000.png is useless. Organize by target and date: archive/{label}/{YYYY-MM-DD}.png. The scripts above already follow this pattern. When someone asks "what did their pricing page look like in March?", you open a folder and grab the file.
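With that layout, answering the March question is a filename filter. This is a small sketch with a hypothetical helper name, relying on the fact that ISO dates sort chronologically as plain strings.

```python
import os


def captures_for_month(archive_dir, label, year, month):
    """List one target's capture files for a given month, oldest first."""
    prefix = f"{year:04d}-{month:02d}-"
    folder = os.path.join(archive_dir, label)
    # YYYY-MM-DD filenames sort chronologically as plain strings.
    return sorted(name for name in os.listdir(folder) if name.startswith(prefix))
```

For example, captures_for_month('./archive', 'competitor-pricing', 2026, 3) returns every March 2026 capture for that target and skips log.json, which does not match the date prefix.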
For compliance use cases where archives may need to be presented as evidence, consider capturing in both PNG and PDF. The PNG gives you a pixel-perfect visual record for comparison. The PDF version preserves searchable text and is the format most commonly accepted in legal and regulatory proceedings. One extra API call per page gives you both.
Start archiving websites automatically
A website archiving API replaces the uncertainty of hoping the Wayback Machine crawls your target with the reliability of a scheduled capture you control. Whether you're building a compliance archive for regulatory audits, preserving evidence for legal proceedings, tracking competitor changes, or documenting your own site's history, the architecture is the same: an API call, a timestamp, a stored file. The viewport settings keep every capture consistent. Cookie blocking removes consent overlays. Full-page capture ensures nothing hides below the fold. And if the archive also needs to double as website thumbnails for a directory or dashboard, adding resize_width to the request gives you a display-ready version from the same capture. The pages you're watching today might not exist tomorrow. Archive them while they're still there.