How to Open Screaming Frog Crawls Directly in Python

Anyone who works in Screaming Frog on a regular basis knows the rhythm of an audit.

The crawl finishes, you open the GUI, click into one of the reports, export to CSV, then spend a few minutes cleaning columns before the actual analysis can begin.

Multiply this by titles, response codes, inlinks, canonicals, structured data, and the other recurring reports, and you have already burned a good portion of the afternoon on what is, fundamentally, preparatory work.

The screamingfrog Python library was built precisely to skip this step. It opens a .dbseospider file directly inside Python and lets you query the underlying crawl through the same logical units that the GUI presents to the user: pages, links, and the individual report tabs.

When the typed interface is not enough, the library also exposes raw SQL for those situations where the question is unusual. In my own workflow I have not exported a CSV from the GUI in several months, and I do not miss it.

In this guide we will go through a series of operations against a real crawl file: loading the .dbseospider, querying pages and links, pulling familiar tab reports by name, scoping queries to a section of the site, and finally falling back on raw SQL when the typed API is not enough.

The walkthrough closes with a short end-to-end example that ties the four interfaces together.

A clarification before we begin. The library is not a replacement for Screaming Frog itself, since the crawler remains the data source. It is a replacement for the export-clean-analyze loop that usually follows every crawl, which is a different proposition entirely.

What the screamingfrog library changes

Screaming Frog already stores the crawl in a database-backed format. The application reads from it, the GUI displays from it, and the exports come out of it. The screamingfrog Python library reads from that same crawl data and gives you a programmatic interface on top of it. Nothing is re-parsed from a CSV, and nothing is derived from the export pipeline.

There are four main interfaces for asking questions of the crawl:

  • pages() for sitewide page-level analysis
  • links() for sitewide link-level analysis
  • tab(…) for a specific Screaming Frog report or export
  • raw() and sql() for custom queries and edge cases

This tutorial focuses on the .dbseospider format because it is the cleanest starting point. It is a self-contained crawl database that the library can open without spinning up the GUI or running a new crawl, so if you already have a .dbseospider file from a previous audit, you can start working on it in Python immediately.

Install the library and get a crawl file ready

You need a Python environment and the screamingfrog package installed in it. The library is available on PyPI, so a regular pip install is the starting point in most setups.

If you plan to open .dbseospider files directly, make sure a Java runtime is available on your system as well. The library reads the underlying Derby crawl database, and that requires Java to be present.
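Before going further, it can save some confusion to confirm both prerequisites from Python itself. The sketch below is a rough check and not part of the library: it only tests whether a package named screamingfrog is importable and whether a java executable is on the PATH. The function name check_environment is my own, and a Java runtime that is located some other way (through JAVA_HOME, for example) would not be detected here.

```python
import importlib.util
import shutil


def check_environment():
    """Report whether the two prerequisites appear to be in place."""
    return {
        # find_spec returns None when the package cannot be found
        "screamingfrog_installed": importlib.util.find_spec("screamingfrog") is not None,
        # shutil.which returns None when no `java` executable is on PATH
        "java_on_path": shutil.which("java") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

If the first entry reports missing, the pip install mentioned above is the fix. If the second one does, install a Java runtime before trying to open a .dbseospider file.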

You also need a .dbseospider file on disk. If you crawl in database storage mode, you already have one in your project folder. If you only have a .seospider file from a memory-only crawl, the library can convert it, but that is a slightly different workflow and we will not cover it here.

For the rest of this guide, assume there is a file named example-crawl.dbseospider sitting in the working directory of your script or notebook.

Load a crawl in Python

The entry point is Crawl.load(…). You give it a path, and you get back a Crawl object that holds the open connection to the crawl data.
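A minimal sketch of that call follows. The import path (from screamingfrog import Crawl) is an assumption based on the package name, and load_crawl is a hypothetical wrapper I added so that a mistyped path fails with a clear message before the library is touched.

```python
from pathlib import Path


def load_crawl(path="example-crawl.dbseospider"):
    """Open a .dbseospider file and return the Crawl object."""
    crawl_file = Path(path)
    if not crawl_file.is_file():
        raise FileNotFoundError(f"No crawl file at {crawl_file.resolve()}")
    # Imported lazily so the path check above works even when the package
    # is not installed yet. The import path itself is an assumption.
    from screamingfrog import Crawl
    return Crawl.load(str(crawl_file))
```

In a script or notebook, crawl = load_crawl() is then the only line you need before querying.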

That, somewhat anticlimactically, is the entire setup. The crawl is now open and ready to be queried.

If you want a quick look at what is inside the file before writing any real queries, the attribute crawl.tabs lists the tabs that the backend knows about.
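For example, assuming crawl.tabs can be iterated and yields tab names (I do not rely on its exact type here), a small hypothetical helper can search it by keyword instead of scrolling through the full list:

```python
def find_tabs(crawl, keyword):
    """Return the entries of crawl.tabs whose name contains `keyword`."""
    needle = keyword.lower()
    # str() guards against tab entries that are not plain strings
    return [name for name in crawl.tabs if needle in str(name).lower()]
```

Calling find_tabs(crawl, "title") would then return only the tabs with "title" in their name.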

There is one distinction worth noting at this stage. You are not loading an export. You are loading the crawl itself. Every query runs against the crawl data rather than a CSV export, with the .dbseospider file remaining the source of truth.

Query pages sitewide with pages()

For the majority of audit questions that begin with “which pages…”, the answer lives in pages(). It is the sitewide page view of the crawl, and in spirit it corresponds to the Internal All tab in the GUI, exposed here as a chainable query interface.

The simplest invocation collects everything in a single call:
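Here is a minimal sketch of that call and of a first use for its result. It assumes each collected row behaves like a plain dictionary and that the status code sits under the key status_code, mirroring the normalized field names. The helper status_code_counts, the address key, and the sample rows are illustrative only.

```python
from collections import Counter


def status_code_counts(rows):
    """Count rows per status code; rows are dictionaries from .collect()."""
    return Counter(row.get("status_code") for row in rows)


# With a loaded crawl, the rows would come from the single call described above:
#     rows = crawl.pages().collect()
# The sample below stands in for that result so the sketch runs on its own.
sample_rows = [
    {"address": "https://example.com/", "status_code": 200},
    {"address": "https://example.com/about", "status_code": 200},
    {"address": "https://example.com/old-page", "status_code": 404},
]

print(status_code_counts(sample_rows))
```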

The method .collect() materializes the result as a list of dictionaries. Each row is a page, with its tracked columns attached.

In day-to-day work you will almost never want the full unfiltered result. The filter is normally applied first:

The call .filter(status_code=404) is the most common shape you will end up writing, and it is the easiest entry point into the API. The keyword arguments correspond to normalized Screaming Frog field names. The same style works for values like indexability="Non-Indexable" and other recurring filters.

When the rows still carry too many columns to read at a glance, narrow them with .select(…):
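Below is a sketch of that chain, wrapped in a function so it can be reused across crawls. Two things are assumptions on my part: that .select(…) takes column names as positional arguments, and that address is the normalized name of the URL column. The names status_code and indexability follow the fields used earlier, and list_404_pages is a hypothetical helper.

```python
def list_404_pages(crawl, columns=("address", "status_code", "indexability")):
    """Return 404 pages, narrowed to a few readable columns."""
    return (
        crawl.pages()
        .select(*columns)          # keep only the columns we need
        .filter(status_code=404)   # keep only the rows that matter
        .collect()                 # materialize as a list of dictionaries
    )
```

Swapping the filter keyword, for example to indexability="Non-Indexable", turns the same function into a non-indexable page audit.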

This is the pattern you will reach for most often. Select the columns you actually need, filter to the rows that matter, and then call collect(). The same chain handles 404 lists, non-indexable page audits, and title QA without any structural change.

If the question starts with “which pages…”, you should start with pages().

Query the link graph with links()

Page-level data and link-level data are not interchangeable, and they answer different audit questions. A URL returning 404 is one piece of information. The set of internal pages still linking to that URL is the piece of information that actually allows the fix to be deployed. The function links() is the interface for the second category.

The call links("in") returns the inlink view, that is, what links point at a given destination. The call links("out") returns the outlink view, that is, what a given source page links to. The default direction is "out".

A good place to start is to look at what a single row actually contains:

The most useful audit question that follows on top of this view is broken inlinks. Which source pages still link to URLs that now return a 4xx?

What this returns is a per-link list, not a per-destination list. A single broken URL with eight internal references will appear as eight separate rows. That is usually what you want when the deliverable is “go and fix the eight places that still link here,” because the unit of remediation is the source page, not the destination.
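When you also want a per-destination summary, for instance to rank broken URLs by how many pages still reference them, the per-link rows can be regrouped in plain Python. The sketch below assumes the collected rows are dictionaries and that the two ends of a link sit under the keys source and destination. Those key names are a guess on my part, which is why they are parameters, and the sample rows are made up.

```python
from collections import defaultdict


def sources_by_destination(link_rows, source_key="source", destination_key="destination"):
    """Group per-link rows into {destination: [source, source, ...]}."""
    grouped = defaultdict(list)
    for row in link_rows:
        grouped[row[destination_key]].append(row[source_key])
    return dict(grouped)


# Stand-in for the rows a broken-inlinks query would return.
sample_links = [
    {"source": "https://example.com/", "destination": "https://example.com/gone"},
    {"source": "https://example.com/blog", "destination": "https://example.com/gone"},
    {"source": "https://example.com/blog", "destination": "https://example.com/missing"},
]

for destination, sources in sources_by_destination(sample_links).items():
    print(destination, len(sources))
```

The per-link list remains the deliverable for whoever edits the source pages. The grouped view is only there to help with prioritization.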

Another common workflow is nofollow inlinks. This is useful when you suspect internal links are wasting link equity, or when a template update went wrong and now emits nofollow attributes where it should not:

If the question starts with “which links…”, you should start with links().

Access familiar Screaming Frog reports with tab()

If you have used Screaming Frog for any length of time, you already think about your data in terms of reports. Page Titles. Response Codes. Internal All. The function tab(…) is the bridge between that mental model and the code.

You pass the tab name as it appears in the export structure, and you get a chainable view of its rows in return.

The filters that appear in the GUI dropdown for a given tab are also available in code, under the same GUI-style labels:

Response codes work in the same way:

If you do not remember which columns a particular tab carries, the helper tab_columns(…) gives you the answer without making you guess:
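A small sketch that combines the two calls into a quick overview of a tab. I am assuming here that tab_columns(…) is called on the crawl object with the tab name and returns an iterable of column names, and that the view returned by tab(…) supports .collect() like the other views. The tab name internal_all is used as the default, and tab_overview is a hypothetical helper.

```python
def tab_overview(crawl, tab_name="internal_all"):
    """Return the column names and row count of one tab."""
    columns = list(crawl.tab_columns(tab_name))
    rows = crawl.tab(tab_name).collect()
    return {"tab": tab_name, "columns": columns, "row_count": len(rows)}
```

Running this once per tab you intend to use is a cheap way to learn the column names before writing any .select(…) or .filter(…) call against it.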

A note on when to reach for tab(…) versus pages(). The function pages() is usually a better choice for general “give me a list of pages with property X” questions, because it operates on the unified page view.

The function tab(…) is the better choice when you already know the specific report you want, when you want the GUI-style filter labels available to you, or when the question is naturally framed around a particular Screaming Frog tab name.

Scope a workflow to a section

Not every audit question is sitewide. A lot of real work is “what is going on in /blog” or “show me the products folder only.” The function section(…) scopes any of the views above to a URL prefix or a path prefix.

It composes with links() in the same way:

It is helpful to treat section(…) as a scoping primitive, and not as a separate API surface. Anything you can do sitewide, you can do scoped to a folder by chaining off section(…) first.
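As a sketch of that chaining, the hypothetical function below scopes once and then asks one page-level and one link-level question inside the section. It assumes section(…) is called on the crawl object with the prefix as its only argument and returns something that offers pages() and links() just like the crawl does.

```python
def section_report(crawl, prefix="/blog"):
    """Collect 404 pages and inlinks for one section of the site."""
    scoped = crawl.section(prefix)  # scope first, then query as usual
    return {
        "pages_404": scoped.pages().filter(status_code=404).collect(),
        "inlinks": scoped.links("in").collect(),
    }
```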

Use raw() and sql() when you need more control

Eventually, a question will appear that the typed views do not answer cleanly. The filter you want is not exposed as a keyword argument. You want to inspect the backend tables directly. You need a precise WHERE clause that the typed API has not foreseen. That is what raw() and sql() exist for.

The function raw(…) returns rows from a backend table directly. It is useful for inspecting a table when you do not yet know its shape:

The function sql(…) runs an arbitrary SQL string against the same backend. Parameters are passed positionally as a list:
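One place where positional parameters pay off is an IN clause over a list of values. The sketch below builds one ? placeholder per value and passes the values as the list. It assumes sql(…) takes the statement first and the parameter list second, and that ?-style placeholders are accepted. The table and column arguments are hypothetical, since the real backend table names have to be looked up first, and rows_matching_any is my own helper.

```python
def rows_matching_any(crawl, table, column, values):
    """Run SELECT * FROM table WHERE column IN (...) with bound parameters."""
    values = list(values)
    if not values:
        return []  # an empty IN () is not valid SQL, so skip the query
    placeholders = ", ".join("?" for _ in values)
    # Only the values are bound as parameters. The table and column names are
    # interpolated, so they must come from your own code, never from user input.
    statement = f"SELECT * FROM {table} WHERE {column} IN ({placeholders})"
    return crawl.sql(statement, values)
```

Binding the values instead of formatting them into the string keeps quoting problems out of the query, which matters as soon as the values are URLs.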

If you prefer SQL semantics but would rather not write the full SELECT by hand, query(…) exposes the backend through a chainable builder:

A small warning here. The SQL interface is the most flexible of the four, which is exactly what makes it tempting to start with. I would suggest you do not.

The typed views carry column normalization, GUI-style filters, and per-tab logic that you would otherwise have to rebuild yourself in raw SQL. Start at pages() or tab(…). Drop lower only when something genuinely does not fit.

Treat raw SQL as an escape hatch, not as the default path.

Build a simple no-CSV workflow

Most of what an audit needs from day to day is some combination of a page-level issue list, the link impact of those issues, a familiar tab report, and the occasional custom query. All four happen against the same loaded crawl, in a single process, without any export step in between.
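The sketch below strings the four steps together as one function, using only the calls introduced so far. Several details are assumptions: the row keys address and destination stand in for the normalized column names, run_audit is a name I made up, and the SQL statement is left as an argument because it depends on backend tables that you need to inspect first.

```python
def run_audit(crawl, tab_name="internal_all", custom_sql=None, custom_params=None,
              address_key="address", destination_key="destination"):
    """Run a four-step audit against one loaded crawl, with no export step."""
    # Step 1: page-level issue list
    broken_pages = crawl.pages().filter(status_code=404).collect()
    broken_urls = {row[address_key] for row in broken_pages}

    # Step 2: link impact, i.e. the inlinks whose destination is a broken page.
    # This collects every inlink and filters in Python, which is simple but
    # may be slow on very large crawls.
    inlinks = crawl.links("in").collect()
    broken_inlinks = [row for row in inlinks if row.get(destination_key) in broken_urls]

    # Step 3: a tab report the team already knows
    tab_rows = crawl.tab(tab_name).collect()

    # Step 4: optional escape hatch, only when a custom statement is supplied
    custom_rows = crawl.sql(custom_sql, custom_params or []) if custom_sql else None

    return {
        "broken_pages": broken_pages,
        "broken_inlinks": broken_inlinks,
        "tab_rows": tab_rows,
        "custom_rows": custom_rows,
    }
```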

It helps to read the example top to bottom as a pattern, and not as a finished script. Step one is the page-level issue list. Step two is the link impact, so that the fix has a destination, not just a problem. Step three pulls a tab that the team already trusts, which keeps the output legible to anyone who reviews it later. Step four is the optional escape hatch, used only because the question got specific enough that the typed views would not have been faster.

In a CSV-driven workflow, this would be at least four exports, four cleanups, and a manual stitch at the end. Here, it is one file open and four queries running against the same crawl.

Decision framework: which interface should you use?

The four interfaces overlap, and the overlap is intentional. Knowing which one to reach for first is the single thing that saves the most time once you start writing real scripts.

  • pages() is for page-level questions: 404 lists, noindex audits, title QA, and anything shaped like “which pages have property X”.
  • links() is for link-level questions: broken inlinks, nofollow inlinks, outgoing link checks, and anchor text reviews.
  • tab(…) is for specific Screaming Frog reports or exports: page titles, response codes, internal_all, and anything else you already know by its GUI tab name.
  • raw() and sql() are for advanced or unusual queries: custom filtering, raw APP tables, joins, and edge cases that the typed API does not cover cleanly.

The rule of thumb is to start as high-level as possible, and to drop lower only when the question genuinely demands it.

Common mistakes to avoid

A few patterns show up over and over again with people who are coming from the CSV workflow.

  • Starting with SQL too early. The typed views exist for a reason. Use them first, and drop down only when they actually fall short.
  • Skipping .select(…). Pulling every column when you only need four makes the output harder to read.
  • Defaulting to tab(…) for everything. If your question is page-level, pages() is usually shorter and cleaner.
  • Treating the library as a CSV replacement. The point is direct access to the crawl, not a faster path to a spreadsheet.
  • Ignoring link-level workflows. Page-level data tells you what is broken. Link-level data tells you what to actually fix.

Conclusion and next step

The real value of this library is not the Python API itself. The real value is that you stop running the export-clean-analyze loop entirely. You open a .dbseospider once, ask page-level and link-level questions against it, pull the same reports you would have pulled in the GUI, and drop into raw SQL only when the question genuinely needs it.

You should now be comfortable doing four things directly against a crawl file:

  • Opening a .dbseospider with Crawl.load(…)
  • Querying pages with pages(), filtered and selected down to the columns you actually care about
  • Querying the link graph with links("in") or links("out")
  • Falling back on tab(…), raw(), or sql() when the question gets specific

In the next tutorial, we will use this same workflow to do something more technical: finding broken inlinks, locating orphan pages, and tracing redirect chains directly from the crawl, without exporting a single CSV file.

ANTONIO ATILIO MACULUS

Hi! My name is Antonio, based in Buenos Aires. I’m Director of AI Content Strategy at Better Collective and the founder of AuditLabs, where I host free SEO tools alongside my consulting work.
