Core API
This section documents the Scrapy core API, and it’s intended for developers of extensions and middlewares.
Crawler API
The main entry point to the Scrapy API is the Crawler
object, which components receive at initialization through the
from_crawler class method. It provides access to all Scrapy core
components, and it is the only way for components to access them and hook their
functionality into Scrapy.
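For illustration, here is a minimal sketch of a component that receives the crawler at initialization (the MyExtension name and the MYEXTENSION_ENABLED setting are invented for the example):

    class MyExtension:
        """Hypothetical component; Scrapy builds it via from_crawler."""

        def __init__(self, crawler):
            # The crawler exposes core components such as settings, signals and stats
            self.enabled = crawler.settings.getbool("MYEXTENSION_ENABLED")

        @classmethod
        def from_crawler(cls, crawler):
            # Scrapy calls this class method with the running Crawler
            # to instantiate the component
            return cls(crawler)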
The Extension Manager is responsible for loading and keeping track of installed
extensions. It is configured through the EXTENSIONS setting, which
contains a dictionary of all available extensions and their orders, similar to
how you configure the downloader middlewares.
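For example, extensions might be enabled in a project’s settings.py like this (the myproject path is hypothetical; the integer values set the order, as with middlewares):

    EXTENSIONS = {
        "scrapy.extensions.corestats.CoreStats": 500,
        "myproject.extensions.MyExtension": 600,  # hypothetical custom extension
    }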
Settings API
- scrapy.settings.SETTINGS_PRIORITIES
Dictionary that defines the code name and integer priority level of each of the default settings sources used in Scrapy.
Each item defines a settings entry point, giving it a code name for identification and an integer priority. Greater priorities take precedence over lesser ones when setting and retrieving values in the Settings class.

    SETTINGS_PRIORITIES = {
        "default": 0,
        "command": 10,
        "addon": 15,
        "project": 20,
        "spider": 30,
        "cmdline": 40,
    }
For a detailed explanation of each settings source, see: Settings.
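As a short sketch of how these priorities resolve when the same key is set from two sources (the key and values are arbitrary):

    from scrapy.settings import Settings

    settings = Settings()
    settings.set("CONCURRENT_REQUESTS", 16, priority="default")  # priority 0
    settings.set("CONCURRENT_REQUESTS", 32, priority="project")  # priority 20
    # "project" outranks "default", so retrieval returns 32
    assert settings.getint("CONCURRENT_REQUESTS") == 32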
SpiderLoader API
- class scrapy.spiderloader.SpiderLoader
This class is in charge of retrieving and handling the spider classes defined across the project.
Custom spider loaders can be employed by specifying their path in the
SPIDER_LOADER_CLASS project setting. They must fully implement the scrapy.interfaces.ISpiderLoader interface to guarantee error-free execution.
- from_settings(settings)
This class method is used by Scrapy to create an instance of the class. It’s called with the current project settings, and it loads the spiders found recursively in the modules of the
SPIDER_MODULES setting.
- Parameters:
settings (Settings instance) – project settings
- load(spider_name)
Get the Spider class with the given name. It’ll look into the previously loaded spiders for a spider class with name
spider_name and will raise a KeyError if not found.
- Parameters:
spider_name (str) – spider class name
- list()
Get the names of the available spiders in the project.
- find_by_request(request)
List the names of the spiders that can handle the given request. It will try to match the request’s URL against the domains of the spiders.
- Parameters:
request (Request instance) – queried request
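Putting these methods together, a minimal usage sketch (the "example" spider name is hypothetical):

    from scrapy.spiderloader import SpiderLoader
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()              # reads the project settings
    loader = SpiderLoader.from_settings(settings)
    print(loader.list())                           # names of spiders in SPIDER_MODULES
    spider_cls = loader.load("example")            # raises KeyError if no such spider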
Signals API
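Signals are connected and disconnected through the crawler’s signals attribute, the signal manager. A minimal sketch, assuming a crawler reference is available (the handler is invented for the example):

    from scrapy import signals

    def on_item_scraped(item, response, spider):
        # Called each time an item is scraped
        spider.logger.info("Scraped an item from %s", response.url)

    # Inside a component that holds a crawler reference:
    crawler.signals.connect(on_item_scraped, signal=signals.item_scraped)
    # A handler can later be removed with the matching disconnect() call:
    crawler.signals.disconnect(on_item_scraped, signal=signals.item_scraped)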
Stats Collector API
There are several Stats Collectors available under the
scrapy.statscollectors module, all of which implement the Stats
Collector API defined by the StatsCollector
class (which they all inherit from).
- class scrapy.statscollectors.StatsCollector
- get_value(key, default=None)
Return the value for the given stats key or default if it doesn’t exist.
- get_stats()
Get all stats from the currently running spider as a dict.
- set_value(key, value)
Set the given value for the given stats key.
- set_stats(stats)
Override the current stats with the dict passed in the stats argument.
- inc_value(key, count=1, start=0)
Increment the value of the given stats key by the given count, assuming the given start value when the key is not set yet.
- max_value(key, value)
Set the given value for the given key only if the current value for the same key is lower than value. If there is no current value for the given key, the value is always set.
- min_value(key, value)
Set the given value for the given key only if the current value for the same key is greater than value. If there is no current value for the given key, the value is always set.
- clear_stats()
Clear all stats.
The following methods are not part of the stats collection API, but are instead used when implementing custom stats collectors:
- open_spider()
Open the spider for stats collection.
- close_spider()
Close the spider. After this is called, no more specific stats can be accessed or collected.
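To illustrate the API above, here is a minimal sketch of a hypothetical extension that records stats through the crawler’s stats attribute (the ItemCounter name and stat keys are invented):

    from scrapy import signals

    class ItemCounter:
        """Hypothetical extension exercising the stats API."""

        def __init__(self, stats):
            self.stats = stats

        @classmethod
        def from_crawler(cls, crawler):
            ext = cls(crawler.stats)
            crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
            return ext

        def item_scraped(self, item, response, spider):
            # First call starts from 0 (the default start) and sets the key to 1
            self.stats.inc_value("itemcounter/items_scraped")
            # Only overwrites the key when the new value exceeds the current one
            self.stats.max_value("itemcounter/max_url_length", len(response.url))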