.. _topics-index:

==============================
Scrapy |version| documentation
==============================

Scrapy is a fast high-level `web crawling`_ and `web scraping`_ framework, used
to crawl websites and extract structured data from their pages. It can be used
for a wide range of purposes, from data mining to monitoring and automated
testing.

.. _web crawling: https://en.wikipedia.org/wiki/Web_crawler
.. _web scraping: https://en.wikipedia.org/wiki/Web_scraping

.. _getting-help:

Getting help
============

Having trouble? We'd like to help!

* Try the :doc:`FAQ <faq>` -- it's got answers to some common questions.
* Looking for specific information? Try the :ref:`genindex` or :ref:`modindex`.
* Ask or search questions on `StackOverflow using the scrapy tag`_.
* Ask or search questions in the `Scrapy subreddit`_.
* Search for questions on the archives of the `scrapy-users mailing list`_.
* Ask a question in the `#scrapy IRC channel`_.
* Report bugs with Scrapy in our `issue tracker`_.
* Join the `Scrapy Discord`_ community.

.. _scrapy-users mailing list: https://groups.google.com/forum/#!forum/scrapy-users
.. _Scrapy subreddit: https://www.reddit.com/r/scrapy/
.. _StackOverflow using the scrapy tag: https://stackoverflow.com/tags/scrapy
.. _#scrapy IRC channel: irc://irc.freenode.net/scrapy
.. _issue tracker: https://github.com/scrapy/scrapy/issues
.. _Scrapy Discord: https://discord.com/invite/mv3yErfpvq

First steps
===========

.. toctree::
   :caption: First steps
   :hidden:

   intro/overview
   intro/install
   intro/tutorial
   intro/examples

:doc:`intro/overview`
    Understand what Scrapy is and how it can help you.

:doc:`intro/install`
    Get Scrapy installed on your computer.

:doc:`intro/tutorial`
    Write your first Scrapy project.

:doc:`intro/examples`
    Learn more by playing with a pre-made Scrapy project.

.. _section-basics:

Basic concepts
==============

.. toctree::
   :caption: Basic concepts
   :hidden:

   topics/commands
   topics/spiders
   topics/selectors
   topics/items
   topics/loaders
   topics/shell
   topics/item-pipeline
   topics/feed-exports
   topics/request-response
   topics/link-extractors
   topics/settings
   topics/exceptions

:doc:`topics/commands`
    Learn about the command-line tool used to manage your Scrapy project.

:doc:`topics/spiders`
    Write the rules to crawl your websites.

:doc:`topics/selectors`
    Extract the data from web pages using XPath or CSS expressions.

:doc:`topics/shell`
    Test your extraction code in an interactive environment.

:doc:`topics/items`
    Define the data you want to scrape.

:doc:`topics/loaders`
    Populate your items with the extracted data.

:doc:`topics/item-pipeline`
    Post-process and store your scraped data.

:doc:`topics/feed-exports`
    Output your scraped data using different formats and storages.

:doc:`topics/request-response`
    Understand the classes used to represent HTTP requests and responses.

:doc:`topics/link-extractors`
    Convenient classes to extract links to follow from pages.

:doc:`topics/settings`
    Learn how to configure Scrapy and see all :ref:`available settings <topics-settings-ref>`.

:doc:`topics/exceptions`
    See all available exceptions and their meaning.

Built-in services
=================

.. toctree::
   :caption: Built-in services
   :hidden:

   topics/logging
   topics/stats
   topics/telnetconsole

:doc:`topics/logging`
    Learn how to use Python's built-in logging with Scrapy.

:doc:`topics/stats`
    Collect statistics about your scraping crawler.

:doc:`topics/telnetconsole`
    Inspect a running crawler using a built-in Python console.

Solving specific problems
=========================

.. toctree::
   :caption: Solving specific problems
   :hidden:

   faq
   topics/debug
   topics/contracts
   topics/practices
   topics/broad-crawls
   topics/developer-tools
   topics/dynamic-content
   topics/leaks
   topics/media-pipeline
   topics/deploy
   topics/autothrottle
   topics/benchmarking
   topics/jobs
   topics/coroutines
   topics/asyncio

:doc:`faq`
    Get answers to the most frequently asked questions.

:doc:`topics/debug`
    Learn how to debug common problems of your Scrapy spider.

:doc:`topics/contracts`
    Learn how to use contracts for testing your spiders.

:doc:`topics/practices`
    Get familiar with some Scrapy common practices.

:doc:`topics/broad-crawls`
    Tune Scrapy for crawling many domains in parallel.

:doc:`topics/developer-tools`
    Learn how to scrape with your browser's developer tools.

:doc:`topics/dynamic-content`
    Read webpage data that is loaded dynamically.

:doc:`topics/leaks`
    Learn how to find and get rid of memory leaks in your crawler.

:doc:`topics/media-pipeline`
    Download files and/or images associated with your scraped items.

:doc:`topics/deploy`
    Deploy your Scrapy spiders and run them on a remote server.

:doc:`topics/autothrottle`
    Adjust crawl rate dynamically based on load.

:doc:`topics/benchmarking`
    Check how Scrapy performs on your hardware.

:doc:`topics/jobs`
    Learn how to pause and resume crawls for large spiders.

:doc:`topics/coroutines`
    Use the :ref:`coroutine syntax <async>`.

:doc:`topics/asyncio`
    Use :mod:`asyncio` and :mod:`asyncio`-powered libraries.

.. _extending-scrapy:

Extending Scrapy
================

.. toctree::
   :caption: Extending Scrapy
   :hidden:

   topics/architecture
   topics/addons
   topics/downloader-middleware
   topics/spider-middleware
   topics/extensions
   topics/signals
   topics/scheduler
   topics/exporters
   topics/download-handlers
   topics/components
   topics/api

:doc:`topics/architecture`
    Understand the Scrapy architecture.

:doc:`topics/addons`
    Enable and configure third-party extensions.

:doc:`topics/downloader-middleware`
    Customize how pages get requested and downloaded.

:doc:`topics/spider-middleware`
    Customize the input and output of your spiders.

:doc:`topics/extensions`
    Extend Scrapy with your custom functionality.

:doc:`topics/signals`
    See all available signals and how to work with them.

:doc:`topics/scheduler`
    Understand the scheduler component.

:doc:`topics/exporters`
    Quickly export your scraped items to a file (XML, CSV, etc.).

:doc:`topics/download-handlers`
    Customize how requests are downloaded or add support for new URL schemes.

:doc:`topics/components`
    Learn the common API and some good practices when building custom Scrapy
    components.

:doc:`topics/api`
    Use it from extensions and middlewares to extend Scrapy functionality.

All the rest
============

.. toctree::
   :caption: All the rest
   :hidden:

   news
   contributing
   versioning

:doc:`news`
    See what has changed in recent Scrapy versions.

:doc:`contributing`
    Learn how to contribute to the Scrapy project.

:doc:`versioning`
    Understand Scrapy versioning and API stability.


.. _intro-overview:

==================
Scrapy at a glance
==================

Scrapy (/ˈskreɪpaɪ/) is an application framework for crawling websites and
extracting structured data, which can be used for a wide range of useful
applications, like data mining, information processing or historical archival.

Even though Scrapy was originally designed for `web scraping`_, it can also be
used to extract data using APIs (such as `Amazon Associates Web Services`_) or
as a general purpose web crawler.

Walk-through of an example spider
=================================

To show you what Scrapy brings to the table, we'll walk you through an
example of a Scrapy spider, using the simplest way to run a spider.

Here's the code for a spider that scrapes famous quotes from the website
https://quotes.toscrape.com, following the pagination:

.. code-block:: python

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            "https://quotes.toscrape.com/tag/humor/",
        ]

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "author": quote.xpath("span/small/text()").get(),
                    "text": quote.css("span.text::text").get(),
                }

            next_page = response.css('li.next a::attr("href")').get()
            if next_page is not None:
                yield response.follow(next_page, self.parse)

Put this in a text file, name it something like ``quotes_spider.py``
and run the spider using the :command:`runspider` command::

    scrapy runspider quotes_spider.py -o quotes.jsonl

When this finishes, the ``quotes.jsonl`` file will contain a list of the
quotes in JSON Lines format, with the text and author of each, looking like this::

    {"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}
    {"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}
    {"author": "Garrison Keillor", "text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d"}
    ...

What just happened?
-------------------

When you ran the command ``scrapy runspider quotes_spider.py``, Scrapy looked for a
Spider definition inside it and ran it through its crawler engine.

The crawl started by making requests to the URLs defined in the ``start_urls``
attribute (in this case, only the URL for quotes in the *humor* category)
and called the default callback method ``parse``, passing the response object as
an argument. In the ``parse`` callback, we loop through the quote elements
using a CSS Selector, yield a Python dict with the extracted quote text and author,
look for a link to the next page and schedule another request using the same
``parse`` method as callback.

Here you will notice one of the main advantages of Scrapy: requests are
:ref:`scheduled and processed asynchronously <topics-architecture>`.  This
means that Scrapy doesn't need to wait for a request to be finished and
processed; it can send another request or do other things in the meantime. This
also means that other requests can keep going even if a request fails or an
error happens while handling it.

While this enables you to do very fast crawls (sending multiple concurrent
requests at the same time, in a fault-tolerant way), Scrapy also gives you
control over the politeness of the crawl through :ref:`a few settings
<topics-settings-ref>`. You can do things like setting a download delay between
requests, limiting the number of concurrent requests per domain or per IP, and
even :ref:`using an auto-throttling extension <topics-autothrottle>` that tries
to figure these settings out automatically.
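
For example, a project's ``settings.py`` might combine these options (a
sketch; the values are illustrative, not recommendations):

.. code-block:: python

    # settings.py -- illustrative values; tune them for the sites you crawl
    DOWNLOAD_DELAY = 1.0  # seconds to wait between requests to the same site
    CONCURRENT_REQUESTS_PER_DOMAIN = 4  # cap on parallel requests per domain
    AUTOTHROTTLE_ENABLED = True  # adapt the delay to server load automatically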

.. note::

    This example uses :ref:`feed exports <topics-feed-exports>` to generate the
    JSON Lines file. You can easily change the export format (XML or CSV, for
    example) or the storage backend (FTP or `Amazon S3`_, for example). You can
    also write an :ref:`item pipeline <topics-item-pipeline>` to store the items
    in a database.
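
    For example, exporting CSV instead is just a matter of changing the file
    extension in the command::

        scrapy runspider quotes_spider.py -o quotes.csv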

.. _topics-whatelse:

What else?
==========

You've seen how to extract and store items from a website using Scrapy, but
this is just the surface. Scrapy provides a lot of powerful features for making
scraping easy and efficient, such as:

* Built-in support for :ref:`selecting and extracting <topics-selectors>` data
  from HTML/XML sources using extended CSS selectors and XPath expressions,
  with helper methods for extraction using regular expressions.

* An :ref:`interactive shell console <topics-shell>` (IPython aware) for trying
  out the CSS and XPath expressions to scrape data, which is very useful when writing or
  debugging your spiders.

* Built-in support for :ref:`generating feed exports <topics-feed-exports>` in
  multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP,
  S3, local filesystem).

* Robust encoding support and auto-detection, for dealing with foreign,
  non-standard and broken encoding declarations.

* :ref:`Strong extensibility support <extending-scrapy>`, allowing you to plug
  in your own functionality using :ref:`signals <topics-signals>` and a
  well-defined API (middlewares, :ref:`extensions <topics-extensions>`, and
  :ref:`pipelines <topics-item-pipeline>`).

* A wide range of built-in extensions and middlewares for handling:

  - cookies and session handling
  - HTTP features like compression, authentication, caching
  - user-agent spoofing
  - robots.txt
  - crawl depth restriction
  - and more

* A :ref:`Telnet console <topics-telnetconsole>` for hooking into a Python
  console running inside your Scrapy process, to introspect and debug your
  crawler.

* Plus other goodies like reusable spiders to crawl sites from `Sitemaps`_ and
  XML/CSV feeds, a media pipeline for :ref:`automatically downloading images
  <topics-media-pipeline>` (or any other media) associated with the scraped
  items, a caching DNS resolver, and much more!

What's next?
============

The next steps for you are to :ref:`install Scrapy <intro-install>`,
:ref:`follow through the tutorial <intro-tutorial>` to learn how to create
a full-blown Scrapy project and `join the community`_. Thanks for your
interest!

.. _join the community: https://www.scrapy.org/community
.. _web scraping: https://en.wikipedia.org/wiki/Web_scraping
.. _Amazon Associates Web Services: https://affiliate-program.amazon.com/welcome/ecs
.. _Amazon S3: https://aws.amazon.com/s3/
.. _Sitemaps: https://www.sitemaps.org/index.html


.. _intro-examples:

========
Examples
========

The best way to learn is with examples, and Scrapy is no exception. For this
reason, there is an example Scrapy project named quotesbot_, that you can use to
play and learn more about Scrapy. It contains two spiders for
https://quotes.toscrape.com, one using CSS selectors and another one using XPath
expressions.

The quotesbot_ project is available at: https://github.com/scrapy/quotesbot.
You can find more information about it in the project's README.

If you're familiar with git, you can check out the code. Otherwise you can
download the project as a zip file by clicking
`here <https://github.com/scrapy/quotesbot/archive/master.zip>`_.

.. _quotesbot: https://github.com/scrapy/quotesbot


.. _intro-install:

==================
Installation guide
==================

.. _faq-python-versions:

Supported Python versions
=========================

Scrapy requires Python 3.10+, either the CPython implementation (default) or
the PyPy implementation (see :ref:`python:implementations`).

.. _intro-install-scrapy:

Installing Scrapy
=================

If you're using `Anaconda`_ or `Miniconda`_, you can install the package from
the `conda-forge`_ channel, which has up-to-date packages for Linux, Windows
and macOS.

To install Scrapy using ``conda``, run::

  conda install -c conda-forge scrapy

Alternatively, if you're already familiar with installing Python packages,
you can install Scrapy and its dependencies from PyPI with::

    pip install Scrapy

We strongly recommend that you install Scrapy in :ref:`a dedicated virtualenv <intro-using-virtualenv>`,
to avoid conflicting with your system packages.

Note that sometimes this may require solving compilation issues for some Scrapy
dependencies depending on your operating system, so be sure to check the
:ref:`intro-install-platform-notes`.

For more detailed and platform-specific instructions, as well as
troubleshooting information, read on.

Things that are good to know
----------------------------

Scrapy is written in pure Python and depends on a few key Python packages
(among others):

* `lxml`_, an efficient XML and HTML parser
* `parsel`_, an HTML/XML data extraction library written on top of lxml
* `w3lib`_, a multi-purpose helper for dealing with URLs and web page encodings
* `twisted`_, an asynchronous networking framework
* `cryptography`_ and `pyOpenSSL`_, to deal with various network-level security needs

Some of these packages themselves depend on non-Python packages
that might require additional installation steps depending on your platform.
Please check :ref:`platform-specific guides below <intro-install-platform-notes>`.

In case of any trouble related to these dependencies,
please refer to their respective installation instructions:

* `lxml installation`_
* :doc:`cryptography installation <cryptography:installation>`

.. _lxml installation: https://lxml.de/installation.html

.. _intro-using-virtualenv:

Using a virtual environment (recommended)
-----------------------------------------

TL;DR: We recommend installing Scrapy inside a virtual environment
on all platforms.

Python packages can be installed either globally (a.k.a. system-wide)
or in user space. We do not recommend installing Scrapy system-wide.

Instead, we recommend that you install Scrapy within a so-called
"virtual environment" (:mod:`venv`).
Virtual environments let you avoid conflicts with already-installed Python
system packages (which could break some of your system tools and scripts)
while still installing packages normally with ``pip`` (without ``sudo`` and the like).

See :ref:`tut-venv` on how to create your virtual environment.

Once you have created a virtual environment, you can install Scrapy inside it with ``pip``,
just like any other Python package.
(See :ref:`platform-specific guides <intro-install-platform-notes>`
below for non-Python dependencies that you may need to install beforehand).
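
On Linux or macOS, for example, the whole flow might look like this (a sketch
using the standard :mod:`venv` module; ``scrapy-env`` is just a placeholder
name)::

    python3 -m venv scrapy-env
    source scrapy-env/bin/activate
    pip install scrapy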

.. _intro-install-platform-notes:

Platform specific installation notes
====================================

.. _intro-install-windows:

Windows
-------

Though it's possible to install Scrapy on Windows using pip, we recommend you
install `Anaconda`_ or `Miniconda`_ and use the package from the
`conda-forge`_ channel, which will avoid most installation issues.

Once you've installed `Anaconda`_ or `Miniconda`_, install Scrapy with::

  conda install -c conda-forge scrapy

To install Scrapy on Windows using ``pip``:

.. warning::
    This installation method requires “Microsoft Visual C++” for installing some
    Scrapy dependencies, which demands significantly more disk space than Anaconda.

#. Download and execute `Microsoft C++ Build Tools`_ to install the Visual Studio Installer.

#. Run the Visual Studio Installer.

#. Under the Workloads section, select **C++ build tools**.

#. Check the installation details and make sure the following packages are selected as optional components:

    * **MSVC** (e.g. MSVC v142 - VS 2019 C++ x64/x86 build tools (v14.23))

    * **Windows SDK** (e.g. Windows 10 SDK (10.0.18362.0))

#. Install the Visual Studio Build Tools.

Now, you should be able to :ref:`install Scrapy <intro-install-scrapy>` using ``pip``.

.. _intro-install-ubuntu:

Ubuntu 14.04 or above
---------------------

Scrapy is currently tested with recent-enough versions of lxml,
twisted and pyOpenSSL, and is compatible with recent Ubuntu distributions.
But it should support older versions of Ubuntu too, like Ubuntu 14.04,
albeit with potential issues with TLS connections.

**Don't** use the ``python-scrapy`` package provided by Ubuntu; such packages
are typically too old and slow to catch up with the latest Scrapy release.

To install Scrapy on Ubuntu (or Ubuntu-based) systems, you need to install
these dependencies::

    sudo apt-get install python3 python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

- ``python3-dev``, ``zlib1g-dev``, ``libxml2-dev`` and ``libxslt1-dev``
  are required for ``lxml``
- ``libssl-dev`` and ``libffi-dev`` are required for ``cryptography``

Inside a :ref:`virtualenv <intro-using-virtualenv>`,
you can install Scrapy with ``pip`` after that::

    pip install scrapy

.. note::
    The same non-Python dependencies can be used to install Scrapy in Debian
    Jessie (8.0) and above.

.. _intro-install-macos:

macOS
-----

Building Scrapy's dependencies requires the presence of a C compiler and
development headers. On macOS this is typically provided by Apple’s Xcode
development tools. To install the Xcode command-line tools, open a terminal
window and run::

    xcode-select --install

There's a `known issue <https://github.com/pypa/pip/issues/2468>`_ that
prevents ``pip`` from updating system packages. This has to be addressed to
successfully install Scrapy and its dependencies. Here are some proposed
solutions:

* *(Recommended)* **Don't** use system Python. Install a new, updated version
  that doesn't conflict with the rest of your system. Here's how to do it using
  the `homebrew`_ package manager:

  * Install `homebrew`_ following the instructions in https://brew.sh/

  * Update your ``PATH`` variable to state that homebrew packages should be
    used before system packages (Change ``.bashrc`` to ``.zshrc`` accordingly
    if you're using `zsh`_ as default shell)::

      echo "export PATH=/usr/local/bin:/usr/local/sbin:$PATH" >> ~/.bashrc

  * Reload ``.bashrc`` to ensure the changes have taken place::

      source ~/.bashrc

  * Install python::

      brew install python

*   *(Optional)* :ref:`Install Scrapy inside a Python virtual environment
    <intro-using-virtualenv>`.

  This method is a workaround for the above macOS issue, but it's an overall
  good practice for managing dependencies and can complement the first method.

After any of these workarounds you should be able to install Scrapy::

  pip install Scrapy

PyPy
----

We recommend using the latest PyPy version.
For PyPy3, only Linux installation was tested.

Most Scrapy dependencies now have binary wheels for CPython, but not for PyPy.
This means that these dependencies will be built during installation.
On macOS, you are likely to face an issue with building the ``cryptography``
dependency. The solution to this problem is described
`here <https://github.com/pyca/cryptography/issues/2692#issuecomment-272773481>`_:
run ``brew install openssl`` and then export the flags that this command
recommends (only needed when installing Scrapy). Installing on Linux has no
special issues besides installing build dependencies.
Installing Scrapy with PyPy on Windows is not tested.

You can check that Scrapy is installed correctly by running ``scrapy bench``.
If this command gives errors such as
``TypeError: ... got 2 unexpected keyword arguments``, this means
that setuptools was unable to pick up one PyPy-specific dependency.
To fix this issue, run ``pip install 'PyPyDispatcher>=2.1.0'``.

.. _intro-install-troubleshooting:

Troubleshooting
===============

AttributeError: 'module' object has no attribute 'OP_NO_TLSv1_1'
----------------------------------------------------------------

After you install or upgrade Scrapy, Twisted or pyOpenSSL, you may get an
exception with the following traceback::

    […]
      File "[…]/site-packages/twisted/protocols/tls.py", line 63, in <module>
        from twisted.internet._sslverify import _setAcceptableProtocols
      File "[…]/site-packages/twisted/internet/_sslverify.py", line 38, in <module>
        TLSVersion.TLSv1_1: SSL.OP_NO_TLSv1_1,
    AttributeError: 'module' object has no attribute 'OP_NO_TLSv1_1'

The reason you get this exception is that your system or virtual environment
has a version of pyOpenSSL that your version of Twisted does not support.

To install a version of pyOpenSSL that your version of Twisted supports,
reinstall Twisted with the :code:`tls` extra option::

    pip install twisted[tls]

For details, see `Issue #2473 <https://github.com/scrapy/scrapy/issues/2473>`_.

.. _Python: https://www.python.org/
.. _lxml: https://lxml.de/index.html
.. _parsel: https://pypi.org/project/parsel/
.. _w3lib: https://pypi.org/project/w3lib/
.. _twisted: https://twisted.org/
.. _cryptography: https://cryptography.io/en/latest/
.. _pyOpenSSL: https://pypi.org/project/pyOpenSSL/
.. _setuptools: https://pypi.org/pypi/setuptools
.. _homebrew: https://brew.sh/
.. _zsh: https://www.zsh.org/
.. _Anaconda: https://www.anaconda.com/docs/main
.. _Miniconda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html
.. _Microsoft C++ Build Tools: https://visualstudio.microsoft.com/visual-cpp-build-tools/
.. _conda-forge: https://conda-forge.org/


.. _intro-tutorial:

===============
Scrapy Tutorial
===============

In this tutorial, we'll assume that Scrapy is already installed on your system.
If that's not the case, see :ref:`intro-install`.

We are going to scrape `quotes.toscrape.com <https://quotes.toscrape.com/>`_, a website
that lists quotes from famous authors.

This tutorial will walk you through these tasks:

1. Creating a new Scrapy project
2. Writing a :ref:`spider <topics-spiders>` to crawl a site and extract data
3. Exporting the scraped data using the command line
4. Changing the spider to recursively follow links
5. Using spider arguments

Scrapy is written in Python_. The more you learn about Python, the more you
can get out of Scrapy.

If you're already familiar with other languages and want to learn Python quickly, the
`Python Tutorial`_ is a good resource.

If you're new to programming and want to start with Python, the following books
may be useful to you:

* `Automate the Boring Stuff With Python`_

* `How To Think Like a Computer Scientist`_

* `Learn Python 3 The Hard Way`_

You can also take a look at `this list of Python resources for non-programmers`_,
as well as the `suggested resources in the learnpython-subreddit`_.

.. _Python: https://www.python.org/
.. _this list of Python resources for non-programmers: https://wiki.python.org/moin/BeginnersGuide/NonProgrammers
.. _Python Tutorial: https://docs.python.org/3/tutorial
.. _Automate the Boring Stuff With Python: https://automatetheboringstuff.com/
.. _How To Think Like a Computer Scientist: http://openbookproject.net/thinkcs/python/english3e/
.. _Learn Python 3 The Hard Way: https://learnpythonthehardway.org/python3/
.. _suggested resources in the learnpython-subreddit: https://www.reddit.com/r/learnpython/wiki/index#wiki_new_to_python.3F

Creating a project
==================

Before you start scraping, you will have to set up a new Scrapy project. Enter a
directory where you'd like to store your code and run::

    scrapy startproject tutorial

This will create a ``tutorial`` directory with the following contents::

    tutorial/
        scrapy.cfg            # deploy configuration file

        tutorial/             # project's Python module, you'll import your code from here
            __init__.py

            items.py          # project items definition file

            middlewares.py    # project middlewares file

            pipelines.py      # project pipelines file

            settings.py       # project settings file

            spiders/          # a directory where you'll later put your spiders
                __init__.py

Our first Spider
================

Spiders are classes that you define and that Scrapy uses to scrape information from a website
(or a group of websites). They must subclass :class:`~scrapy.Spider` and define the initial
requests to be made, and optionally, how to follow links in pages and parse the downloaded
page content to extract data.

This is the code for our first Spider. Save it in a file named
``quotes_spider.py`` under the ``tutorial/spiders`` directory in your project:

.. code-block:: python

    from pathlib import Path

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        async def start(self):
            urls = [
                "https://quotes.toscrape.com/page/1/",
                "https://quotes.toscrape.com/page/2/",
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            page = response.url.split("/")[-2]
            filename = f"quotes-{page}.html"
            Path(filename).write_bytes(response.body)
            self.log(f"Saved file {filename}")

As you can see, our Spider subclasses :class:`scrapy.Spider <scrapy.Spider>`
and defines some attributes and methods:

* :attr:`~scrapy.Spider.name`: identifies the Spider. It must be
  unique within a project, that is, you can't set the same name for different
  Spiders.

* :meth:`~scrapy.Spider.start`: must be an asynchronous generator that
  yields requests (and, optionally, items) for the spider to start crawling.
  Subsequent requests will be generated successively from these initial
  requests.

* :meth:`~scrapy.Spider.parse`: a method that will be called to handle
  the response downloaded for each of the requests made. The response parameter
  is an instance of :class:`~scrapy.http.TextResponse` that holds
  the page content and has further helpful methods to handle it.

  The :meth:`~scrapy.Spider.parse` method usually parses the response, extracting
  the scraped data as dicts and also finding new URLs to
  follow and creating new requests (:class:`~scrapy.Request`) from them.

How to run our spider
---------------------

To put our spider to work, go to the project's top-level directory and run::

   scrapy crawl quotes

This command runs the spider named ``quotes`` that we've just added, which
will send some requests to the ``quotes.toscrape.com`` domain. You will get an
output similar to this::

    ... (omitted for brevity)
    2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
    2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://quotes.toscrape.com/robots.txt> (referer: None)
    2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/1/> (referer: None)
    2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/2/> (referer: None)
    2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
    2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
    2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
    ...

Now, check the files in the current directory. You should notice that two new
files have been created: *quotes-1.html* and *quotes-2.html*, with the content
for the respective URLs, as our ``parse`` method instructs.

.. note:: If you are wondering why we haven't parsed the HTML yet, hold
  on, we will cover that soon.

What just happened under the hood?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Scrapy sends the first :class:`scrapy.Request <scrapy.Request>` objects yielded
by the :meth:`~scrapy.Spider.start` spider method. Upon receiving a
response for each one, Scrapy calls the callback method associated with the
request (in this case, the ``parse`` method) with a
:class:`~scrapy.http.Response` object.

A shortcut to the ``start`` method
----------------------------------

Instead of implementing a :meth:`~scrapy.Spider.start` method that yields
:class:`~scrapy.Request` objects from URLs, you can define a
:attr:`~scrapy.Spider.start_urls` class attribute with a list of URLs. This
list will then be used by the default implementation of
:meth:`~scrapy.Spider.start` to create the initial requests for your
spider.

.. code-block:: python

    from pathlib import Path

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]

        def parse(self, response):
            page = response.url.split("/")[-2]
            filename = f"quotes-{page}.html"
            Path(filename).write_bytes(response.body)

The :meth:`~scrapy.Spider.parse` method will be called to handle each
of the requests for those URLs, even though we haven't explicitly told Scrapy
to do so. This happens because :meth:`~scrapy.Spider.parse` is Scrapy's
default callback method, which is called for requests without an explicitly
assigned callback.

Extracting data
---------------

The best way to learn how to extract data with Scrapy is by trying selectors
in the :ref:`Scrapy shell <topics-shell>`. Run::

    scrapy shell 'https://quotes.toscrape.com/page/1/'

.. note::

   Remember to always enclose URLs in quotes when running Scrapy shell from the
   command line; otherwise URLs containing arguments (i.e. the ``&`` character)
   will not work.

   On Windows, use double quotes instead::

       scrapy shell "https://quotes.toscrape.com/page/1/"

You will see something like::

    [ ... Scrapy log here ... ]
    2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/1/> (referer: None)
    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
    [s]   item       {}
    [s]   request    <GET https://quotes.toscrape.com/page/1/>
    [s]   response   <200 https://quotes.toscrape.com/page/1/>
    [s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
    [s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
    [s] Useful shortcuts:
    [s]   shelp()           Shell help (print this help)
    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
    [s]   view(response)    View response in a browser

Using the shell, you can try selecting elements using `CSS`_ with the response
object:

.. invisible-code-block: python

    response = load_response('https://quotes.toscrape.com/page/1/', 'quotes1.html')

.. code-block:: pycon

    >>> response.css("title")
    [<Selector query='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

The result of running ``response.css('title')`` is a list-like object called
:class:`~scrapy.selector.SelectorList`, which represents a list of
:class:`~scrapy.Selector` objects that wrap around XML/HTML elements
and allow you to run further queries to refine the selection or extract the
data.

To extract the text from the title above, you can do:

.. code-block:: pycon

    >>> response.css("title::text").getall()
    ['Quotes to Scrape']

There are two things to note here: one is that we've added ``::text`` to the
CSS query, to mean we want to select only the text elements directly inside the
``<title>`` element.  If we don't specify ``::text``, we'd get the full title
element, including its tags:

.. code-block:: pycon

    >>> response.css("title").getall()
    ['<title>Quotes to Scrape</title>']

The other thing is that the result of calling ``.getall()`` is a list: it is
possible that a selector returns more than one result, so we extract them all.
When you know you just want the first result, as in this case, you can do:

.. code-block:: pycon

    >>> response.css("title::text").get()
    'Quotes to Scrape'

As an alternative, you could've written:

.. code-block:: pycon

    >>> response.css("title::text")[0].get()
    'Quotes to Scrape'

Accessing an index on a :class:`~scrapy.selector.SelectorList` instance will
raise an :exc:`IndexError` exception if there are no results:

.. code-block:: pycon

    >>> response.css("noelement")[0].get()
    Traceback (most recent call last):
    ...
    IndexError: list index out of range

You might want to use ``.get()`` directly on the
:class:`~scrapy.selector.SelectorList` instance instead, which returns ``None``
if there are no results:

.. code-block:: pycon

    >>> response.css("noelement").get()

There's a lesson here: for most scraping code, you want it to be resilient to
errors due to things not being found on a page, so that even if some parts fail
to be scraped, you can at least get **some** data.
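
In the same spirit, ``.get()`` accepts a ``default`` argument, returned when
there is no match instead of ``None``:

.. code-block:: pycon

    >>> response.css("noelement").get(default="not-found")
    'not-found'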

Besides the :meth:`~scrapy.selector.SelectorList.getall` and
:meth:`~scrapy.selector.SelectorList.get` methods, you can also use
the :meth:`~scrapy.selector.SelectorList.re` method to extract using
:doc:`regular expressions <library/re>`:

.. code-block:: pycon

    >>> response.css("title::text").re(r"Quotes.*")
    ['Quotes to Scrape']
    >>> response.css("title::text").re(r"Q\w+")
    ['Quotes']
    >>> response.css("title::text").re(r"(\w+) to (\w+)")
    ['Quotes', 'Scrape']

In order to find the proper CSS selectors to use, you might find it useful to open
the response page from the shell in your web browser using ``view(response)``.
You can use your browser's developer tools to inspect the HTML and come up
with a selector (see :ref:`topics-developer-tools`).

`Selector Gadget`_ is also a nice tool to quickly find the CSS selector for
visually selected elements, and it works in many browsers.

.. _Selector Gadget: https://selectorgadget.com/

XPath: a brief intro
^^^^^^^^^^^^^^^^^^^^

Besides `CSS`_, Scrapy selectors also support using `XPath`_ expressions:

.. code-block:: pycon

    >>> response.xpath("//title")
    [<Selector query='//title' data='<title>Quotes to Scrape</title>'>]
    >>> response.xpath("//title/text()").get()
    'Quotes to Scrape'

XPath expressions are very powerful, and are the foundation of Scrapy
Selectors. In fact, CSS selectors are converted to XPath under-the-hood. You
can see that if you read the text representation of the selector
objects in the shell closely.

While perhaps not as popular as CSS selectors, XPath expressions offer more
power because, besides navigating the structure, they can also look at the
content. Using XPath, you're able to select things like: *the link
that contains the text "Next Page"*. This makes XPath very fitting to the task
of scraping, and we encourage you to learn XPath even if you already know how
to construct CSS selectors; it will make scraping much easier.
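
For instance, here is one way to grab the ``href`` of the link whose text
contains "Next", tried against the quotes page loaded above (a sketch of the
idea, not the only way to write it):

.. code-block:: pycon

    >>> response.xpath('//a[contains(text(), "Next")]/@href').get()
    '/page/2/'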

We won't cover much of XPath here, but you can read more about :ref:`using XPath
with Scrapy Selectors here <topics-selectors>`. To learn more about XPath, we
recommend `this tutorial to learn XPath through examples
<http://zvon.org/comp/r/tut-XPath_1.html>`_, and `this tutorial to learn "how
to think in XPath" <http://plasmasturm.org/log/xpath101/>`_.

.. _XPath: https://www.w3.org/TR/xpath-10/
.. _CSS: https://www.w3.org/TR/selectors

Extracting quotes and authors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Now that you know a bit about selection and extraction, let's complete our
spider by writing the code to extract the quotes from the web page.

Each quote in https://quotes.toscrape.com is represented by HTML elements that look
like this:

.. code-block:: html

    <div class="quote">
        <span class="text">“The world as we have created it is a process of our
        thinking. It cannot be changed without changing our thinking.”</span>
        <span>
            by <small class="author">Albert Einstein</small>
            <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <a class="tag" href="/tag/change/page/1/">change</a>
            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
            <a class="tag" href="/tag/thinking/page/1/">thinking</a>
            <a class="tag" href="/tag/world/page/1/">world</a>
        </div>
    </div>

Let's open up scrapy shell and play a bit to find out how to extract the data
we want::

    scrapy shell 'https://quotes.toscrape.com'

We get a list of selectors for the quote HTML elements with:

.. code-block:: pycon

    >>> response.css("div.quote")
    [<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
    <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
    ...]

Each of the selectors returned by the query above allows us to run further
queries over their sub-elements. Let's assign the first selector to a
variable, so that we can run our CSS selectors directly on a particular quote:

.. code-block:: pycon

    >>> quote = response.css("div.quote")[0]

Now, let's extract the ``text``, ``author`` and ``tags`` from that quote
using the ``quote`` object we just created:

.. code-block:: pycon

    >>> text = quote.css("span.text::text").get()
    >>> text
    '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
    >>> author = quote.css("small.author::text").get()
    >>> author
    'Albert Einstein'

Given that the tags are a list of strings, we can use the ``.getall()`` method
to get all of them:

.. code-block:: pycon

    >>> tags = quote.css("div.tags a.tag::text").getall()
    >>> tags
    ['change', 'deep-thoughts', 'thinking', 'world']

.. invisible-code-block: python

  from sys import version_info

Having figured out how to extract each bit, we can now iterate over all the
quote elements and put them together into a Python dictionary:

.. code-block:: pycon

    >>> for quote in response.css("div.quote"):
    ...     text = quote.css("span.text::text").get()
    ...     author = quote.css("small.author::text").get()
    ...     tags = quote.css("div.tags a.tag::text").getall()
    ...     print(dict(text=text, author=author, tags=tags))
    ...
    {'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
    {'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
    ...

Extracting data in our spider
-----------------------------

Let's get back to our spider. Until now, it hasn't extracted any data in
particular; it just saves the whole HTML page to a local file. Let's integrate
the extraction logic above into our spider.

A Scrapy spider typically generates many dictionaries containing the data
extracted from the page. To do that, we use the ``yield`` Python keyword
in the callback, as you can see below:

.. code-block:: python

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                    "tags": quote.css("div.tags a.tag::text").getall(),
                }

To run this spider, exit the scrapy shell by entering::

    quit()

Then, run::

   scrapy crawl quotes

Now, it should output the extracted data with the log::

    2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
    {'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
    2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
    {'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}

.. _storing-data:

Storing the scraped data
========================

The simplest way to store the scraped data is by using :ref:`Feed exports
<topics-feed-exports>`, with the following command::

    scrapy crawl quotes -O quotes.json

That will generate a ``quotes.json`` file containing all scraped items,
serialized in `JSON`_.

The ``-O`` command-line switch overwrites any existing file; use ``-o`` instead
to append new content to any existing file. However, appending to a JSON file
makes the file contents invalid JSON. When appending to a file, consider
using a different serialization format, such as `JSON Lines`_::

    scrapy crawl quotes -o quotes.jsonl

The `JSON Lines`_ format is useful because it's stream-like, so you can easily
append new records to it. It doesn't have the same problem as JSON when you run
the command twice. Also, as each record is a separate line, you can process big
files without having to fit everything in memory; there are tools like `JQ`_ to
help do that at the command line.
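
For example, a JSON Lines file can be processed one record at a time in plain
Python, without loading the whole file into memory (a minimal sketch):

.. code-block:: python

    import json

    with open("quotes.jsonl") as f:
        for line in f:
            record = json.loads(line)  # one quote per line
            print(record["author"])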

In small projects (like the one in this tutorial), that should be enough.
However, if you want to perform more complex things with the scraped items, you
can write an :ref:`Item Pipeline <topics-item-pipeline>`. A placeholder file
for Item Pipelines was set up for you when the project was created, in
``tutorial/pipelines.py``. You don't need to implement any item pipelines,
though, if you just want to store the scraped items.
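
As a taste of what such a pipeline looks like, here is a minimal, hypothetical
sketch that trims whitespace around the quote text (to take effect, it would
also have to be enabled through the :setting:`ITEM_PIPELINES` setting):

.. code-block:: python

    class StripTextPipeline:
        """Hypothetical pipeline: normalize the scraped quote text."""

        def process_item(self, item, spider):
            if "text" in item:
                item["text"] = item["text"].strip()
            return item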

.. _JSON Lines: https://jsonlines.org
.. _JQ: https://stedolan.github.io/jq

Following links
===============

Let's say, instead of just scraping the stuff from the first two pages
from https://quotes.toscrape.com, you want quotes from all the pages in the website.

Now that you know how to extract data from pages, let's see how to follow links
from them.

The first thing to do is extract the link to the page we want to follow.  Examining
our page, we can see there is a link to the next page with the following
markup:

.. code-block:: html

    <ul class="pager">
        <li class="next">
            <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
        </li>
    </ul>

We can try extracting it in the shell:

.. code-block:: pycon

    >>> response.css("li.next a").get()
    '<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

This gets the anchor element, but we want the attribute ``href``. For that,
Scrapy supports a CSS extension that lets you select the attribute contents,
like this:

.. code-block:: pycon

    >>> response.css("li.next a::attr(href)").get()
    '/page/2/'

There is also an ``attrib`` property available
(see :ref:`selecting-attributes` for more):

.. code-block:: pycon

    >>> response.css("li.next a").attrib["href"]
    '/page/2/'

Now let's see our spider, modified to recursively follow the link to the next
page, extracting data from it:

.. code-block:: python

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            "https://quotes.toscrape.com/page/1/",
        ]

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                    "tags": quote.css("div.tags a.tag::text").getall(),
                }

            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)

Now, after extracting the data, the ``parse()`` method looks for the link to
the next page, builds a full absolute URL using the
:meth:`~scrapy.http.Response.urljoin` method (since the links can be
relative) and yields a new request to the next page, registering itself as
callback to handle the data extraction for the next page and to keep the
crawling going through all the pages.

What you see here is Scrapy's mechanism of following links: when you yield
a Request in a callback method, Scrapy will schedule that request to be sent
and register a callback method to be executed when that request finishes.

Using this, you can build complex crawlers that follow links according to rules
you define, and extract different kinds of data depending on the page it's
visiting.

In our example, it creates a sort of loop, following all the links to the next page
until it doesn't find one -- handy for crawling blogs, forums and other sites with
pagination.

.. _response-follow-example:

A shortcut for creating Requests
--------------------------------

As a shortcut for creating Request objects you can use
:meth:`response.follow <scrapy.http.TextResponse.follow>`:

.. code-block:: python

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            "https://quotes.toscrape.com/page/1/",
        ]

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("span small::text").get(),
                    "tags": quote.css("div.tags a.tag::text").getall(),
                }

            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

Unlike ``scrapy.Request``, ``response.follow`` supports relative URLs directly --
no need to call ``urljoin``. Note that ``response.follow`` just returns a
Request instance; you still have to yield this Request.

.. skip: start

You can also pass a selector to ``response.follow`` instead of a string;
this selector should extract necessary attributes:

.. code-block:: python

    for href in response.css("ul.pager a::attr(href)"):
        yield response.follow(href, callback=self.parse)

For ``<a>`` elements there is a shortcut: ``response.follow`` uses their href
attribute automatically. So the code can be shortened further:

.. code-block:: python

    for a in response.css("ul.pager a"):
        yield response.follow(a, callback=self.parse)

To create multiple requests from an iterable, you can use
:meth:`response.follow_all <scrapy.http.TextResponse.follow_all>` instead:

.. code-block:: python

    anchors = response.css("ul.pager a")
    yield from response.follow_all(anchors, callback=self.parse)

or, shortening it further:

.. code-block:: python

    yield from response.follow_all(css="ul.pager a", callback=self.parse)

.. skip: end

More examples and patterns
--------------------------

Here is another spider that illustrates callbacks and following links,
this time for scraping author information:

.. code-block:: python

    import scrapy

    class AuthorSpider(scrapy.Spider):
        name = "author"

        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            author_page_links = response.css(".author + a")
            yield from response.follow_all(author_page_links, self.parse_author)

            pagination_links = response.css("li.next a")
            yield from response.follow_all(pagination_links, self.parse)

        def parse_author(self, response):
            def extract_with_css(query):
                return response.css(query).get(default="").strip()

            yield {
                "name": extract_with_css("h3.author-title::text"),
                "birthdate": extract_with_css(".author-born-date::text"),
                "bio": extract_with_css(".author-description::text"),
            }

This spider will start from the main page; it will follow all the links to the
author pages, calling the ``parse_author`` callback for each of them, and also
the pagination links with the ``parse`` callback as we saw before.

Here we're passing callbacks to
:meth:`response.follow_all <scrapy.http.TextResponse.follow_all>` as positional
arguments to make the code shorter; it also works for
:class:`~scrapy.Request`.

The ``parse_author`` callback defines a helper function to extract and clean
up the data from a CSS query and yields the Python dict with the author data.

Another interesting thing this spider demonstrates is that, even if there are
many quotes from the same author, we don't need to worry about visiting the
same author page multiple times. By default, Scrapy filters out duplicated
requests to URLs already visited, avoiding the problem of hitting servers too
much because of a programming mistake. This can be configured in the
:setting:`DUPEFILTER_CLASS` setting.
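
If you do want to visit the same URL more than once on purpose, you can opt
out of this filtering per request with the ``dont_filter`` argument of
:class:`~scrapy.Request`:

.. code-block:: python

    yield scrapy.Request(url, callback=self.parse, dont_filter=True)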

Hopefully by now you have a good understanding of how to use the mechanism
of following links and callbacks with Scrapy.

As yet another example spider that leverages the mechanism of following links,
check out the :class:`~scrapy.spiders.CrawlSpider` class for a generic
spider that implements a small rules engine that you can use to write your
crawlers on top of it.

Also, a common pattern is to build an item with data from more than one page,
using a :ref:`trick to pass additional data to the callbacks
<topics-request-response-ref-request-callback-arguments>`.

Using spider arguments
======================

You can provide command line arguments to your spiders by using the ``-a``
option when running them::

    scrapy crawl quotes -O quotes-humor.json -a tag=humor

These arguments are passed to the Spider's ``__init__`` method and become
spider attributes by default.

In this example, the value provided for the ``tag`` argument will be available
via ``self.tag``. You can use this to make your spider fetch only quotes
with a specific tag, building the URL based on the argument:

.. code-block:: python

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        async def start(self):
            url = "https://quotes.toscrape.com/"
            tag = getattr(self, "tag", None)
            if tag is not None:
                url = url + "tag/" + tag
            yield scrapy.Request(url, self.parse)

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }

            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, self.parse)

If you pass the ``tag=humor`` argument to this spider, you'll notice that it
will only visit URLs from the ``humor`` tag, such as
``https://quotes.toscrape.com/tag/humor``.

You can :ref:`learn more about handling spider arguments here <spiderargs>`.

Next steps
==========

This tutorial covered only the basics of Scrapy, but there's a lot of other
features not mentioned here. Check the :ref:`topics-whatelse` section in the
:ref:`intro-overview` chapter for a quick overview of the most important ones.

You can continue from the section :ref:`section-basics` to learn more about
the command-line tool, spiders, selectors and other things the tutorial hasn't
covered, like modeling the scraped data. If you'd prefer to play with an
example project, check the :ref:`intro-examples` section.

.. _JSON: https://en.wikipedia.org/wiki/JSON


.. highlight:: none

.. _topics-commands:

=================
Command line tool
=================

Scrapy is controlled through the ``scrapy`` command-line tool, to be referred to
here as the "Scrapy tool" to differentiate it from the sub-commands, which we
just call "commands" or "Scrapy commands".

The Scrapy tool provides several commands, for multiple purposes, and each one
accepts a different set of arguments and options.

(The ``scrapy deploy`` command was removed in 1.0 in favor of the standalone
``scrapyd-deploy``. See `Deploying your project`_.)

.. _topics-config-settings:

Configuration settings
======================

Scrapy will look for configuration parameters in ini-style ``scrapy.cfg`` files
in standard locations:

1. ``/etc/scrapy.cfg`` or ``c:\scrapy\scrapy.cfg`` (system-wide),
2. ``~/.config/scrapy.cfg`` (``$XDG_CONFIG_HOME``) and ``~/.scrapy.cfg`` (``$HOME``)
   for global (user-wide) settings, and
3. ``scrapy.cfg`` inside a Scrapy project's root (see next section).

Settings from these files are merged in the listed order of preference:
user-defined values have higher priority than system-wide defaults, and
project-wide settings will override all others, when defined.

Scrapy also understands, and can be configured through, a number of environment
variables. Currently these are:

* ``SCRAPY_SETTINGS_MODULE`` (see :ref:`topics-settings-module-envvar`)
* ``SCRAPY_PROJECT`` (see :ref:`topics-project-envvar`)
* ``SCRAPY_PYTHON_SHELL`` (see :ref:`topics-shell`)
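
For example, the first of these can be set from the shell to point the
``scrapy`` tool at a specific settings module (``myproject.settings`` is a
placeholder)::

    export SCRAPY_SETTINGS_MODULE=myproject.settings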

.. _topics-project-structure:

Default structure of Scrapy projects
====================================

Before delving into the command-line tool and its sub-commands, let's first
understand the directory structure of a Scrapy project.

Though it can be modified, all Scrapy projects have the same file
structure by default, similar to this::

   scrapy.cfg
   myproject/
       __init__.py
       items.py
       middlewares.py
       pipelines.py
       settings.py
       spiders/
           __init__.py
           spider1.py
           spider2.py
           ...

The directory where the ``scrapy.cfg`` file resides is known as the *project
root directory*. That file contains the name of the Python module that defines
the project settings. Here is an example:

.. code-block:: ini

    [settings]
    default = myproject.settings

.. _topics-project-envvar:

Sharing the root directory between projects
===========================================

A project root directory, the one that contains the ``scrapy.cfg``, may be
shared by multiple Scrapy projects, each with its own settings module.

In that case, you must define one or more aliases for those settings modules
under ``[settings]`` in your ``scrapy.cfg`` file:

.. code-block:: ini

    [settings]
    default = myproject1.settings
    project1 = myproject1.settings
    project2 = myproject2.settings

By default, the ``scrapy`` command-line tool will use the ``default`` settings.
Use the ``SCRAPY_PROJECT`` environment variable to specify a different project
for ``scrapy`` to use::

    $ scrapy settings --get BOT_NAME
    Project 1 Bot
    $ export SCRAPY_PROJECT=project2
    $ scrapy settings --get BOT_NAME
    Project 2 Bot

Using the ``scrapy`` tool
=========================

You can start by running the Scrapy tool with no arguments and it will print
some usage help and the available commands::

    Scrapy X.Y - no active project

    Usage:
      scrapy <command> [options] [args]

    Available commands:
      crawl         Run a spider
      fetch         Fetch a URL using the Scrapy downloader
    [...]

The first line will print the currently active project if you're inside a
Scrapy project. In this example it was run from outside a project. If run from inside
a project it would have printed something like this::

    Scrapy X.Y - project: myproject

    Usage:
      scrapy <command> [options] [args]

    [...]

Creating projects
-----------------

The first thing you typically do with the ``scrapy`` tool is create your Scrapy
project::

    scrapy startproject myproject [project_dir]

That will create a Scrapy project under the ``project_dir`` directory.
If ``project_dir`` wasn't specified, ``project_dir`` will be the same as ``myproject``.

Next, you go inside the new project directory::

    cd project_dir

And you're ready to use the ``scrapy`` command to manage and control your
project from there.

Controlling projects
--------------------

You use the ``scrapy`` tool from inside your projects to control and manage
them.

For example, to create a new spider::

    scrapy genspider mydomain mydomain.com

Some Scrapy commands (like :command:`crawl`) must be run from inside a Scrapy
project. See the :ref:`commands reference <topics-commands-ref>` below for more
information on which commands must be run from inside projects, and which not.

Also keep in mind that some commands may have slightly different behaviours
when run from inside a project. For example, the fetch command will use
spider-overridden behaviours (such as the ``custom_settings`` attribute to
override settings) if the URL being fetched is associated with some specific
spider. This is intentional, as the ``fetch`` command is meant to be used to
check how spiders are downloading pages.

.. _topics-commands-ref:

Available tool commands
=======================

This section contains a list of the available built-in commands with a
description and some usage examples. Remember, you can always get more info
about each command by running::

    scrapy <command> -h

And you can see all available commands with::

    scrapy -h

There are two kinds of commands: those that only work from inside a Scrapy
project (project-specific commands) and those that also work without an active
Scrapy project (global commands), though they may behave slightly differently
when run from inside a project (as they would use the project's overridden
settings).

Global commands:

* :command:`startproject`
* :command:`genspider`
* :command:`settings`
* :command:`runspider`
* :command:`shell`
* :command:`fetch`
* :command:`view`
* :command:`version`

Project-only commands:

* :command:`crawl`
* :command:`check`
* :command:`list`
* :command:`edit`
* :command:`parse`
* :command:`bench`

.. command:: startproject

startproject
------------

* Syntax: ``scrapy startproject <project_name> [project_dir]``
* Requires project: *no*

Creates a new Scrapy project named ``project_name``, under the ``project_dir``
directory.
If ``project_dir`` wasn't specified, ``project_dir`` will be the same as ``project_name``.

Usage example::

    $ scrapy startproject myproject

.. command:: genspider

genspider
---------

* Syntax: ``scrapy genspider [-t template] <name> <domain or URL>``
* Requires project: *no*

Creates a new spider in the current folder or in the current project's
``spiders`` folder, if called from inside a project. The ``<name>`` parameter
is set as the spider's ``name``, while ``<domain or URL>`` is used to generate
the ``allowed_domains`` and ``start_urls`` spider attributes.

Usage example::

    $ scrapy genspider -l
    Available templates:
      basic
      crawl
      csvfeed
      xmlfeed

    $ scrapy genspider example example.com
    Created spider 'example' using template 'basic'

    $ scrapy genspider -t crawl scrapyorg scrapy.org
    Created spider 'scrapyorg' using template 'crawl'

This is just a convenient shortcut command for creating spiders based on
pre-defined templates, but certainly not the only way to create spiders. You
can just create the spider source code files yourself, instead of using this
command.
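
If you prefer the latter, a minimal hand-written spider file looks roughly
like what the ``basic`` template generates (a sketch; the exact template
output may vary between Scrapy versions):

.. code-block:: python

    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ["example.com"]
        start_urls = ["https://example.com"]

        def parse(self, response):
            pass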

.. command:: crawl

crawl
-----

* Syntax: ``scrapy crawl <spider>``
* Requires project: *yes*

Start crawling using a spider.

Supported options:

* ``-h, --help``: show a help message and exit

* ``-a NAME=VALUE``: set a spider argument (may be repeated)

* ``--output FILE`` or ``-o FILE``: append scraped items to the end of FILE
  (use ``-`` for stdout). To define the output format, append a colon and the
  format to the output URI (e.g. ``-o FILE:FORMAT``)

* ``--overwrite-output FILE`` or ``-O FILE``: dump scraped items into FILE,
  overwriting any existing file. To define the output format, append a colon
  and the format to the output URI (e.g. ``-O FILE:FORMAT``)

Usage examples::

    $ scrapy crawl myspider
    [ ... myspider starts crawling ... ]

    $ scrapy crawl -o myfile:csv myspider
    [ ... myspider starts crawling and appends the result to the file myfile in csv format ... ]

    $ scrapy crawl -O myfile:json myspider
    [ ... myspider starts crawling and saves the result in myfile in json format overwriting the original content... ]

.. command:: check

check
-----

* Syntax: ``scrapy check [-l] <spider>``
* Requires project: *yes*

Run contract checks.

.. skip: start

Usage examples::

    $ scrapy check -l
    first_spider
      * parse
      * parse_item
    second_spider
      * parse
      * parse_item

    $ scrapy check
    [FAILED] first_spider:parse_item
    >>> 'RetailPricex' field is missing

    [FAILED] first_spider:parse
    >>> Returned 92 requests, expected 0..4

.. skip: end

.. command:: list

list
----

* Syntax: ``scrapy list``
* Requires project: *yes*

List all available spiders in the current project. The output is one spider per
line.

Usage example::

    $ scrapy list
    spider1
    spider2

.. command:: edit

edit
----

* Syntax: ``scrapy edit <spider>``
* Requires project: *yes*

Edit the given spider using the editor defined in the ``EDITOR`` environment
variable or (if unset) the :setting:`EDITOR` setting.

This command is provided only as a convenient shortcut for the most common
case; the developer is, of course, free to choose any tool or IDE to write and
debug spiders.

Usage example::

    $ scrapy edit spider1

.. command:: fetch

fetch
-----

* Syntax: ``scrapy fetch <url>``
* Requires project: *no*

Downloads the given URL using the Scrapy downloader and writes the contents to
standard output.

The interesting thing about this command is that it fetches the page the way the
spider would download it. For example, if the spider has a ``USER_AGENT``
attribute which overrides the User Agent, it will use that one.

So this command can be used to "see" how your spider would fetch a certain page.

If used outside a project, no particular per-spider behaviour is applied and
the command will just use the default Scrapy downloader settings.

Supported options:

* ``--spider=SPIDER``: bypass spider autodetection and force use of specific spider

* ``--headers``: print the response's HTTP headers instead of the response's body

* ``--no-redirect``: do not follow HTTP 3xx redirects (default is to follow them)

Usage examples::

    $ scrapy fetch --nolog http://www.example.com/some/page.html
    [ ... html content here ... ]

    $ scrapy fetch --nolog --headers http://www.example.com/
    {'Accept-Ranges': ['bytes'],
     'Age': ['1263   '],
     'Connection': ['close     '],
     'Content-Length': ['596'],
     'Content-Type': ['text/html; charset=UTF-8'],
     'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
     'Etag': ['"573c1-254-48c9c87349680"'],
     'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
     'Server': ['Apache/2.2.3 (CentOS)']}

.. command:: view

view
----

* Syntax: ``scrapy view <url>``
* Requires project: *no*

Opens the given URL in a browser, as your Scrapy spider would "see" it.
Sometimes spiders see pages differently from regular users, so this can be used
to check what the spider "sees" and confirm it's what you expect.

Supported options:

* ``--spider=SPIDER``: bypass spider autodetection and force use of specific spider

* ``--no-redirect``: do not follow HTTP 3xx redirects (default is to follow them)

Usage example::

    $ scrapy view http://www.example.com/some/page.html
    [ ... browser starts ... ]

.. command:: shell

shell
-----

* Syntax: ``scrapy shell [url]``
* Requires project: *no*

Starts the Scrapy shell for the given URL (if given) or empty if no URL is
given. Also supports UNIX-style local file paths, either relative with
``./`` or ``../`` prefixes or absolute file paths.
See :ref:`topics-shell` for more info.

Supported options:

* ``--spider=SPIDER``: bypass spider autodetection and force use of specific spider

* ``-c code``: evaluate the code in the shell, print the result and exit

* ``--no-redirect``: do not follow HTTP 3xx redirects (default is to follow them);
  this only affects the URL you may pass as argument on the command line;
  once you are inside the shell, ``fetch(url)`` will still follow HTTP redirects by default.

Usage example::

    $ scrapy shell http://www.example.com/some/page.html
    [ ... scrapy shell starts ... ]

    $ scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'
    (200, 'http://www.example.com/')

    # shell follows HTTP redirects by default
    $ scrapy shell --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
    (200, 'http://example.com/')

    # you can disable this with --no-redirect
    # (only for the URL passed as command line argument)
    $ scrapy shell --no-redirect --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
    (302, 'http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F')

.. command:: parse

parse
-----

* Syntax: ``scrapy parse <url> [options]``
* Requires project: *yes*

Fetches the given URL and parses it with the spider that handles it, using the
method passed with the ``--callback`` option, or ``parse`` if not given.

Supported options:

* ``--spider=SPIDER``: bypass spider autodetection and force use of specific spider

* ``-a NAME=VALUE``: set a spider argument (may be repeated)

* ``--callback`` or ``-c``: spider method to use as callback for parsing the
  response

* ``--meta`` or ``-m``: additional request meta that will be passed to the
  callback request. This must be a valid JSON string. Example:
  ``--meta='{"foo": "bar"}'``

* ``--cbkwargs``: additional keyword arguments that will be passed to the
  callback. This must be a valid JSON string. Example:
  ``--cbkwargs='{"foo": "bar"}'``

* ``--pipelines``: process items through pipelines

* ``--rules`` or ``-r``: use :class:`~scrapy.spiders.CrawlSpider`
  rules to discover the callback (i.e. spider method) to use for parsing the
  response

* ``--noitems``: don't show scraped items

* ``--nolinks``: don't show extracted links

* ``--nocolour``: avoid using Pygments to colorize the output

* ``--depth`` or ``-d``: depth level for which the requests should be followed
  recursively (default: 1)

* ``--verbose`` or ``-v``: display information for each depth level

* ``--output`` or ``-o``: dump scraped items to a file

.. skip: start

Usage example::

    $ scrapy parse http://www.example.com/ -c parse_item
    [ ... scrapy log lines crawling example.com spider ... ]

    >>> STATUS DEPTH LEVEL 1 <<<
    # Scraped Items  ------------------------------------------------------------
    [{'name': 'Example item',
     'category': 'Furniture',
     'length': '12 cm'}]

    # Requests  -----------------------------------------------------------------
    []

.. skip: end

.. command:: settings

settings
--------

* Syntax: ``scrapy settings [options]``
* Requires project: *no*

Get the value of a Scrapy setting.

If used inside a project it'll show the project setting value, otherwise it'll
show the default Scrapy value for that setting.

Example usage::

    $ scrapy settings --get BOT_NAME
    scrapybot
    $ scrapy settings --get DOWNLOAD_DELAY
    0

.. command:: runspider

runspider
---------

* Syntax: ``scrapy runspider <spider_file.py>``
* Requires project: *no*

Run a spider self-contained in a Python file, without having to create a
project.

Example usage::

    $ scrapy runspider myspider.py
    [ ... spider starts crawling ... ]

.. command:: version

version
-------

* Syntax: ``scrapy version [-v]``
* Requires project: *no*

Prints the Scrapy version. If used with ``-v`` it also prints Python, Twisted
and Platform info, which is useful for bug reports.

.. command:: bench

bench
-----

* Syntax: ``scrapy bench``
* Requires project: *no*

Run a quick benchmark test. See :ref:`benchmarking` for more information.

.. _topics-commands-crawlerprocess:

Commands that run a crawl
=========================

Many commands need to run a crawl of some kind, running either a user-provided
spider or a special internal one:

* :command:`bench`
* :command:`check`
* :command:`crawl`
* :command:`fetch`
* :command:`parse`
* :command:`runspider`
* :command:`shell`
* :command:`view`

They use an internal instance of :class:`scrapy.crawler.AsyncCrawlerProcess` or
:class:`scrapy.crawler.CrawlerProcess` for this. In most cases this detail
shouldn't matter to the user running the command, but when the user :ref:`needs
a non-default Twisted reactor <disable-asyncio>`, it may be important.

Scrapy decides which of these two classes to use based on the value of the
:setting:`TWISTED_REACTOR` setting. If the setting value is the default one
(``'twisted.internet.asyncioreactor.AsyncioSelectorReactor'``),
:class:`~scrapy.crawler.AsyncCrawlerProcess` will be used, otherwise
:class:`~scrapy.crawler.CrawlerProcess` will be used. The :ref:`spider settings
<spider-settings>` are not taken into account when doing this, as they are
loaded after this decision is made. This may cause an error if the
project-level setting is set to :ref:`the asyncio reactor <install-asyncio>`
(:ref:`explicitly <project-settings>` or :ref:`by using the Scrapy default
<default-settings>`) and :ref:`the setting of the spider being run
<spider-settings>` is set to :ref:`a different one <disable-asyncio>`, because
:class:`~scrapy.crawler.AsyncCrawlerProcess` only supports the asyncio reactor.
In this case you should set the :setting:`FORCE_CRAWLER_PROCESS` setting to
``True`` (at the project level or via the command line) so that Scrapy uses
:class:`~scrapy.crawler.CrawlerProcess` which supports all reactors.
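
For example, to do that for a single run, you can use the standard ``-s``
option to override the setting from the command line::

    scrapy crawl myspider -s FORCE_CRAWLER_PROCESS=True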

Custom project commands
=======================

You can also add your custom project commands by using the
:setting:`COMMANDS_MODULE` setting. See the Scrapy commands in
`scrapy/commands`_ for examples on how to implement your commands.

.. _scrapy/commands: https://github.com/scrapy/scrapy/tree/master/scrapy/commands
.. setting:: COMMANDS_MODULE

COMMANDS_MODULE
---------------

Default: ``''`` (empty string)

A module to use for looking up custom Scrapy commands. This is used to add custom
commands for your Scrapy project.

Example:

.. code-block:: python

    COMMANDS_MODULE = "mybot.commands"

.. _Deploying your project: https://scrapyd.readthedocs.io/en/latest/deploy.html

Register commands via setup.py entry points
-------------------------------------------

You can also add Scrapy commands from an external library by adding a
``scrapy.commands`` section in the entry points of the library's ``setup.py``
file.

The following example adds the ``my_command`` command:

.. skip: next

.. code-block:: python

  from setuptools import setup

  setup(
      name="scrapy-mymodule",
      entry_points={
          "scrapy.commands": [
              "my_command=my_scrapy_module.commands:MyCommand",
          ],
      },
  )


.. _topics-addons:

=======
Add-ons
=======

Scrapy's add-on system is a framework which unifies managing and configuring
components that extend Scrapy's core functionality, such as middlewares,
extensions, or pipelines. It provides users with a plug-and-play experience in
Scrapy extension management, and grants extensive configuration control to
developers.

Activating and configuring add-ons
==================================

During :class:`~scrapy.crawler.Crawler` initialization, the list of enabled
add-ons is read from your ``ADDONS`` setting.

The ``ADDONS`` setting is a dict in which every key is an add-on class or its
import path and the value is its priority.

This is an example where two add-ons are enabled in a project's
``settings.py``::

    ADDONS = {
        'path.to.someaddon': 0,
        SomeAddonClass: 1,
    }

Writing your own add-ons
========================

Add-ons are :ref:`components <topics-components>` that include one or both of
the following methods:

.. method:: update_settings(settings)

    This method is called during the initialization of the
    :class:`~scrapy.crawler.Crawler`. Here, you should perform dependency checks
    (e.g. for external Python libraries) and update the
    :class:`~scrapy.settings.Settings` object as wished, e.g. enable components
    for this add-on or set required configuration of other extensions.

    :param settings: The settings object storing Scrapy/component configuration
    :type settings: :class:`~scrapy.settings.Settings`

.. classmethod:: update_pre_crawler_settings(cls, settings)

    Use this class method instead of the :meth:`update_settings` method to
    update :ref:`pre-crawler settings <pre-crawler-settings>` whose value is
    used before the :class:`~scrapy.crawler.Crawler` object is created.

    :param settings: The settings object storing Scrapy/component configuration
    :type settings: :class:`~scrapy.settings.BaseSettings`

The settings set by the add-on should use the ``addon`` priority (see
:ref:`populating-settings` and :func:`scrapy.settings.BaseSettings.set`)::

    class MyAddon:
        def update_settings(self, settings):
            settings.set("DNSCACHE_ENABLED", True, "addon")

This allows users to override these settings in the project or spider
configuration.

When editing the value of a setting instead of overriding it entirely, it is
usually best to leave its priority unchanged, for example when editing a
:ref:`component priority dictionary <component-priority-dictionaries>`.

If the ``update_settings`` method raises
:exc:`scrapy.exceptions.NotConfigured`, the add-on will be skipped. This makes
it easy to enable an add-on only when some conditions are met.

Fallbacks
---------

Some components provided by add-ons need to fall back to "default"
implementations, e.g. a custom download handler needs to send the request that
it doesn't handle via the default download handler, or a stats collector that
includes some additional processing but otherwise uses the default stats
collector. And it's possible that a project needs to use several custom
components of the same type, e.g. two custom download handlers that support
different kinds of custom requests and still need to use the default download
handler for other requests. To make such use cases easier to configure, we
recommend that such custom components should be written in the following way:

1. The custom component (e.g. ``MyDownloadHandler``) shouldn't inherit from the
   default Scrapy one (e.g.
   ``scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler``), but instead
   be able to load the class of the fallback component from a special setting
   (e.g. ``MY_FALLBACK_DOWNLOAD_HANDLER``), create an instance of it and use
   it.
2. The add-ons that include these components should read the current value of
   the default setting (e.g. ``DOWNLOAD_HANDLERS``) in their
   ``update_settings()`` methods, save that value into the fallback setting
   (``MY_FALLBACK_DOWNLOAD_HANDLER`` mentioned earlier) and set the default
   setting to the component provided by the add-on (e.g.
   ``MyDownloadHandler``). If the fallback setting is already set by the user,
   it should not be changed.
3. This way, if there are several add-ons that want to modify the same setting,
   all of them will fall back to the component from the previous one and then to
   the Scrapy default. The order of that depends on the priority order in the
   ``ADDONS`` setting.

Add-on examples
===============

Set some basic configuration:

.. skip: next
.. code-block:: python

    from myproject.pipelines import MyPipeline

    class MyAddon:
        def update_settings(self, settings):
            settings.set("DNSCACHE_ENABLED", True, "addon")
            settings.remove_from_list("METAREFRESH_IGNORE_TAGS", "noscript")
            settings.setdefault_in_component_priority_dict(
                "ITEM_PIPELINES", MyPipeline, 200
            )

.. _priority-dict-helpers:

.. tip:: When editing a :ref:`component priority dictionary
    <component-priority-dictionaries>` setting, like :setting:`ITEM_PIPELINES`,
    consider using setting methods like
    :meth:`~scrapy.settings.BaseSettings.replace_in_component_priority_dict`,
    :meth:`~scrapy.settings.BaseSettings.set_in_component_priority_dict`
    and
    :meth:`~scrapy.settings.BaseSettings.setdefault_in_component_priority_dict`
    to avoid mistakes.

Check dependencies:

.. code-block:: python

    from scrapy.exceptions import NotConfigured

    class MyAddon:
        def update_settings(self, settings):
            try:
                import boto
            except ImportError:
                raise NotConfigured("MyAddon requires the boto library")
            ...

Access the crawler instance:

.. code-block:: python

    class MyAddon:
        def __init__(self, crawler) -> None:
            super().__init__()
            self.crawler = crawler

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)

        def update_settings(self, settings): ...

Use a fallback component:

.. code-block:: python

    from scrapy.utils.misc import build_from_crawler, load_object

    FALLBACK_SETTING = "MY_FALLBACK_DOWNLOAD_HANDLER"

    class MyHandler:
        lazy = False

        def __init__(self, crawler):
            dhcls = load_object(crawler.settings.get(FALLBACK_SETTING))
            self._fallback_handler = build_from_crawler(dhcls, crawler)

        async def download_request(self, request):
            if request.meta.get("my_params"):
                # handle the request
                ...
            else:
                return await self._fallback_handler.download_request(request)

        async def close(self):
            pass

    class MyAddon:
        def update_settings(self, settings):
            if not settings.get(FALLBACK_SETTING):
                settings.set(
                    FALLBACK_SETTING,
                    settings.getwithbase("DOWNLOAD_HANDLERS")["https"],
                    "addon",
                )
            settings["DOWNLOAD_HANDLERS"]["https"] = MyHandler


.. _topics-api:

========
Core API
========

This section documents the Scrapy core API, and it's intended for developers of
extensions and middlewares.

.. _topics-api-crawler:

Crawler API
===========

The main entry point to the Scrapy API is the :class:`~scrapy.crawler.Crawler`
object, which :ref:`components <topics-components>` can :ref:`get for
initialization <from-crawler>`. It provides access to all Scrapy core
components, and it is the only way for components to access them and hook their
functionality into Scrapy.

.. module:: scrapy.crawler
   :synopsis: The Scrapy crawler

The Extension Manager is responsible for loading and keeping track of installed
extensions. It is configured through the :setting:`EXTENSIONS` setting, which
contains a dictionary of all available extensions and their order, similar to
how you :ref:`configure the downloader middlewares
<topics-downloader-middleware-setting>`.
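
For example, in ``settings.py`` (``myproject.extensions.MyExtension`` is a
hypothetical extension):

.. code-block:: python

    EXTENSIONS = {
        # disable a built-in extension
        "scrapy.extensions.telnet.TelnetConsole": None,
        # enable a custom one with an order value
        "myproject.extensions.MyExtension": 500,
    }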

.. autoclass:: Crawler
    :members: get_addon, get_downloader_middleware, get_extension,
        get_item_pipeline, get_spider_middleware

    The Crawler object must be instantiated with a
    :class:`scrapy.Spider` subclass and a
    :class:`scrapy.settings.Settings` object.

    .. attribute:: request_fingerprinter

        The request fingerprint builder of this crawler.

        This is used from extensions and middlewares to build short, unique
        identifiers for requests. See :ref:`request-fingerprints`.

    .. attribute:: settings

        The settings manager of this crawler.

        This is used by extensions & middlewares to access the Scrapy settings
        of this crawler.

        For an introduction on Scrapy settings see :ref:`topics-settings`.

        For the API see :class:`~scrapy.settings.Settings` class.

    .. attribute:: signals

        The signals manager of this crawler.

        This is used by extensions & middlewares to hook themselves into Scrapy
        functionality.

        For an introduction on signals see :ref:`topics-signals`.

        For the API see :class:`~scrapy.signalmanager.SignalManager` class.

    .. attribute:: stats

        The stats collector of this crawler.

        This is used from extensions & middlewares to record stats of their
        behaviour, or access stats collected by other extensions.

        For an introduction on stats collection see :ref:`topics-stats`.

        For the API see :class:`~scrapy.statscollectors.StatsCollector` class.

    .. attribute:: extensions

        The extension manager that keeps track of enabled extensions.

        Most extensions won't need to access this attribute.

        For an introduction on extensions and a list of available extensions on
        Scrapy see :ref:`topics-extensions`.

    .. attribute:: engine

        The execution engine, which coordinates the core crawling logic
        between the scheduler, downloader and spiders.

        Some extensions may want to access the Scrapy engine, to inspect or
        modify the downloader and scheduler behaviour, although this is an
        advanced use and this API is not yet stable.

    .. attribute:: spider

        Spider currently being crawled. This is an instance of the spider class
        provided while constructing the crawler, and it is created with the
        arguments given in the :meth:`crawl` method.

    .. automethod:: crawl_async

    .. automethod:: crawl

    .. automethod:: stop_async

    .. automethod:: stop

.. autoclass:: AsyncCrawlerRunner
   :members:

.. autoclass:: CrawlerRunner
   :members:

.. autoclass:: AsyncCrawlerProcess
   :show-inheritance:
   :members:
   :inherited-members:

.. autoclass:: CrawlerProcess
   :show-inheritance:
   :members:
   :inherited-members:

.. _topics-api-settings:

Settings API
============

.. module:: scrapy.settings
   :synopsis: Settings manager

.. attribute:: SETTINGS_PRIORITIES

    Dictionary that sets the key name and priority level of the default
    settings priorities used in Scrapy.

    Each item defines a settings entry point, giving it a code name for
    identification and an integer priority. Greater priorities take
    precedence over lesser ones when setting and retrieving values in the
    :class:`~scrapy.settings.Settings` class.

    .. code-block:: python

        SETTINGS_PRIORITIES = {
            "default": 0,
            "command": 10,
            "addon": 15,
            "project": 20,
            "spider": 30,
            "cmdline": 40,
        }

    For a detailed explanation of each settings source, see
    :ref:`topics-settings`.
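
    For example, a higher-priority value wins regardless of the order in
    which values are set (a minimal sketch):

    .. code-block:: python

        from scrapy.settings import Settings

        settings = Settings()
        settings.set("DOWNLOAD_DELAY", 2.0, priority="project")
        settings.set("DOWNLOAD_DELAY", 0.0, priority="default")  # ignored: lower priority
        print(settings.getfloat("DOWNLOAD_DELAY"))  # 2.0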

.. autofunction:: get_settings_priority

.. autoclass:: Settings
   :show-inheritance:
   :members:

.. autoclass:: BaseSettings
   :members:

.. _topics-api-spiderloader:

SpiderLoader API
================

.. module:: scrapy.spiderloader
   :synopsis: The spider loader

.. class:: SpiderLoader

    This class is in charge of retrieving and handling the spider classes
    defined across the project.

    Custom spider loaders can be employed by specifying their path in the
    :setting:`SPIDER_LOADER_CLASS` project setting. They must fully implement
    the :class:`scrapy.interfaces.ISpiderLoader` interface to guarantee
    errorless execution.

    .. method:: from_settings(settings)

       This class method is used by Scrapy to create an instance of the class.
       It's called with the current project settings, and it loads the spiders
       found recursively in the modules of the :setting:`SPIDER_MODULES`
       setting.

       :param settings: project settings
       :type settings: :class:`~scrapy.settings.Settings` instance

    .. method:: load(spider_name)

       Get the spider class with the given name. It'll look into the previously
       loaded spiders for a spider class with name ``spider_name`` and will raise
       a ``KeyError`` if not found.

       :param spider_name: spider class name
       :type spider_name: str

    .. method:: list()

       Get the names of the available spiders in the project.

    .. method:: find_by_request(request)

       List the names of the spiders that can handle the given request. It will
       try to match the request's URL against the domains of the spiders.

       :param request: queried request
       :type request: :class:`~scrapy.Request` instance
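
    For example, you could load and inspect a project's spiders like this
    (a sketch; it assumes you run it inside a Scrapy project that defines a
    spider named ``example``):

    .. code-block:: python

        from scrapy.spiderloader import SpiderLoader
        from scrapy.utils.project import get_project_settings

        settings = get_project_settings()
        spider_loader = SpiderLoader.from_settings(settings)
        print(spider_loader.list())  # e.g. ['example']
        spider_cls = spider_loader.load("example")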

.. autoclass:: DummySpiderLoader

.. _topics-api-signals:

Signals API
===========

.. automodule:: scrapy.signalmanager
    :synopsis: The signal manager
    :members:
    :undoc-members:

.. _topics-api-stats:

Stats Collector API
===================

There are several Stats Collectors available under the
:mod:`scrapy.statscollectors` module and they all implement the Stats
Collector API defined by the :class:`~scrapy.statscollectors.StatsCollector`
class (which they all inherit from).

.. module:: scrapy.statscollectors
   :synopsis: Stats Collectors

.. class:: StatsCollector

    .. method:: get_value(key, default=None)

        Return the value for the given stats key or default if it doesn't exist.

    .. method:: get_stats()

        Get all stats from the currently running spider as a dict.

    .. method:: set_value(key, value)

        Set the given value for the given stats key.

    .. method:: set_stats(stats)

        Override the current stats with the dict passed in ``stats`` argument.

    .. method:: inc_value(key, count=1, start=0)

        Increment the value of the given stats key by the given count,
        using the given start value if the key is not set yet.

    .. method:: max_value(key, value)

        Set the given value for the given key only if the current value for the
        same key is lower than ``value``. If there is no current value for the
        given key, the value is always set.

    .. method:: min_value(key, value)

        Set the given value for the given key only if the current value for the
        same key is greater than ``value``. If there is no current value for the
        given key, the value is always set.

    .. method:: clear_stats()

        Clear all stats.
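
    For example, a component holding a reference to the stats collector could
    record custom stats like this (a sketch; ``MyExtension`` and its stat keys
    are hypothetical):

    .. code-block:: python

        class MyExtension:
            def __init__(self, stats):
                self.stats = stats

            @classmethod
            def from_crawler(cls, crawler):
                # crawler.stats is the stats collector of the crawler
                return cls(crawler.stats)

            def item_seen(self, depth):
                self.stats.inc_value("myextension/items_seen")
                self.stats.max_value("myextension/max_depth", depth)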

    The following methods are not part of the stats collection API but are
    instead used when implementing custom stats collectors:

    .. method:: open_spider()

        Open the spider for stats collection.

    .. method:: close_spider()

        Close the spider. After this is called, no more specific stats
        can be accessed or collected.

Engine API
==========

.. autoclass:: scrapy.core.engine.ExecutionEngine()
   :members: needs_backout


.. _topics-architecture:

=====================
Architecture overview
=====================

This document describes the architecture of Scrapy and how its components
interact.

Overview
========

The following diagram shows an overview of the Scrapy architecture with its
components and an outline of the data flow that takes place inside the system
(shown by the red arrows). A brief description of the components is included
below with links for more detailed information about them. The data flow is
also described below.

.. _data-flow:

Data flow
=========

.. image:: https://scrapy.readthedocs.io/en/latest/_images/scrapy_architecture_02.png
   :width: 700
   :height: 470
   :alt: Scrapy architecture

The data flow in Scrapy is controlled by the execution engine, and goes like
this:

1. The :ref:`Engine <component-engine>` gets the initial Requests to crawl from the
   :ref:`Spider <component-spiders>`.

2. The :ref:`Engine <component-engine>` schedules the Requests in the
   :ref:`Scheduler <component-scheduler>` and asks for the
   next Requests to crawl.

3. The :ref:`Scheduler <component-scheduler>` returns the next Requests
   to the :ref:`Engine <component-engine>`.

4. The :ref:`Engine <component-engine>` sends the Requests to the
   :ref:`Downloader <component-downloader>`, passing through the
   :ref:`Downloader Middlewares <component-downloader-middleware>` (see
   :meth:`~scrapy.downloadermiddlewares.DownloaderMiddleware.process_request`).

5. Once the page finishes downloading, the
   :ref:`Downloader <component-downloader>` generates a Response (with
   that page) and sends it to the Engine, passing through the
   :ref:`Downloader Middlewares <component-downloader-middleware>` (see
   :meth:`~scrapy.downloadermiddlewares.DownloaderMiddleware.process_response`).

6. The :ref:`Engine <component-engine>` receives the Response from the
   :ref:`Downloader <component-downloader>` and sends it to the
   :ref:`Spider <component-spiders>` for processing, passing
   through the :ref:`Spider Middleware <component-spider-middleware>` (see
   :meth:`~scrapy.spidermiddlewares.SpiderMiddleware.process_spider_input`).

7. The :ref:`Spider <component-spiders>` processes the Response and returns
   scraped items and new Requests (to follow) to the
   :ref:`Engine <component-engine>`, passing through the
   :ref:`Spider Middleware <component-spider-middleware>` (see
   :meth:`~scrapy.spidermiddlewares.SpiderMiddleware.process_spider_output`).

8. The :ref:`Engine <component-engine>` sends processed items to
   :ref:`Item Pipelines <component-pipelines>`, then sends processed Requests to
   the :ref:`Scheduler <component-scheduler>` and asks for possible next Requests
   to crawl.

9. The process repeats (from step 3) until there are no more requests from the
   :ref:`Scheduler <component-scheduler>`.

Components
==========

.. _component-engine:

Scrapy Engine
-------------

The engine is responsible for controlling the data flow between all components
of the system, and triggering events when certain actions occur. See the
:ref:`Data Flow <data-flow>` section above for more details.

.. _component-scheduler:

Scheduler
---------

The :ref:`scheduler <topics-scheduler>` receives requests from the engine and
enqueues them, so that it can feed them back to the engine when the engine
later requests them.

.. _component-downloader:

Downloader
----------

The Downloader is responsible for fetching web pages and feeding them to the
engine which, in turn, feeds them to the spiders.

.. _component-spiders:

Spiders
-------

Spiders are custom classes written by Scrapy users to parse responses and
extract :ref:`items <topics-items>` from them or additional requests to
follow. For more information see :ref:`topics-spiders`.

.. _component-pipelines:

Item Pipeline
-------------

The Item Pipeline is responsible for processing the items once they have been
extracted (or scraped) by the spiders. Typical tasks include cleansing,
validation and persistence (like storing the item in a database). For more
information see :ref:`topics-item-pipeline`.

.. _component-downloader-middleware:

Downloader middlewares
----------------------

Downloader middlewares are specific hooks that sit between the Engine and the
Downloader and process requests when they pass from the Engine to the
Downloader, and responses that pass from the Downloader to the Engine.

Use a Downloader middleware if you need to do one of the following:

* process a request just before it is sent to the Downloader
  (i.e. right before Scrapy sends the request to the website);
* change received response before passing it to a spider;
* send a new Request instead of passing received response to a spider;
* pass response to a spider without fetching a web page;
* silently drop some requests.

For more information see :ref:`topics-downloader-middleware`.

.. _component-spider-middleware:

Spider middlewares
------------------

Spider middlewares are specific hooks that sit between the Engine and the
Spiders and are able to process spider input (responses) and output (items and
requests).

Use a Spider middleware if you need to:

* post-process output of spider callbacks - change/add/remove requests or items;
* post-process start requests or items;
* handle spider exceptions;
* call errback instead of callback for some of the requests based on response
  content.

For more information see :ref:`topics-spider-middleware`.

Event-driven networking
=======================

Scrapy is written with `Twisted`_, a popular event-driven networking framework
for Python. Thus, it's implemented using non-blocking (aka asynchronous) code
for concurrency.

For more information about asynchronous programming and Twisted see these
links:

* :doc:`twisted:core/howto/defer-intro`
* `Twisted Introduction - Krondo`_

.. _Twisted: https://twisted.org/
.. _Twisted Introduction - Krondo: https://krondo.com/an-introduction-to-asynchronous-programming-and-twisted/


.. _using-asyncio:

=======
asyncio
=======

Scrapy has partial support for :mod:`asyncio`. After you :ref:`install the
asyncio reactor <install-asyncio>`, you may use :mod:`asyncio` and
:mod:`asyncio`-powered libraries in any :doc:`coroutine <coroutines>`.

.. _install-asyncio:

Installing the asyncio reactor
==============================

To enable :mod:`asyncio` support, your :setting:`TWISTED_REACTOR` setting needs
to be set to ``'twisted.internet.asyncioreactor.AsyncioSelectorReactor'``,
which is the default value.

If you are using :class:`~scrapy.crawler.AsyncCrawlerRunner` or
:class:`~scrapy.crawler.CrawlerRunner`, you also need to
install the :class:`~twisted.internet.asyncioreactor.AsyncioSelectorReactor`
reactor manually. You can do that using
:func:`~scrapy.utils.reactor.install_reactor`:

.. skip: next
.. code-block:: python

    from scrapy.utils.reactor import install_reactor

    install_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")

.. _asyncio-preinstalled-reactor:

Handling a pre-installed reactor
================================

``twisted.internet.reactor`` and some other Twisted imports install the default
Twisted reactor as a side effect. Once a Twisted reactor is installed, it is
not possible to switch to a different reactor at run time.

If you :ref:`configure the asyncio Twisted reactor <install-asyncio>` and, at
run time, Scrapy complains that a different reactor is already installed,
chances are you have some such imports in your code.

You can usually fix the issue by moving those offending module-level Twisted
imports to the method or function definitions where they are used. For example,
if you have something like:

.. skip: next
.. code-block:: python

    from twisted.internet import reactor

    def my_function():
        reactor.callLater(...)

Switch to something like:

.. code-block:: python

    def my_function():
        from twisted.internet import reactor

        reactor.callLater(...)

Alternatively, you can try to :ref:`manually install the asyncio reactor
<install-asyncio>`, with :func:`~scrapy.utils.reactor.install_reactor`, before
those imports happen.

.. _asyncio-await-dfd:

Integrating Deferred code and asyncio code
==========================================

Coroutine functions can await on Deferreds by wrapping them into
:class:`asyncio.Future` objects. Scrapy provides two helpers for this:

.. autofunction:: scrapy.utils.defer.deferred_to_future
.. autofunction:: scrapy.utils.defer.maybe_deferred_to_future

.. tip:: If you don't need to support reactors other than the default
         :class:`~twisted.internet.asyncioreactor.AsyncioSelectorReactor`, you
         can use :func:`~scrapy.utils.defer.deferred_to_future`, otherwise you
         should use :func:`~scrapy.utils.defer.maybe_deferred_to_future`.

.. tip:: If you need to use these functions in code that aims to be compatible
         with lower versions of Scrapy that do not provide these functions,
         down to Scrapy 2.0 (earlier versions do not support
         :mod:`asyncio`), you can copy the implementation of these functions
         into your own code.
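
For example, a coroutine callback can await a Deferred-returning API such as
the engine's ``download()`` method (a sketch; the additional request and the
extracted field are illustrative):

.. code-block:: python

    import scrapy
    from scrapy.utils.defer import maybe_deferred_to_future


    class MySpider(scrapy.Spider):
        name = "example"

        async def parse(self, response):
            additional_request = scrapy.Request("https://example.org/extra")
            deferred = self.crawler.engine.download(additional_request)
            additional_response = await maybe_deferred_to_future(deferred)
            yield {"extra": additional_response.css("title::text").get()}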

Coroutines and futures can be wrapped into Deferreds (for example, when a
Scrapy API requires passing a Deferred to it) using the following helpers:

.. autofunction:: scrapy.utils.defer.deferred_from_coro
.. autofunction:: scrapy.utils.defer.deferred_f_from_coro_f
.. autofunction:: scrapy.utils.defer.ensure_awaitable
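
For example, to pass a coroutine to an API that expects a Deferred (a minimal
sketch with a placeholder coroutine):

.. code-block:: python

    from scrapy.utils.defer import deferred_from_coro


    async def do_work():
        # placeholder coroutine standing in for real asyncio-based work
        return "done"


    d = deferred_from_coro(do_work())  # a Deferred that fires with "done"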

.. _enforce-asyncio-requirement:

Enforcing asyncio as a requirement
==================================

If you are writing a :ref:`component <topics-components>` that requires asyncio
to work, use :func:`scrapy.utils.asyncio.is_asyncio_available` to
:ref:`enforce it as a requirement <enforce-component-requirements>`. For
example:

.. code-block:: python

    from scrapy.utils.asyncio import is_asyncio_available

    class MyComponent:
        def __init__(self):
            if not is_asyncio_available():
                raise ValueError(
                    f"{MyComponent.__qualname__} requires the asyncio support. "
                    f"Make sure you have configured the asyncio reactor in the "
                    f"TWISTED_REACTOR setting. See the asyncio documentation "
                    f"of Scrapy for more information."
                )

.. autofunction:: scrapy.utils.asyncio.is_asyncio_available
.. autofunction:: scrapy.utils.reactor.is_asyncio_reactor_installed

.. _asyncio-without-reactor:

Using Scrapy without a Twisted reactor
======================================

.. versionadded:: 2.15.0

.. warning::
    This is currently experimental and may not be suitable for production use.

It's possible to use Scrapy without installing a Twisted reactor at all, by
setting the :setting:`TWISTED_REACTOR_ENABLED` setting to ``False``. In this
mode Scrapy will use the asyncio event loop directly, and most of the Scrapy
functionality will work in the same way.

Doing this provides several benefits in certain use cases:

* A Twisted reactor, once stopped, cannot be started again. This prevents, for
  example, using several instances of
  :class:`~scrapy.crawler.AsyncCrawlerProcess` in the same process when they
  use a reactor, but with ``TWISTED_REACTOR_ENABLED=False`` it becomes
  possible.
* There may be limitations imposed by
  :class:`~twisted.internet.asyncioreactor.AsyncioSelectorReactor` and related
  Twisted code, such as the requirement of using
  :class:`~asyncio.SelectorEventLoop` on Windows (see :ref:`asyncio-windows`),
  that do not apply if the reactor is not used.
* :class:`~twisted.internet.asyncioreactor.AsyncioSelectorReactor` manages the
  underlying event loop, and while :class:`~scrapy.crawler.AsyncCrawlerRunner`
  can use a pre-existing reactor which, in turn, can use a pre-existing event
  loop, it's easier to use :class:`~scrapy.crawler.AsyncCrawlerRunner` with a
  pre-existing loop directly.
* Omitting the reactor machinery may improve performance and reliability.

Limitations
-----------

As some Scrapy features and components require a reactor, they don't work and
are disabled without it. Replacements that don't require a reactor may be added
in future Scrapy versions. The following features are not available:

* The default HTTP(S) download handler,
  :class:`~scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler` (this
  is likely the biggest difference; Scrapy provides an HTTP(S) download handler
  that doesn't require a reactor and will be used instead of it:
  :class:`~scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler`)
* :class:`~scrapy.core.downloader.handlers.ftp.FTPDownloadHandler`
* :class:`~scrapy.core.downloader.handlers.http2.H2DownloadHandler`
* :ref:`topics-telnetconsole`
* :class:`~scrapy.crawler.CrawlerRunner` and
  :class:`~scrapy.crawler.CrawlerProcess`
  (:class:`~scrapy.crawler.AsyncCrawlerProcess` and
  :class:`~scrapy.crawler.AsyncCrawlerRunner` are available)
* Twisted-specific DNS resolvers (the :setting:`DNS_RESOLVER` setting)
* User and 3rd-party code that requires a reactor (see :ref:`below
  <asyncio-without-reactor-migrate>` for examples)

Note that importing Twisted modules and, among other things, creating and using
:class:`~twisted.internet.defer.Deferred` objects doesn't require a reactor, so
code that uses :class:`~twisted.internet.defer.Deferred`,
:class:`~twisted.python.failure.Failure` and some other Twisted APIs will not
necessarily stop working.

Other differences
-----------------

When :setting:`TWISTED_REACTOR_ENABLED` is set to ``False``, Scrapy will change
the defaults of some other settings:

* :setting:`TELNETCONSOLE_ENABLED` is set to ``False``.
* The ``"http"`` and ``"https"`` keys in :setting:`DOWNLOAD_HANDLERS_BASE` are
  set to ``"scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler"``.
* The ``"ftp"`` key in :setting:`DOWNLOAD_HANDLERS_BASE` is set to ``None``.

Thus, :class:`~scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler` is
used by default for making HTTP(S) requests. Please refer to its documentation
for its differences and limitations compared to
:class:`~scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler`.

Additionally, :class:`~scrapy.crawler.AsyncCrawlerProcess` will install a
:term:`meta path finder` that prevents :mod:`twisted.internet.reactor` from
being imported.

.. _asyncio-without-reactor-migrate:

Adding support to existing code
-------------------------------

Code that doesn't directly use Twisted APIs or APIs that depend on Twisted ones
doesn't need special support for running without a reactor.

Here are some examples of APIs and patterns that need a replacement:

* Using :meth:`reactor.callLater()
  <twisted.internet.base.ReactorBase.callLater>` for sleeping or delayed calls.
  You can use :meth:`asyncio.loop.call_later` instead.
* Using :func:`twisted.internet.threads.deferToThread`,
  :meth:`reactor.callFromThread()
  <twisted.internet.base.ReactorBase.callFromThread>` and related APIs to
  execute code in other threads. You can use :func:`asyncio.to_thread`,
  :meth:`asyncio.loop.call_soon_threadsafe` and related APIs instead.
* Using :class:`twisted.internet.task.LoopingCall` for scheduling repeated
  tasks. As there is no direct replacement in the standard library, you may
  need to write your own using :func:`asyncio.sleep` in a task (see the
  sketch after this list).
* Using Twisted network client and server APIs (:meth:`reactor.connectTCP()
  <twisted.internet.interfaces.IReactorTCP.connectTCP>`,
  :meth:`reactor.listenTCP()
  <twisted.internet.interfaces.IReactorTCP.listenTCP>`,
  :mod:`twisted.web.client`, :mod:`twisted.mail.smtp` etc.). You can use other
  built-in or 3rd-party libraries for this.
* Using :class:`~scrapy.crawler.CrawlerProcess` or
  :class:`~scrapy.crawler.CrawlerRunner`. You should use
  :class:`~scrapy.crawler.AsyncCrawlerProcess` or
  :class:`~scrapy.crawler.AsyncCrawlerRunner` respectively instead.
* Checking whether ``asyncio`` support is available with
  :func:`scrapy.utils.reactor.is_asyncio_reactor_installed`. You should use
  :func:`scrapy.utils.asyncio.is_asyncio_available` instead.

Scrapy provides unified helpers for some of these examples:

.. autofunction:: scrapy.utils.asyncio.call_later
.. autofunction:: scrapy.utils.asyncio.create_looping_call
.. autoclass:: scrapy.utils.asyncio.AsyncioLoopingCall
.. autofunction:: scrapy.utils.asyncio.run_in_thread

If your code needs to know whether the reactor is available, you can either
check for the value of the :setting:`TWISTED_REACTOR_ENABLED` setting (you need
access to the :class:`~scrapy.crawler.Crawler` instance to do this) or use the
following function:

.. autofunction:: scrapy.utils.reactorless.is_reactorless

In general, code that doesn't use the reactor (directly or indirectly) can be
used unmodified both with the asyncio reactor and without a reactor. This
includes code that converts Deferreds to futures and vice versa as described in
:ref:`asyncio-await-dfd`.

Troubleshooting
---------------

**ImportError: Import of twisted.internet.reactor is forbidden when running
without a Twisted reactor [...]:** Scrapy is configured to run without a
reactor, but some code imported :mod:`twisted.internet.reactor`, most likely
because that code needs a reactor to be used. You need to stop using this code
or set :setting:`TWISTED_REACTOR_ENABLED` back to ``True``. It's also possible
that the reactor isn't really needed but was installed due to the problem
described in :ref:`asyncio-preinstalled-reactor`, in which case it should be
enough to fix the problematic imports.

**RuntimeError: TWISTED_REACTOR_ENABLED is False but a Twisted reactor is
installed:** Scrapy is configured to run without a reactor, but a reactor is
already installed before the Scrapy code is executed. If you are trying to set
:setting:`TWISTED_REACTOR_ENABLED` via :ref:`per-spider settings
<spider-settings>`, it's currently unsupported.

**RuntimeError: We expected a Twisted reactor to be installed but it isn't:**
Scrapy is configured to run with a reactor and not to install one, but a
reactor wasn't installed before the Scrapy code is executed. If you are trying
to set :setting:`TWISTED_REACTOR_ENABLED` via :ref:`per-spider settings
<spider-settings>`, it's currently unsupported.

**RuntimeError: <class> doesn't support TWISTED_REACTOR_ENABLED=False:** The
listed class cannot be used with :setting:`TWISTED_REACTOR_ENABLED` set to
``False``. There may be a replacement in the :ref:`documentation above
<asyncio-without-reactor>` or the documentation of the affected class.

.. _asyncio-windows:

Windows-specific notes
======================

The Windows implementation of :mod:`asyncio` can use two event loop
implementations, :class:`~asyncio.ProactorEventLoop` (default) and
:class:`~asyncio.SelectorEventLoop`. However, only
:class:`~asyncio.SelectorEventLoop` works with Twisted.

Scrapy changes the event loop class to :class:`~asyncio.SelectorEventLoop`
automatically when you change the :setting:`TWISTED_REACTOR` setting or call
:func:`~scrapy.utils.reactor.install_reactor`.

.. note:: Other libraries you use may require
          :class:`~asyncio.ProactorEventLoop`, e.g. because it supports
          subprocesses (this is the case with `playwright`_), so you cannot use
          them together with Scrapy on Windows (but you should be able to use
          them on WSL or native Linux).

.. note:: This problem doesn't apply when not using the reactor, see
    :ref:`asyncio-without-reactor`.

.. _playwright: https://github.com/microsoft/playwright-python

.. _using-custom-loops:

Using custom asyncio loops
==========================

You can also use custom asyncio event loops with the asyncio reactor. Set the
:setting:`ASYNCIO_EVENT_LOOP` setting to the import path of the desired event
loop class to use it instead of the default asyncio event loop.
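
For example, to use the event loop provided by the ``uvloop`` package
(assuming it is installed):

.. code-block:: python

    ASYNCIO_EVENT_LOOP = "uvloop.Loop"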

.. _disable-asyncio:

Switching to a non-asyncio reactor
==================================

If for some reason your code doesn't work with the asyncio reactor, you can use
a different reactor by setting the :setting:`TWISTED_REACTOR` setting to its
import path (e.g. ``'twisted.internet.epollreactor.EPollReactor'``) or to
``None``, which will use the default reactor for your platform. If you are
using :class:`~scrapy.crawler.AsyncCrawlerRunner` or
:class:`~scrapy.crawler.AsyncCrawlerProcess` you also need to switch to their
Deferred-based counterparts: :class:`~scrapy.crawler.CrawlerRunner` or
:class:`~scrapy.crawler.CrawlerProcess` respectively.
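
For example, in ``settings.py``:

.. code-block:: python

    TWISTED_REACTOR = "twisted.internet.epollreactor.EPollReactor"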


.. _topics-autothrottle:

======================
AutoThrottle extension
======================

This is an extension for automatically throttling crawling speed based on load
of both the Scrapy server and the website you are crawling.

Design goals
============

1. be nicer to sites instead of using the default download delay of zero
2. automatically adjust Scrapy to the optimum crawling speed, so the user
   doesn't have to tune the download delays to find the optimum one.
   The user only needs to specify the maximum concurrent requests
   allowed, and the extension does the rest.

.. _autothrottle-algorithm:

How it works
============

Scrapy allows defining the concurrency and delay of different download slots,
e.g. through the :setting:`DOWNLOAD_SLOTS` setting. By default requests are
assigned to slots based on their URL domain, although it is possible to
customize the download slot of any request.

The AutoThrottle extension adjusts the delay of each download slot dynamically,
to make your spider send :setting:`AUTOTHROTTLE_TARGET_CONCURRENCY` concurrent
requests on average to each remote website.

It uses download latency to compute the delays. The main idea is the
following: if a server needs ``latency`` seconds to respond, a client
should send a request each ``latency/N`` seconds to have ``N`` requests
processed in parallel.

Instead of adjusting the delays, one can just set a small fixed
download delay and impose hard limits on concurrency using
:setting:`CONCURRENT_REQUESTS_PER_DOMAIN`. That will provide a similar
effect, but there are some important differences:

* because the download delay is small, there will be occasional bursts
  of requests;
* non-200 (error) responses can often be returned faster than regular
  responses, so with a small download delay and a hard concurrency limit
  the crawler will be sending requests to the server faster when the server
  starts to return errors. But this is the opposite of what a crawler should
  do: in case of errors it makes more sense to slow down, as these errors may
  be caused by the high request rate.

AutoThrottle doesn't have these issues.

Throttling algorithm
====================

The AutoThrottle algorithm adjusts download delays based on the following
rules:

1. spiders always start with a download delay of
   :setting:`AUTOTHROTTLE_START_DELAY`;
2. when a response is received, the target download delay is calculated as
   ``latency / N``, where ``latency`` is the latency of the response
   and ``N`` is :setting:`AUTOTHROTTLE_TARGET_CONCURRENCY`;
3. the download delay for the next requests is set to the average of the
   previous download delay and the target download delay;
4. latencies of non-200 responses are not allowed to decrease the delay;
5. the download delay can't become less than :setting:`DOWNLOAD_DELAY` or
   greater than :setting:`AUTOTHROTTLE_MAX_DELAY`.
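
The following sketch illustrates rules 2-5 (hypothetical names; this is not
Scrapy's actual implementation):

.. code-block:: python

    def next_delay(
        current_delay, latency, target_concurrency, min_delay, max_delay, ok
    ):
        # rule 2: the target delay aims at `target_concurrency` parallel requests
        target = latency / target_concurrency
        # rule 3: average the previous delay and the target delay
        new_delay = (current_delay + target) / 2
        # rule 4: non-200 responses may only increase the delay
        if not ok:
            new_delay = max(new_delay, current_delay)
        # rule 5: clamp between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY
        return min(max(new_delay, min_delay), max_delay)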

.. note:: The AutoThrottle extension honours the standard Scrapy settings for
   concurrency and delay. This means that it will respect
   :setting:`CONCURRENT_REQUESTS_PER_DOMAIN` and
   never set a download delay lower than :setting:`DOWNLOAD_DELAY`.

.. _download-latency:

In Scrapy, the download latency is measured as the time elapsed between
establishing the TCP connection and receiving the HTTP headers.

Note that these latencies are very hard to measure accurately in a cooperative
multitasking environment because Scrapy may be busy processing a spider
callback, for example, and unable to attend downloads. However, these latencies
should still give a reasonable estimate of how busy Scrapy (and ultimately, the
server) is, and this extension builds on that premise.

.. reqmeta:: autothrottle_dont_adjust_delay

Prevent specific requests from triggering slot delay adjustments
================================================================

AutoThrottle adjusts the delay of download slots based on the latencies of
responses that belong to that download slot. The only exceptions are non-200
responses, which are only taken into account to increase that delay, but
ignored if they would decrease that delay.

You can also set the ``autothrottle_dont_adjust_delay`` request metadata key to
``True`` in any request to prevent its response latency from impacting the
delay of its download slot:

.. code-block:: python

    from scrapy import Request

    Request("https://example.com", meta={"autothrottle_dont_adjust_delay": True})

Note, however, that AutoThrottle still determines the starting delay of every
download slot by setting the ``download_delay`` attribute on the running
spider. If you want AutoThrottle not to impact a download slot at all, in
addition to setting this meta key in all requests that use that download slot,
you might want to set a custom value for the ``delay`` attribute of that
download slot, e.g. using :setting:`DOWNLOAD_SLOTS`.
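
For example, a minimal sketch using :setting:`DOWNLOAD_SLOTS` to pin a fixed
delay for one slot (the domain is just an example):

.. code-block:: python

    DOWNLOAD_SLOTS = {
        "example.com": {"delay": 2.0, "randomize_delay": False},
    }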

Settings
========

The settings used to control the AutoThrottle extension are:

* :setting:`AUTOTHROTTLE_ENABLED`
* :setting:`AUTOTHROTTLE_START_DELAY`
* :setting:`AUTOTHROTTLE_MAX_DELAY`
* :setting:`AUTOTHROTTLE_TARGET_CONCURRENCY`
* :setting:`AUTOTHROTTLE_DEBUG`
* :setting:`CONCURRENT_REQUESTS_PER_DOMAIN`
* :setting:`DOWNLOAD_DELAY`

For more information see :ref:`autothrottle-algorithm`.
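
For example, a typical way to enable AutoThrottle in ``settings.py``, using
the settings documented below (adjust the values to your needs):

.. code-block:: python

    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 5.0
    AUTOTHROTTLE_MAX_DELAY = 60.0
    AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0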

.. setting:: AUTOTHROTTLE_ENABLED

AUTOTHROTTLE_ENABLED
~~~~~~~~~~~~~~~~~~~~

Default: ``False``

Enables the AutoThrottle extension.

.. setting:: AUTOTHROTTLE_START_DELAY

AUTOTHROTTLE_START_DELAY
~~~~~~~~~~~~~~~~~~~~~~~~

Default: ``5.0``

The initial download delay (in seconds).

.. setting:: AUTOTHROTTLE_MAX_DELAY

AUTOTHROTTLE_MAX_DELAY
~~~~~~~~~~~~~~~~~~~~~~

Default: ``60.0``

The maximum download delay (in seconds) to be set in case of high latencies.

.. setting:: AUTOTHROTTLE_TARGET_CONCURRENCY

AUTOTHROTTLE_TARGET_CONCURRENCY
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Default: ``1.0``

Average number of requests Scrapy should be sending in parallel to remote
websites. It must be higher than ``0.0``.

By default, AutoThrottle adjusts the delay to send a single
concurrent request to each of the remote websites. Set this option to
a higher value (e.g. ``2.0``) to increase the throughput and the load on remote
servers. A lower ``AUTOTHROTTLE_TARGET_CONCURRENCY`` value
(e.g. ``0.5``) makes the crawler more conservative and polite.

Note that :setting:`CONCURRENT_REQUESTS_PER_DOMAIN` is still respected
when the AutoThrottle extension is enabled. This means that if
``AUTOTHROTTLE_TARGET_CONCURRENCY`` is set to a value higher than
:setting:`CONCURRENT_REQUESTS_PER_DOMAIN`, the crawler won't reach this number
of concurrent requests.

At any given time, Scrapy can be sending more or fewer concurrent
requests than ``AUTOTHROTTLE_TARGET_CONCURRENCY``; it is a suggested
value that the crawler tries to approach, not a hard limit.

.. setting:: AUTOTHROTTLE_DEBUG

AUTOTHROTTLE_DEBUG
~~~~~~~~~~~~~~~~~~

Default: ``False``

Enable AutoThrottle debug mode which will display stats on every response
received, so you can see how the throttling parameters are being adjusted in
real time.


.. _benchmarking:

============
Benchmarking
============

Scrapy comes with a simple benchmarking suite that spawns a local HTTP server
and crawls it at the maximum possible speed. The goal of this benchmarking is
to get an idea of how Scrapy performs on your hardware, in order to have a
common baseline for comparisons. It uses a simple spider that does nothing and
just follows links.

To run it use::

    scrapy bench

You should see an output like this::

    2016-12-16 21:18:48 [scrapy.utils.log] INFO: Scrapy 1.2.2 started (bot: quotesbot)
    2016-12-16 21:18:48 [scrapy.utils.log] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['quotesbot.spiders'], 'LOGSTATS_INTERVAL': 1, 'BOT_NAME': 'quotesbot', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'quotesbot.spiders'}
    2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.closespider.CloseSpider',
     'scrapy.extensions.logstats.LogStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.corestats.CoreStats']
    2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2016-12-16 21:18:49 [scrapy.core.engine] INFO: Spider opened
    2016-12-16 21:18:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-12-16 21:18:50 [scrapy.extensions.logstats] INFO: Crawled 70 pages (at 4200 pages/min), scraped 0 items (at 0 items/min)
    2016-12-16 21:18:51 [scrapy.extensions.logstats] INFO: Crawled 134 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
    2016-12-16 21:18:52 [scrapy.extensions.logstats] INFO: Crawled 198 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
    2016-12-16 21:18:53 [scrapy.extensions.logstats] INFO: Crawled 254 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
    2016-12-16 21:18:54 [scrapy.extensions.logstats] INFO: Crawled 302 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
    2016-12-16 21:18:55 [scrapy.extensions.logstats] INFO: Crawled 358 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
    2016-12-16 21:18:56 [scrapy.extensions.logstats] INFO: Crawled 406 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
    2016-12-16 21:18:57 [scrapy.extensions.logstats] INFO: Crawled 438 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
    2016-12-16 21:18:58 [scrapy.extensions.logstats] INFO: Crawled 470 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
    2016-12-16 21:18:59 [scrapy.core.engine] INFO: Closing spider (closespider_timeout)
    2016-12-16 21:18:59 [scrapy.extensions.logstats] INFO: Crawled 518 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
    2016-12-16 21:19:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 229995,
     'downloader/request_count': 534,
     'downloader/request_method_count/GET': 534,
     'downloader/response_bytes': 1565504,
     'downloader/response_count': 534,
     'downloader/response_status_count/200': 534,
     'finish_reason': 'closespider_timeout',
     'finish_time': datetime.datetime(2016, 12, 16, 16, 19, 0, 647725),
     'log_count/INFO': 17,
     'request_depth_max': 19,
     'response_received_count': 534,
     'scheduler/dequeued': 533,
     'scheduler/dequeued/memory': 533,
     'scheduler/enqueued': 10661,
     'scheduler/enqueued/memory': 10661,
     'start_time': datetime.datetime(2016, 12, 16, 16, 18, 49, 799869)}
    2016-12-16 21:19:00 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)

That tells you that Scrapy is able to crawl about 3000 pages per minute on the
hardware where you run it. Note that this is a very simple spider intended
only to follow links; any custom spider you write will probably do more work,
which results in slower crawl rates. How much slower depends on how much your
spider does and how well it's written.

Use scrapy-bench_ for more complex benchmarking.

.. _scrapy-bench: https://github.com/scrapy/scrapy-bench


.. _topics-broad-crawls:

============
Broad Crawls
============

Scrapy defaults are optimized for crawling specific sites. These sites are
often handled by a single Scrapy spider, although this is not necessary or
required (for example, there are generic spiders that handle any given site
thrown at them).

In addition to this "focused crawl", there is another common type of crawling
which covers a large (potentially unlimited) number of domains, and is only
limited by time or another arbitrary constraint, rather than stopping when the
domain has been crawled to completion or when there are no more requests to
perform. These are called "broad crawls", and they are the typical crawls
employed by search engines.

These are some common properties often found in broad crawls:

* they crawl many domains (often, unbounded) instead of a specific set of sites

* they don't necessarily crawl domains to completion, because it would be
  impractical (or impossible) to do so, and instead limit the crawl by time or
  number of pages crawled

* they are simpler in logic (as opposed to very complex spiders with many
  extraction rules) because data is often post-processed in a separate stage

* they crawl many domains concurrently, which allows them to achieve faster
  crawl speeds by not being limited by any particular site constraint (each site
  is crawled slowly to respect politeness, but many sites are crawled in
  parallel)

As said above, Scrapy default settings are optimized for focused crawls, not
broad crawls. However, due to its asynchronous architecture, Scrapy is very
well suited for performing fast broad crawls. This page summarizes some things
you need to keep in mind when using Scrapy for doing broad crawls, along with
concrete suggestions of Scrapy settings to tune in order to achieve an
efficient broad crawl.

.. _broad-crawls-scheduler-priority-queue:

.. _broad-crawls-concurrency:

Increase concurrency
====================

Concurrency is the number of requests that are processed in parallel. There is
a global limit (:setting:`CONCURRENT_REQUESTS`) and an additional limit that
can be set per domain (:setting:`CONCURRENT_REQUESTS_PER_DOMAIN`).

The default global concurrency limit in Scrapy is not suitable for crawling
many different domains in parallel, so you will want to increase it. How much
to increase it will depend on how much CPU and memory your crawler will have
available.

A good starting point is ``100``:

.. code-block:: python

    CONCURRENT_REQUESTS = 100

But the best way to find out is by doing some trials and identifying at what
concurrency your Scrapy process gets CPU bound. For optimum performance, you
should pick a concurrency where CPU usage is at 80-90%.

Increasing concurrency also increases memory usage. If memory usage is a
concern, you might need to lower your global concurrency limit accordingly.

Increase Twisted IO thread pool maximum size
============================================

Currently Scrapy does DNS resolution in a blocking way, using a thread
pool. At higher concurrency levels crawling could become slow or even fail,
hitting DNS resolver timeouts. A possible solution is to increase the number
of threads handling DNS queries. The DNS queue will then be processed faster,
speeding up the establishment of connections and crawling overall.

To increase the maximum thread pool size use:

.. code-block:: python

    REACTOR_THREADPOOL_MAXSIZE = 20

Set up your own DNS
===================

If you have multiple crawling processes and a single central DNS server, the
crawl can act like a DoS attack on the DNS server, slowing down the entire
network or even blocking your machines. To avoid this, set up your own DNS
server with a local cache and an upstream to some large DNS service like
OpenDNS or Verizon.
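
Separately, Scrapy's built-in DNS behavior can be tuned through settings; the
values below are the documented defaults, shown here only as a starting point:

.. code-block:: python

    DNSCACHE_ENABLED = True  # cache resolved names in memory
    DNSCACHE_SIZE = 10000  # maximum number of cached entries
    DNS_TIMEOUT = 60  # seconds to wait for DNS resolution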

Reduce log level
================

When doing broad crawls you are often only interested in the crawl rates you
get and any errors found. These stats are reported by Scrapy when using the
``INFO`` log level. In order to save CPU (and log storage requirements) you
should not use ``DEBUG`` log level when performing large broad crawls in
production. Using ``DEBUG`` level when developing your (broad) crawler may be
fine though.

To set the log level use:

.. code-block:: python

    LOG_LEVEL = "INFO"

Disable cookies
===============

Disable cookies unless you *really* need them. Cookies are often not needed
when doing broad crawls (search engine crawlers ignore them), and disabling
them improves performance by saving some CPU cycles and reducing the memory
footprint of your Scrapy crawler.

To disable cookies use:

.. code-block:: python

    COOKIES_ENABLED = False

Disable retries
===============

Retrying failed HTTP requests can slow down the crawls substantially,
especially when sites are very slow (or fail) to respond, causing a timeout
error which gets retried many times, unnecessarily, preventing crawler
capacity from being reused for other domains.

To disable retries use:

.. code-block:: python

    RETRY_ENABLED = False

Reduce download timeout
=======================

Unless you are crawling from a very slow connection (which shouldn't be the
case for broad crawls) reduce the download timeout so that stuck requests are
discarded quickly and free up capacity to process the next ones.

To reduce the download timeout use:

.. code-block:: python

    DOWNLOAD_TIMEOUT = 15

Disable redirects
=================

Consider disabling redirects, unless you are interested in following them. When
doing broad crawls it's common to save redirects and resolve them when
revisiting the site at a later crawl. This also helps to keep the number of
requests per crawl batch constant; otherwise redirect loops may cause the
crawler to dedicate too many resources to any specific domain.

To disable redirects use:

.. code-block:: python

    REDIRECT_ENABLED = False

.. _broad-crawls-bfo:

Crawl in BFO order
==================

:ref:`Scrapy crawls in DFO order by default <faq-bfo-dfo>`.

In broad crawls, however, page crawling tends to be faster than page
processing. As a result, unprocessed early requests stay in memory until the
final depth is reached, which can significantly increase memory usage.

:ref:`Crawl in BFO order <faq-bfo-dfo>` instead to save memory.

Be mindful of memory leaks
==========================

If your broad crawl shows a high memory usage, in addition to :ref:`crawling in
BFO order <broad-crawls-bfo>` and :ref:`lowering concurrency
<broad-crawls-concurrency>` you should :ref:`debug your memory leaks
<topics-leaks>`.

Install a specific Twisted reactor
==================================

If the crawl is exceeding the system's capabilities, you might want to try
installing a specific Twisted reactor, via the :setting:`TWISTED_REACTOR` setting.


.. _topics-components:

==========
Components
==========

A Scrapy component is any class whose objects are built using
:func:`~scrapy.utils.misc.build_from_crawler`.

That includes the classes that you may assign to the following settings:

-   :setting:`ADDONS`

-   :setting:`TWISTED_DNS_RESOLVER`

-   :setting:`DOWNLOAD_HANDLERS`

-   :setting:`DOWNLOADER_MIDDLEWARES`

-   :setting:`DUPEFILTER_CLASS`

-   :setting:`EXTENSIONS`

-   :setting:`FEED_EXPORTERS`

-   :setting:`FEED_STORAGES`

-   :setting:`ITEM_PIPELINES`

-   :setting:`SCHEDULER`

-   :setting:`SCHEDULER_DISK_QUEUE`

-   :setting:`SCHEDULER_MEMORY_QUEUE`

-   :setting:`SCHEDULER_PRIORITY_QUEUE`

-   :setting:`SCHEDULER_START_DISK_QUEUE`

-   :setting:`SCHEDULER_START_MEMORY_QUEUE`

-   :setting:`SPIDER_MIDDLEWARES`

Third-party Scrapy components may also let you define additional Scrapy
components, usually configurable through :ref:`settings <topics-settings>`, to
modify their behavior.

.. _from-crawler:

Initializing from the crawler
=============================

Any Scrapy component may optionally define the following class method:

.. classmethod:: from_crawler(cls, crawler: scrapy.crawler.Crawler, *args, **kwargs)

    Return an instance of the component based on *crawler*.

    *args* and *kwargs* are component-specific arguments that some components
    receive. However, most components do not get any arguments, and instead
    :ref:`use settings <component-settings>`.

    If a component class defines this method, this class method is called to
    create any instance of the component.

    The *crawler* object provides access to all Scrapy core components like
    :ref:`settings <topics-settings>` and :ref:`signals <topics-signals>`,
    allowing the component to access them and hook its functionality into
    Scrapy.

.. _component-settings:

Settings
========

Components can be configured through :ref:`settings <topics-settings>`.

Components can read any setting from the
:attr:`~scrapy.crawler.Crawler.settings` attribute of the
:class:`~scrapy.crawler.Crawler` object they can :ref:`get for initialization
<from-crawler>`. That includes both built-in and custom settings.

For example:

.. code-block:: python

    class MyExtension:
        @classmethod
        def from_crawler(cls, crawler):
            settings = crawler.settings
            return cls(settings.getbool("LOG_ENABLED"))

        def __init__(self, log_is_enabled=False):
            if log_is_enabled:
                print("log is enabled!")

Components do not need to declare their custom settings programmatically.
However, they should document them, so that users know they exist and how to
use them.

It is a good practice to prefix custom settings with the name of the component,
to avoid collisions with custom settings of other existing (or future)
components. For example, an extension called ``WarcCaching`` could prefix its
custom settings with ``WARC_CACHING_``.

Another good practice, mainly for components meant for :ref:`component priority
dictionaries <component-priority-dictionaries>`, is to provide a boolean setting
called ``<PREFIX>_ENABLED`` (e.g. ``WARC_CACHING_ENABLED``) to allow toggling
that component on and off without changing the component priority dictionary
setting. You can usually check the value of such a setting during
initialization, and if ``False``, raise
:exc:`~scrapy.exceptions.NotConfigured`.
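
For example, a minimal sketch of such a toggle for the hypothetical
``WarcCaching`` extension mentioned above:

.. code-block:: python

    from scrapy.exceptions import NotConfigured

    class WarcCaching:
        @classmethod
        def from_crawler(cls, crawler):
            # Disable the component unless its toggle setting is True.
            if not crawler.settings.getbool("WARC_CACHING_ENABLED"):
                raise NotConfigured("WARC_CACHING_ENABLED is False")
            return cls()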

When choosing a name for a custom setting, it is also a good idea to have a
look at the names of :ref:`built-in settings <topics-settings-ref>`, to try to
maintain consistency with them.

.. _enforce-component-requirements:

Enforcing requirements
======================

Sometimes, your components may only be intended to work under certain
conditions. For example, they may require a minimum version of Scrapy to work as
intended, or they may require certain settings to have specific values.

In addition to describing those conditions in the documentation of your
component, it is a good practice to raise an exception from the ``__init__``
method of your component if those conditions are not met at run time.

In the case of :ref:`downloader middlewares <topics-downloader-middleware>`,
:ref:`extensions <topics-extensions>`, :ref:`item pipelines
<topics-item-pipeline>`, and :ref:`spider middlewares
<topics-spider-middleware>`, you should raise
:exc:`~scrapy.exceptions.NotConfigured`, passing a description of the issue as
a parameter to the exception so that it is printed in the logs, for the user to
see. For other components, feel free to raise whatever other exception feels
right to you; for example, :exc:`RuntimeError` would make sense for a Scrapy
version mismatch, while :exc:`ValueError` may be better if the issue is the
value of a setting.

If your requirement is a minimum Scrapy version, you may use
:attr:`scrapy.__version__` to enforce your requirement. For example:

.. code-block:: python

    from packaging.version import parse as parse_version

    import scrapy

    class MyComponent:
        def __init__(self):
            if parse_version(scrapy.__version__) < parse_version("2.7"):
                raise RuntimeError(
                    f"{MyComponent.__qualname__} requires Scrapy 2.7 or "
                    f"later, which allow defining the process_spider_output "
                    f"method of spider middlewares as an asynchronous "
                    f"generator."
                )

API reference
=============

The following function can be used to create an instance of a component class:

.. autofunction:: scrapy.utils.misc.build_from_crawler

The following function can also be useful when implementing a component, to
report the import path of the component class, e.g. when reporting problems:

.. autofunction:: scrapy.utils.python.global_object_name
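
For example, assuming the function returns the full dotted import path of the
given class:

.. code-block:: pycon

    >>> from scrapy.utils.python import global_object_name
    >>> from scrapy import Spider
    >>> global_object_name(Spider)
    'scrapy.spiders.Spider'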


.. _topics-contracts:

=================
Spiders Contracts
=================

Testing spiders can get particularly annoying and, while nothing prevents you
from writing unit tests, the task gets cumbersome quickly. Scrapy offers an
integrated way of testing your spiders by means of contracts.

This allows you to test each callback of your spider by hardcoding a sample
URL and checking various constraints for how the callback processes the
response. Each contract is prefixed with an ``@`` and included in the
docstring. See the following example:

.. code-block:: python

    def parse(self, response):
        """
        This function parses a sample response. Some contracts are mingled
        with this docstring.

        @url http://www.example.com/s?field-keywords=selfish+gene
        @returns items 1 16
        @returns requests 0 0
        @scrapes Title Author Year Price
        """

You can use the following contracts:

.. module:: scrapy.contracts.default

.. class:: UrlContract

    This contract (``@url``) sets the sample URL used when checking other
    contract conditions for this spider. This contract is mandatory. All
    callbacks lacking this contract are ignored when running the checks::

        @url url

.. class:: CallbackKeywordArgumentsContract

    This contract (``@cb_kwargs``) sets the :attr:`cb_kwargs <scrapy.Request.cb_kwargs>`
    attribute for the sample request. It must be a valid JSON dictionary.
    ::

        @cb_kwargs {"arg1": "value1", "arg2": "value2", ...}

.. class:: MetadataContract

    This contract (``@meta``) sets the :attr:`meta <scrapy.Request.meta>`
    attribute for the sample request. It must be a valid JSON dictionary.
    ::

        @meta {"arg1": "value1", "arg2": "value2", ...}

.. class:: ReturnsContract

    This contract (``@returns``) sets lower and upper bounds for the items and
    requests returned by the spider. The upper bound is optional::

        @returns item(s)|request(s) [min [max]]

.. class:: ScrapesContract

    This contract (``@scrapes``) checks that all the items returned by the
    callback have the specified fields::

        @scrapes field_1 field_2 ...

Use the :command:`check` command to run the contract checks.
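
For example, to check a single spider::

    scrapy check myspider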

Custom Contracts
================

If you find you need more power than the built-in Scrapy contracts you can
create and load your own contracts in the project by using the
:setting:`SPIDER_CONTRACTS` setting:

.. code-block:: python

    SPIDER_CONTRACTS = {
        "myproject.contracts.ResponseCheck": 10,
        "myproject.contracts.ItemValidate": 10,
    }

Each contract must inherit from :class:`~scrapy.contracts.Contract` and can
override three methods:

.. module:: scrapy.contracts

.. class:: Contract(method, *args)

    :param method: callback function to which the contract is associated
    :type method: collections.abc.Callable

    :param args: list of arguments passed into the docstring (whitespace
        separated)
    :type args: list

    .. method:: Contract.adjust_request_args(args)

        This receives a ``dict`` as an argument containing default arguments
        for the request object. :class:`~scrapy.Request` is used by default,
        but this can be changed with the ``request_cls`` attribute.
        If multiple contracts in the chain have this attribute defined, the
        last one is used.

        Must return the same ``dict`` or a modified version of it.

    .. method:: Contract.pre_process(response)

        This allows hooking in various checks on the response received from
        the sample request, before it is passed to the callback.

    .. method:: Contract.post_process(output)

        This allows processing the output of the callback. Iterators are
        converted to lists before being passed to this hook.

Raise :class:`~scrapy.exceptions.ContractFail` from
:class:`~scrapy.contracts.Contract.pre_process` or
:class:`~scrapy.contracts.Contract.post_process` if expectations are not met:

.. autoclass:: scrapy.exceptions.ContractFail

Here is a demo contract which checks the presence of a custom header in the
response received:

.. skip: next
.. code-block:: python

    from scrapy.contracts import Contract
    from scrapy.exceptions import ContractFail

    class HasHeaderContract(Contract):
        """
        Demo contract which checks the presence of a custom header
        @has_header X-CustomHeader
        """

        name = "has_header"

        def pre_process(self, response):
            for header in self.args:
                if header not in response.headers:
                    raise ContractFail(f"Header {header} not present")

.. _detecting-contract-check-runs:

Detecting check runs
====================

When ``scrapy check`` is running, the ``SCRAPY_CHECK`` environment variable is
set to the ``true`` string. You can use :data:`os.environ` to perform any change to
your spiders or your settings when ``scrapy check`` is used:

.. code-block:: python

    import os
    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"

        def __init__(self):
            if os.environ.get("SCRAPY_CHECK"):
                pass  # Do some scraper adjustments when a check is running


.. _topics-coroutines:

==========
Coroutines
==========

Scrapy :ref:`supports <coroutine-support>` the :ref:`coroutine syntax <async>`
(i.e. ``async def``).

.. _coroutine-support:

Supported callables
===================

The following callables may be defined as coroutines using ``async def``, and
hence use coroutine syntax (e.g. ``await``, ``async for``, ``async with``):

-   The :meth:`~scrapy.spiders.Spider.start` spider method, which *must* be
    defined as an :term:`asynchronous generator`.

    .. versionadded:: 2.13

-   :class:`~scrapy.Request` callbacks.

    If you are using any custom or third-party :ref:`spider middleware
    <topics-spider-middleware>`, see :ref:`sync-async-spider-middleware`.

-   The :meth:`process_item` method of
    :ref:`item pipelines <topics-item-pipeline>`.

-   The
    :meth:`~scrapy.downloadermiddlewares.DownloaderMiddleware.process_request`,
    :meth:`~scrapy.downloadermiddlewares.DownloaderMiddleware.process_response`,
    and
    :meth:`~scrapy.downloadermiddlewares.DownloaderMiddleware.process_exception`
    methods of
    :ref:`downloader middlewares <topics-downloader-middleware-custom>`.

-   The
    :meth:`~scrapy.spidermiddlewares.SpiderMiddleware.process_spider_output`
    method of :ref:`spider middlewares <topics-spider-middleware>`.

    If defined as a coroutine, it must be an :term:`asynchronous generator`.
    The input ``result`` parameter is an :term:`asynchronous iterable`.

    See also :ref:`sync-async-spider-middleware` and
    :ref:`universal-spider-middleware`.

-   The :meth:`~scrapy.spidermiddlewares.SpiderMiddleware.process_start` method
    of :ref:`spider middlewares <custom-spider-middleware>`, which *must* be
    defined as an :term:`asynchronous generator`.

    .. versionadded:: 2.13

-   :ref:`Signal handlers that support deferreds <signal-deferred>`.

-   Methods of :ref:`download handlers <topics-download-handlers>`.

    .. versionadded:: 2.14

.. _coroutine-deferred-apis:

Using Deferred-based APIs
=========================

In addition to native coroutine APIs, Scrapy has some APIs that return a
:class:`~twisted.internet.defer.Deferred` object or take a user-supplied
function that returns a :class:`~twisted.internet.defer.Deferred` object. These
APIs are also asynchronous but don't yet support native ``async def`` syntax.
In the future we plan to add support for the ``async def`` syntax to these APIs
or replace them with other APIs where changing the existing ones isn't
possible.

These APIs have a coroutine-based implementation and a Deferred-based one:

-   :class:`scrapy.crawler.Crawler`:

    - :meth:`~scrapy.crawler.Crawler.crawl_async` (coroutine-based) and
      :meth:`~scrapy.crawler.Crawler.crawl` (Deferred-based): the former
      may be inconvenient to use in Deferred-based code, so both are
      available; this may change in a future Scrapy version.

-   :class:`scrapy.crawler.AsyncCrawlerRunner` and its subclass
    :class:`scrapy.crawler.AsyncCrawlerProcess` (coroutine-based) and
    :class:`scrapy.crawler.CrawlerRunner` and its subclass
    :class:`scrapy.crawler.CrawlerProcess` (Deferred-based): the former
    doesn't support non-default reactors and so the latter should be used
    with those.

The following user-supplied methods can return
:class:`~twisted.internet.defer.Deferred` objects (the methods that can also
return coroutines are listed in :ref:`coroutine-support`):

-   Custom downloader implementations (see :setting:`DOWNLOADER`):

    - ``fetch()``

-   Custom scheduler implementations (see :setting:`SCHEDULER`):

    - :meth:`~scrapy.core.scheduler.BaseScheduler.open`

    - :meth:`~scrapy.core.scheduler.BaseScheduler.close`

-   Custom dupefilters (see :setting:`DUPEFILTER_CLASS`):

    - ``open()``

    - ``close()``

-   Custom feed storages (see :setting:`FEED_STORAGES`):

    - ``store()``

-   Subclasses of :class:`scrapy.pipelines.media.MediaPipeline`:

    - ``media_to_download()``

    - ``item_completed()``

-   Custom storages used by subclasses of
    :class:`scrapy.pipelines.files.FilesPipeline`:

    - ``persist_file()``

    - ``stat_file()``

In most cases you can use these APIs in code that otherwise uses coroutines, by
wrapping a :class:`~twisted.internet.defer.Deferred` object into a
:class:`~asyncio.Future` object or vice versa. See :ref:`asyncio-await-dfd` for
more information about this.

For example: a custom scheduler needs to define an ``open()`` method that can
return a :class:`~twisted.internet.defer.Deferred` object. You can write a
method that works with Deferreds and returns one directly, or you can write a
coroutine and convert it into a function that returns a Deferred with
:func:`~scrapy.utils.defer.deferred_f_from_coro_f`.
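
For example, a minimal sketch of the second approach; the
``connect_to_database()`` helper is hypothetical, and the other required
scheduler methods are omitted for brevity:

.. code-block:: python

    from scrapy.core.scheduler import BaseScheduler
    from scrapy.utils.defer import deferred_f_from_coro_f

    class MyScheduler(BaseScheduler):
        # Other required scheduler methods are omitted for brevity.

        @deferred_f_from_coro_f
        async def open(self, spider):
            # Written as a coroutine; the decorator converts it into a
            # function that returns a Deferred, which Scrapy expects.
            await connect_to_database()  # hypothetical helper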

General usage
=============

There are several use cases for coroutines in Scrapy.

Code that would return Deferreds when written for previous Scrapy versions,
such as downloader middlewares and signal handlers, can be rewritten to be
shorter and cleaner:

.. code-block:: python

    from itemadapter import ItemAdapter

    class DbPipeline:
        def _update_item(self, data, item):
            adapter = ItemAdapter(item)
            adapter["field"] = data
            return item

        def process_item(self, item):
            adapter = ItemAdapter(item)
            dfd = db.get_some_data(adapter["id"])
            dfd.addCallback(self._update_item, item)
            return dfd

becomes:

.. code-block:: python

    from itemadapter import ItemAdapter

    class DbPipeline:
        async def process_item(self, item):
            adapter = ItemAdapter(item)
            adapter["field"] = await db.get_some_data(adapter["id"])
            return item

Coroutines may be used to call asynchronous code. This includes other
coroutines, functions that return Deferreds and functions that return
:term:`awaitable objects <awaitable>` such as :class:`~asyncio.Future`.
This means you can use many useful Python libraries providing such code:

.. skip: next
.. code-block:: python

    import aiohttp
    import treq
    from scrapy import Spider

    class MySpiderDeferred(Spider):
        # ...
        async def parse(self, response):
            additional_response = await treq.get("https://additional.url")
            additional_data = await treq.content(additional_response)
            # ... use response and additional_data to yield items and requests

    class MySpiderAsyncio(Spider):
        # ...
        async def parse(self, response):
            async with aiohttp.ClientSession() as session:
                async with session.get("https://additional.url") as additional_response:
                    additional_data = await additional_response.text()
            # ... use response and additional_data to yield items and requests

.. note:: Many libraries that use coroutines, such as `aio-libs`_, require the
          :mod:`asyncio` loop and to use them you need to
          :doc:`enable asyncio support in Scrapy<asyncio>`.

.. note:: If you want to ``await`` on Deferreds while using the asyncio reactor,
          you need to :ref:`wrap them<asyncio-await-dfd>`.

Common use cases for asynchronous code include:

* requesting data from websites, databases and other services (in
  :meth:`~scrapy.spiders.Spider.start`, callbacks, pipelines and
  middlewares);
* storing data in databases (in pipelines and middlewares);
* delaying the spider initialization until some external event (in the
  :signal:`spider_opened` handler);
* calling asynchronous Scrapy methods like :meth:`ExecutionEngine.download`
  (see :ref:`the screenshot pipeline example<ScreenshotPipeline>`).

.. _aio-libs: https://github.com/aio-libs

.. _inline-requests:

Inline requests
===============

The spider below shows how to send a request and await its response all from
within a spider callback:

.. code-block:: python

    from scrapy import Spider, Request

    class SingleRequestSpider(Spider):
        name = "single"
        start_urls = ["https://example.org/product"]

        async def parse(self, response, **kwargs):
            additional_request = Request("https://example.org/price")
            additional_response = await self.crawler.engine.download_async(
                additional_request
            )
            yield {
                "h1": response.css("h1").get(),
                "price": additional_response.css("#price").get(),
            }

You can also send multiple requests in parallel:

.. code-block:: python

    import asyncio

    from scrapy import Spider, Request

    class MultipleRequestsSpider(Spider):
        name = "multiple"
        start_urls = ["https://example.com/product"]

        async def parse(self, response, **kwargs):
            additional_requests = [
                Request("https://example.com/price"),
                Request("https://example.com/color"),
            ]
            tasks = []
            for r in additional_requests:
                task = self.crawler.engine.download_async(r)
                tasks.append(task)
            responses = await asyncio.gather(*tasks)
            yield {
                "h1": response.css("h1::text").get(),
                "price": responses[0][1].css(".price::text").get(),
                "price2": responses[1][1].css(".color::text").get(),
            }

.. _sync-async-spider-middleware:

Mixing synchronous and asynchronous spider middlewares
======================================================

The output of a :class:`~scrapy.Request` callback is passed as the ``result``
parameter to the
:meth:`~scrapy.spidermiddlewares.SpiderMiddleware.process_spider_output` method
of the first :ref:`spider middleware <topics-spider-middleware>` from the
:ref:`list of active spider middlewares <topics-spider-middleware-setting>`.
Then the output of that ``process_spider_output`` method is passed to the
``process_spider_output`` method of the next spider middleware, and so on for
every active spider middleware.

Scrapy supports mixing :ref:`coroutine methods <async>` and synchronous methods
in this chain of calls.

However, if any of the ``process_spider_output`` methods is defined as a
synchronous method, and the previous ``Request`` callback or
``process_spider_output`` method is a coroutine, there are some drawbacks to
the asynchronous-to-synchronous conversion that Scrapy does so that the
synchronous ``process_spider_output`` method gets a synchronous iterable as its
``result`` parameter:

-   The whole output of the previous ``Request`` callback or
    ``process_spider_output`` method is awaited at this point.

-   If an exception is raised while awaiting the output of the previous
    ``Request`` callback or ``process_spider_output`` method, none of that
    output will be processed.

    This contrasts with the regular behavior, where all items yielded before
    an exception is raised are processed.

Asynchronous-to-synchronous conversions are supported for backward
compatibility, but they are deprecated and will stop working in a future
version of Scrapy.

To avoid asynchronous-to-synchronous conversions, when defining ``Request``
callbacks as coroutine methods or when using spider middlewares whose
``process_spider_output`` method is an :term:`asynchronous generator`, all
active spider middlewares must either have their ``process_spider_output``
method defined as an asynchronous generator or :ref:`define a
process_spider_output_async method <universal-spider-middleware>`.

.. _sync-async-spider-middleware-users:

For middleware users
--------------------

If you have asynchronous callbacks or use asynchronous-only spider middlewares
you should make sure the asynchronous-to-synchronous conversions
:ref:`described above <sync-async-spider-middleware>` don't happen. To do this,
make sure all spider middlewares you use support asynchronous spider output.
Even if you don't have asynchronous callbacks and don't use asynchronous-only
spider middlewares in your project, it's still a good idea to make sure all
middlewares you use support asynchronous spider output, so that it will be easy
to start using asynchronous callbacks in the future. Because of this, Scrapy
logs a warning when it detects a synchronous-only spider middleware.

If you want to update middlewares you wrote, see the :ref:`following section
<sync-async-spider-middleware-authors>`. If you have 3rd-party middlewares that
aren't yet updated by their authors, you can :ref:`subclass <tut-inheritance>`
them to make them :ref:`universal <universal-spider-middleware>` and use the
subclasses in your projects.

.. _sync-async-spider-middleware-authors:

For middleware authors
----------------------

If you have a spider middleware that defines a synchronous
``process_spider_output`` method, you should update it to support asynchronous
spider output for :ref:`better compatibility <sync-async-spider-middleware>`,
even if you don't yet use it with asynchronous callbacks, especially if you
publish this middleware for other people to use. You have two options for this:

1. Make the middleware asynchronous, by making the ``process_spider_output``
   method an :term:`asynchronous generator`.
2. Make the middleware universal, as described in the :ref:`next section
   <universal-spider-middleware>`.

If your middleware won't be used in projects with synchronous-only middlewares,
e.g. because it's an internal middleware and you know that all other
middlewares in your projects are already updated, it's safe to choose the first
option. Otherwise, it's better to choose the second option.

.. _universal-spider-middleware:

Universal spider middlewares
----------------------------

To allow writing a spider middleware that supports asynchronous execution of
its ``process_spider_output`` method in Scrapy 2.7 and later (avoiding
:ref:`asynchronous-to-synchronous conversions <sync-async-spider-middleware>`)
while maintaining support for older Scrapy versions, you may define
``process_spider_output`` as a synchronous method and define an
:term:`asynchronous generator` version of that method with an alternative name:
``process_spider_output_async``.

For example:

.. code-block:: python

    class UniversalSpiderMiddleware:
        def process_spider_output(self, response, result):
            for r in result:
                # ... do something with r
                yield r

        async def process_spider_output_async(self, response, result):
            async for r in result:
                # ... do something with r
                yield r

.. note:: This is an interim measure to allow, for a time, writing code that
          works in Scrapy 2.7 and later without requiring
          asynchronous-to-synchronous conversions, and works in earlier Scrapy
          versions as well.

          In some future version of Scrapy, however, this feature will be
          deprecated and, eventually, in a later version of Scrapy, this
          feature will be removed, and all spider middlewares will be expected
          to define their ``process_spider_output`` method as an asynchronous
          generator.

Since 2.13.0, Scrapy provides a base class,
:class:`~scrapy.spidermiddlewares.base.BaseSpiderMiddleware`, which implements
the ``process_spider_output()`` and ``process_spider_output_async()`` methods,
so instead of duplicating the processing code you can override the
``get_processed_request()`` and/or the ``get_processed_item()`` method.
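
For example, a minimal sketch, assuming the hook signatures described in the
:class:`~scrapy.spidermiddlewares.base.BaseSpiderMiddleware` reference:

.. code-block:: python

    from scrapy.spidermiddlewares.base import BaseSpiderMiddleware

    class SourceUrlMiddleware(BaseSpiderMiddleware):
        def get_processed_item(self, item, response):
            # Tag every scraped item with the URL it was extracted from
            # (assumes dict-like items with a "source_url" field).
            item["source_url"] = response.url
            return item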


.. _topics-debug:

=================
Debugging Spiders
=================

This document explains the most common techniques for debugging spiders.
Consider the following Scrapy spider:

.. skip: next
.. code-block:: python

    import scrapy
    from myproject.items import MyItem

    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = (
            "http://example.com/page1",
            "http://example.com/page2",
        )

        def parse(self, response):
            # <processing code not shown>
            # collect `item_urls`
            for item_url in item_urls:
                yield scrapy.Request(item_url, self.parse_item)

        def parse_item(self, response):
            # <processing code not shown>
            item = MyItem()
            # populate `item` fields
            # and extract item_details_url
            yield scrapy.Request(
                item_details_url, self.parse_details, cb_kwargs={"item": item}
            )

        def parse_details(self, response, item):
            # populate more `item` fields
            return item

Basically this is a simple spider which parses two pages of items (the
start_urls). Items also have a details page with additional information, so we
use the ``cb_kwargs`` functionality of :class:`~scrapy.Request` to pass a
partially populated item.

Parse Command
=============

The most basic way of checking the output of your spider is to use the
:command:`parse` command. It allows you to check the behaviour of different
parts of the spider at the method level. It has the advantage of being
flexible and simple to use, but it does not allow debugging code inside a
method.

.. highlight:: none

.. skip: start

In order to see the item scraped from a specific url::

    $ scrapy parse --spider=myspider -c parse_item -d 2 <item_url>
    [ ... scrapy log lines crawling example.com spider ... ]

    >>> STATUS DEPTH LEVEL 2 <<<
    # Scraped Items  ------------------------------------------------------------
    [{'url': <item_url>}]

    # Requests  -----------------------------------------------------------------
    []

Using the ``--verbose`` or ``-v`` option we can see the status at each depth level::

    $ scrapy parse --spider=myspider -c parse_item -d 2 -v <item_url>
    [ ... scrapy log lines crawling example.com spider ... ]

    >>> DEPTH LEVEL: 1 <<<
    # Scraped Items  ------------------------------------------------------------
    []

    # Requests  -----------------------------------------------------------------
    [<GET item_details_url>]

    >>> DEPTH LEVEL: 2 <<<
    # Scraped Items  ------------------------------------------------------------
    [{'url': <item_url>}]

    # Requests  -----------------------------------------------------------------
    []

Checking items scraped from a single start_url can also be easily achieved
using::

    $ scrapy parse --spider=myspider -d 3 'http://example.com/page1'

.. skip: end

Scrapy Shell
============

While the :command:`parse` command is very useful for checking the behaviour
of a spider, it is of little help to check what happens inside a callback,
besides showing the response received and the output. How do you debug the
situation when ``parse_details`` sometimes receives no item?

.. highlight:: python

Fortunately, the :command:`shell` is your bread and butter in this case (see
:ref:`topics-shell-inspect-response`):

.. code-block:: python

    from scrapy.shell import inspect_response

    def parse_details(self, response, item=None):
        if item:
            # populate more `item` fields
            return item
        else:
            inspect_response(response, self)

See also: :ref:`topics-shell-inspect-response`.

Open in browser
===============

Sometimes you just want to see how a certain response looks in a browser; you
can use the :func:`~scrapy.utils.response.open_in_browser` function for that:

.. autofunction:: scrapy.utils.response.open_in_browser
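
For example, following the ``parse_details`` scenario above:

.. code-block:: python

    from scrapy.utils.response import open_in_browser

    def parse_details(self, response, item=None):
        if item:
            # populate more `item` fields
            return item
        # Open the received response in a local browser for inspection.
        open_in_browser(response)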

Logging
=======

Logging is another useful option for getting information about your spider run.
Although not as convenient, it comes with the advantage that the logs will be
available in all future runs should they be necessary again:

.. code-block:: python

    def parse_details(self, response, item=None):
        if item:
            # populate more `item` fields
            return item
        else:
            self.logger.warning("No item received for %s", response.url)

For more information, check the :ref:`topics-logging` section.

.. _debug-vscode:

Visual Studio Code
==================

.. highlight:: json

To debug spiders with Visual Studio Code you can use the following ``launch.json``::

    {
        "version": "0.1.0",
        "configurations": [
            {
                "name": "Python: Launch Scrapy Spider",
                "type": "python",
                "request": "launch",
                "module": "scrapy",
                "args": [
                    "runspider",
                    "${file}"
                ],
                "console": "integratedTerminal"
            }
        ]
    }

Also, make sure you enable "User Uncaught Exceptions", to catch exceptions in
your Scrapy spider.


.. _topics-deploy:

=================
Deploying Spiders
=================

This section describes the different options you have for deploying your Scrapy
spiders to run them on a regular basis. Running Scrapy spiders on your local
machine is very convenient for the (early) development stage, but not so much
when you need to execute long-running spiders or move spiders to run in
production continuously. This is where the solutions for deploying Scrapy
spiders come in.

Popular choices for deploying Scrapy spiders are:

* :ref:`Scrapyd <deploy-scrapyd>` (open source)
* :ref:`Zyte Scrapy Cloud <deploy-scrapy-cloud>` (cloud-based)

.. _deploy-scrapyd:

Deploying to a Scrapyd Server
=============================

`Scrapyd`_ is an open source application to run Scrapy spiders. It provides
a server with HTTP API, capable of running and monitoring Scrapy spiders.

To deploy spiders to Scrapyd, you can use the scrapyd-deploy tool provided by
the `scrapyd-client`_ package. Please refer to the `scrapyd-deploy
documentation`_ for more information.

Scrapyd is maintained by some of the Scrapy developers.

.. _deploy-scrapy-cloud:

Deploying to Zyte Scrapy Cloud
==============================

`Zyte Scrapy Cloud`_ is a hosted, cloud-based service by Zyte_, the company
behind Scrapy.

Zyte Scrapy Cloud removes the need to set up and monitor servers and provides a
nice UI to manage spiders and review scraped items, logs and stats.

To deploy spiders to Zyte Scrapy Cloud you can use the `shub`_ command line
tool.
Please refer to the `Zyte Scrapy Cloud documentation`_ for more information.

Zyte Scrapy Cloud is compatible with Scrapyd and one can switch between them
as needed; the configuration is read from the ``scrapy.cfg`` file just like
with ``scrapyd-deploy``.

.. _Scrapyd: https://github.com/scrapy/scrapyd
.. _scrapyd-client: https://github.com/scrapy/scrapyd-client
.. _scrapyd-deploy documentation: https://scrapyd.readthedocs.io/en/latest/deploy.html
.. _shub: https://shub.readthedocs.io/en/latest/
.. _Zyte: https://www.zyte.com/
.. _Zyte Scrapy Cloud: https://www.zyte.com/scrapy-cloud/
.. _Zyte Scrapy Cloud documentation: https://docs.zyte.com/scrapy-cloud.html


.. _topics-developer-tools:

=================================================
Using your browser's Developer Tools for scraping
=================================================

Here is a general guide on how to use your browser's Developer Tools
to ease the scraping process. Today almost all browsers come with
built-in `Developer Tools`_ and, although we will use Firefox in this
guide, the concepts are applicable to any other browser.

In this guide we'll introduce the basic tools to use from a browser's
Developer Tools by scraping `quotes.toscrape.com`_.

.. _topics-livedom:

Caveats with inspecting the live browser DOM
============================================

Since Developer Tools operate on a live browser DOM, what you'll actually see
when inspecting the page source is not the original HTML, but a modified one
after applying some browser clean-up and executing JavaScript code. Firefox,
in particular, is known for adding ``<tbody>`` elements to tables. Scrapy, on
the other hand, does not modify the original page HTML, so you won't be able
to extract any data if you use ``<tbody>`` in your XPath expressions.

Therefore, you should keep in mind the following things:

* Disable JavaScript while inspecting the DOM looking for XPaths to be
  used in Scrapy (in the Developer Tools settings click `Disable JavaScript`)

* Never use full XPath paths, use relative and clever ones based on attributes
  (such as ``id``, ``class``, ``width``, etc) or any identifying features like
  ``contains(@href, 'image')``.

* Never include ``<tbody>`` elements in your XPath expressions unless you
  really know what you're doing

.. _topics-inspector:

Inspecting a website
====================

By far the most handy feature of the Developer Tools is the `Inspector`
feature, which allows you to inspect the underlying HTML code of
any webpage. To demonstrate the Inspector, let's look at the
`quotes.toscrape.com`_-site.

On the site we have a total of ten quotes from various authors with specific
tags, as well as the Top Ten Tags. Let's say we want to extract all the quotes
on this page, without any meta-information about authors, tags, etc.

Instead of viewing the whole source code for the page, we can simply right click
on a quote and select ``Inspect Element (Q)``, which opens up the `Inspector`.
In it you should see something like this:

.. image:: https://Scrapy.readthedocs.io/en/latest/_images/inspector_01.png
   :width: 777
   :height: 469
   :alt: Firefox's Inspector-tool

The interesting part for us is this:

.. code-block:: html

    <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">(...)</span>
      <span>(...)</span>
      <div class="tags">(...)</div>
    </div>

If you hover over the first ``div`` directly above the ``span`` tag highlighted
in the screenshot, you'll see that the corresponding section of the webpage gets
highlighted as well. So now we have a section, but we can't find our quote text
anywhere.

The advantage of the `Inspector` is that it automatically expands and collapses
sections and tags of a webpage, which greatly improves readability. You can
expand and collapse a tag by clicking on the arrow in front of it or by double
clicking directly on the tag. If we expand the ``span`` tag with
``class="text"`` we will see the quote-text we clicked on. The `Inspector`
lets you copy XPaths to selected elements. Let's try it out.

First open the Scrapy shell at https://quotes.toscrape.com/ in a terminal:

.. code-block:: none

    $ scrapy shell "https://quotes.toscrape.com/"

Then, back to your web browser, right-click on the ``span`` tag, select
``Copy > XPath`` and paste it in the Scrapy shell like so:

.. invisible-code-block: python

    response = load_response('https://quotes.toscrape.com/', 'quotes.html')

.. code-block:: pycon

  >>> response.xpath("/html/body/div/div[2]/div[1]/div[1]/span[1]/text()").getall()
  ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']

By adding ``text()`` at the end we are able to extract the first quote with
this basic selector. But this XPath is not really that clever. All it does is
go down a desired path in the source code starting from ``html``. So let's
see if we can refine our XPath a bit:

If we check the `Inspector` again we'll see that directly beneath our
expanded ``div`` tag we have nine identical ``div`` tags, each with the
same attributes as our first. If we expand any of them, we'll see the same
structure as with our first quote: two ``span`` tags and one ``div`` tag. We
can expand each ``span`` tag with ``class="text"`` inside our ``div`` tags and
see each quote:

.. code-block:: html

    <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
        “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
      </span>
      <span>(...)</span>
      <div class="tags">(...)</div>
    </div>

With this knowledge we can refine our XPath: Instead of a path to follow,
we'll simply select all ``span`` tags with the ``class="text"`` by using
the `has-class-extension`_:

.. code-block:: pycon

    >>> response.xpath('//span[has-class("text")]/text()').getall()
    ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
    '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
    '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
    ...]

And with one simple, cleverer XPath we are able to extract all quotes from
the page. We could have constructed a loop over our first XPath to increase
the number of the last ``div``, but this would have been unnecessarily
complex and by simply constructing an XPath with ``has-class("text")``
we were able to extract all quotes in one line.

The `Inspector` has a lot of other helpful features, such as searching in the
source code or directly scrolling to an element you selected. Let's demonstrate
a use case:

Say you want to find the ``Next`` button on the page. Type ``Next`` into the
search bar on the top right of the `Inspector`. You should get two results.
The first is a ``li`` tag with the ``class="next"``, the second the text
of an ``a`` tag. Right click on the ``a`` tag and select ``Scroll into View``.
If you hover over the tag, you'll see the button highlighted. From here
we could easily create a :ref:`Link Extractor <topics-link-extractors>` to
follow the pagination. On a simple site such as this, there may not be
the need to find an element visually but the ``Scroll into View`` function
can be quite useful on complex sites.

Note that the search bar can also be used to search for and test CSS
selectors. For example, you could search for ``span.text`` to find
all quote texts. Instead of a full text search, this searches for
exactly the ``span`` tag with the ``class="text"`` in the page.

.. _topics-network-tool:

The Network-tool
================

While scraping you may come across dynamic webpages where some parts
of the page are loaded dynamically through multiple requests. While
this can be quite tricky, the `Network`-tool in the Developer Tools
greatly facilitates this task. To demonstrate the Network-tool, let's
take a look at the page `quotes.toscrape.com/scroll`_.

The page is quite similar to the basic `quotes.toscrape.com`_-page,
but instead of the above-mentioned ``Next`` button, the page
automatically loads new quotes when you scroll to the bottom. We
could go ahead and try out different XPaths directly, but instead
we'll check another quite useful command from the Scrapy shell:

.. skip: next

.. code-block:: none

  $ scrapy shell "quotes.toscrape.com/scroll"
  (...)
  >>> view(response)

A browser window should open with the webpage but with one
crucial difference: Instead of the quotes we just see a greenish
bar with the word ``Loading...``.

.. image:: https://scrapy.readthedocs.io/en/latest/_images/network_01.png
   :width: 777
   :height: 296
   :alt: Response from quotes.toscrape.com/scroll

The ``view(response)`` command lets us view the response that our
shell, or later our spider, receives from the server. Here we see
that some basic template is loaded which includes the title,
the login-button and the footer, but the quotes are missing. This
tells us that the quotes are being loaded from a different request
than ``quotes.toscrape.com/scroll``.

If you click on the ``Network`` tab, you will probably only see
two entries. The first thing we do is enable persistent logs by
clicking on ``Persist Logs``. If this option is disabled, the
log is automatically cleared each time you navigate to a different
page. Enabling this option is a good default, since it gives us
control over when to clear the logs.

If we reload the page now, we'll see the log get populated with six
new requests.

.. image:: https://scrapy.readthedocs.io/en/latest/_images/network_02.png
   :width: 777
   :height: 241
   :alt: Network tab with persistent logs and requests

Here we see every request that has been made when reloading the page
and can inspect each request and its response. So let's find out
where our quotes are coming from:

First click on the request with the name ``scroll``. On the right
you can now inspect the request. In ``Headers`` you'll find details
about the request headers, such as the URL, the method, the IP-address,
and so on. We'll ignore the other tabs and click directly on ``Response``.

What you should see in the ``Preview`` pane is the rendered HTML code,
which is exactly what we saw when we called ``view(response)`` in the
shell. Accordingly, the ``type`` of the request in the log is ``html``.
The other requests have types like ``css`` or ``js``, but what
interests us is the one request called ``quotes?page=1`` with the
type ``json``.

If we click on this request, we see that the request URL is
``https://quotes.toscrape.com/api/quotes?page=1`` and the response
is a JSON object that contains our quotes. We can also right-click
on the request and select ``Open in new tab`` to get a better overview.

.. image:: https://scrapy.readthedocs.io/en/latest/_images/network_03.png
   :width: 777
   :height: 375
   :alt: JSON-object returned from the quotes.toscrape API

With this response we can now easily parse the JSON object and
also request each page to get every quote on the site:

.. code-block:: python

    import json

    import scrapy

    class QuoteSpider(scrapy.Spider):
        name = "quote"
        allowed_domains = ["quotes.toscrape.com"]
        page = 1
        start_urls = ["https://quotes.toscrape.com/api/quotes?page=1"]

        def parse(self, response):
            data = json.loads(response.text)
            for quote in data["quotes"]:
                yield {"quote": quote["text"]}
            if data["has_next"]:
                self.page += 1
                url = f"https://quotes.toscrape.com/api/quotes?page={self.page}"
                yield scrapy.Request(url=url, callback=self.parse)

This spider starts at the first page of the quotes API. With each
response, we parse the ``response.text`` and assign it to ``data``.
This lets us operate on the JSON object like on a Python dictionary.
We iterate through the ``quotes`` and yield the ``quote["text"]``.
If the handy ``has_next`` element is ``true`` (try loading
`quotes.toscrape.com/api/quotes?page=10`_ in your browser or a
page-number greater than 10), we increment the ``page`` attribute
and ``yield`` a new request, inserting the incremented page-number
into our ``url``.
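
As a side note, since Scrapy 2.2 text responses also provide a
:meth:`~scrapy.http.TextResponse.json` shortcut, so the explicit
``json.loads`` call in the spider above could be replaced with:

.. code-block:: python

    data = response.json()  # equivalent to json.loads(response.text)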

.. _requests-from-curl:

On more complex websites, it could be difficult to reproduce the requests
easily, as you may need to add ``headers`` or ``cookies`` to make them work.
In those cases you can export the requests in `cURL <https://curl.se/>`_
format, by right-clicking on each of them in the network tool, and use the
:meth:`~scrapy.Request.from_curl` method to generate an equivalent
request:

.. code-block:: python

    from scrapy import Request

    request = Request.from_curl(
        "curl 'https://quotes.toscrape.com/api/quotes?page=1' -H 'User-Agent: Mozil"
        "la/5.0 (X11; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0' -H 'Acce"
        "pt: */*' -H 'Accept-Language: ca,en-US;q=0.7,en;q=0.3' --compressed -H 'X"
        "-Requested-With: XMLHttpRequest' -H 'Proxy-Authorization: Basic QFRLLTAzM"
        "zEwZTAxLTk5MWUtNDFiNC1iZWRmLTJjNGI4M2ZiNDBmNDpAVEstMDMzMTBlMDEtOTkxZS00MW"
        "I0LWJlZGYtMmM0YjgzZmI0MGY0' -H 'Connection: keep-alive' -H 'Referer: http"
        "://quotes.toscrape.com/scroll' -H 'Cache-Control: max-age=0'"
    )

Alternatively, if you want to know the arguments needed to recreate that
request you can use the :func:`~scrapy.utils.curl.curl_to_request_kwargs`
function to get a dictionary with the equivalent arguments:

.. autofunction:: scrapy.utils.curl.curl_to_request_kwargs

Note that to translate a cURL command into a Scrapy request,
you may use `curl2scrapy <https://michael-shub.github.io/curl2scrapy/>`_.

As you can see, with a few inspections in the `Network`-tool we
were able to easily replicate the dynamic requests of the scrolling
functionality of the page. Crawling dynamic pages can be quite
daunting and pages can be very complex, but it (mostly) boils down
to identifying the correct request and replicating it in your spider.

.. _Developer Tools: https://en.wikipedia.org/wiki/Web_development_tools
.. _quotes.toscrape.com: https://quotes.toscrape.com
.. _quotes.toscrape.com/scroll: https://quotes.toscrape.com/scroll
.. _quotes.toscrape.com/api/quotes?page=10: https://quotes.toscrape.com/api/quotes?page=10
.. _has-class-extension: https://parsel.readthedocs.io/en/latest/usage.html#other-xpath-extensions


:orphan:

.. _topics-djangoitem:

==========
DjangoItem
==========

DjangoItem has been moved into a separate project.

It is hosted at:

    https://github.com/scrapy-plugins/scrapy-djangoitem


.. _topics-download-handlers:

=================
Download handlers
=================

Download handlers are Scrapy :ref:`components <topics-components>` used to
download :ref:`requests <topics-request-response>` and produce responses from
them.

Using download handlers
=======================

The :setting:`DOWNLOAD_HANDLERS_BASE` and :setting:`DOWNLOAD_HANDLERS` settings
tell Scrapy which handler is responsible for a given URL scheme. Their values
are merged into a mapping from scheme names to handler classes. When Scrapy
initializes, it creates instances of all configured download handlers (except
for :ref:`lazy ones <lazy-download-handlers>`) and stores them in a similar
mapping. When Scrapy needs to download a request, it extracts the scheme from
its URL, finds the handler for this scheme, passes the request to it and gets a
response from it. If there is no handler for the scheme, the request is not
downloaded and a :exc:`~scrapy.exceptions.NotSupported` exception is raised.

The :setting:`DOWNLOAD_HANDLERS_BASE` setting contains the default mapping of
handlers. You can use the :setting:`DOWNLOAD_HANDLERS` setting to add handlers
for additional schemes and to replace or disable default ones:

.. code-block:: python

    DOWNLOAD_HANDLERS = {
        # disable support for ftp:// requests
        "ftp": None,
        # replace the default one for http://
        "http": "my.download_handlers.HttpHandler",
        # http:// and https:// are different schemes,
        # even though they may use the same handler
        "https": "my.download_handlers.HttpHandler",
        # support for any custom scheme can be added
        "sftp": "my.download_handlers.SftpHandler",
    }

Replacing HTTP(S) download handlers
-----------------------------------

While Scrapy provides a default handler for ``http`` and ``https`` schemes,
users may want to use a different handler, provided by Scrapy or by some
3rd-party package. There are several considerations to keep in mind related to
this.

First of all, as ``http`` and ``https`` are separate schemes, they need
separate entries in the :setting:`DOWNLOAD_HANDLERS` setting, even though it's
likely that the same handler class will be used for both schemes.

Additionally, some of the Scrapy settings, like :setting:`DOWNLOAD_MAXSIZE`,
are honored by the default HTTP(S) handler but not necessarily by alternative
ones. The same may apply to other Scrapy features, e.g. the
:signal:`bytes_received` and :signal:`headers_received` signals.

.. _lazy-download-handlers:

Lazy instantiation of download handlers
---------------------------------------

A download handler can be marked as "lazy" by setting its ``lazy`` class
attribute to ``True``. Such handlers are only instantiated when they need to
download their first request. This may be useful when the instantiation is slow
or requires dependencies that are not always available, and the handler is not
needed on every spider run. For example, :class:`the built-in S3 handler
<.S3DownloadHandler>` is lazy.

Writing your own download handler
=================================

A download handler is a :ref:`component <topics-components>` that defines
the following API:

.. class:: SampleDownloadHandler

    .. attribute:: lazy
        :type: bool

        If ``False``, the handler will be instantiated when Scrapy is
        initialized.

        If ``True``, the handler will only be instantiated when the first
        request handled by it needs to be downloaded.

    .. method:: download_request(request: Request) -> Response:
        :async:

        Download the given request and return a response.

    .. method:: close() -> None
        :async:

        Clean up any resources used by the handler.

An optional base class for custom handlers is provided:

.. autoclass:: scrapy.core.downloader.handlers.base.BaseDownloadHandler
    :members:
    :undoc-members:
    :member-order: bysource
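
For illustration, here is a minimal sketch of a custom handler built on that
base class. The class name and response body are hypothetical, and a real
handler would perform actual network I/O in ``download_request``:

.. code-block:: python

    from scrapy.core.downloader.handlers.base import BaseDownloadHandler
    from scrapy.http import Request, Response


    class StaticDownloadHandler(BaseDownloadHandler):
        # Instantiate only when the first matching request is downloaded.
        lazy = True

        async def download_request(self, request: Request) -> Response:
            # A real handler would download the request here.
            return Response(url=request.url, status=200, body=b"hello")

        async def close(self) -> None:
            # Release any resources (connections, sessions) held by the handler.
            pass

Such a handler would then be mapped to a URL scheme through the
:setting:`DOWNLOAD_HANDLERS` setting, as shown earlier.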

.. _download-handlers-exceptions:

Exceptions raised by download handlers
======================================

.. versionadded:: 2.15.0

The built-in download handlers raise Scrapy-specific exceptions instead of
implementation-specific ones, so that code that handles these exceptions can be
written in a generic way. We recommend that custom download handlers raise
these exceptions as well.

.. autoexception:: scrapy.exceptions.CannotResolveHostError

.. autoexception:: scrapy.exceptions.DownloadCancelledError

.. autoexception:: scrapy.exceptions.DownloadConnectionRefusedError

.. autoexception:: scrapy.exceptions.DownloadFailedError

.. autoexception:: scrapy.exceptions.DownloadTimeoutError

.. autoexception:: scrapy.exceptions.ResponseDataLossError

.. autoexception:: scrapy.exceptions.UnsupportedURLSchemeError
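
For example, a request errback could react specifically to download timeouts.
This is a minimal sketch, assuming a Scrapy version that provides these
exceptions; the spider name and URL are hypothetical:

.. code-block:: python

    import scrapy
    from scrapy.exceptions import DownloadTimeoutError


    class TimeoutAwareSpider(scrapy.Spider):
        name = "timeout_aware"

        def start_requests(self):
            yield scrapy.Request(
                "https://example.com", callback=self.parse, errback=self.on_error
            )

        def parse(self, response):
            yield {"url": response.url}

        def on_error(self, failure):
            # failure is a twisted.python.failure.Failure wrapping the exception
            if failure.check(DownloadTimeoutError):
                self.logger.warning("Timed out: %s", failure.request.url)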

.. _download-handlers-ref:

Built-in download handlers reference
====================================

DataURIDownloadHandler
----------------------

.. autoclass:: scrapy.core.downloader.handlers.datauri.DataURIDownloadHandler

| Supported scheme: ``data``.
| Lazy: no.

This handler supports RFC 2397 ``data:content/type;base64,`` data URIs.

FileDownloadHandler
-------------------

.. autoclass:: scrapy.core.downloader.handlers.file.FileDownloadHandler

| Supported scheme: ``file``.
| Lazy: no.

This handler supports ``file:///path`` local file URIs. It doesn't
support remote files.

FTPDownloadHandler
------------------

.. autoclass:: scrapy.core.downloader.handlers.ftp.FTPDownloadHandler

| Supported scheme: ``ftp``.
| Lazy: no.

This handler supports ``ftp://host/path`` FTP URIs.

It's implemented using :mod:`twisted.protocols.ftp`.

.. note::
    This handler is not supported when :setting:`TWISTED_REACTOR_ENABLED` is ``False``.

.. _twisted-http2-handler:

H2DownloadHandler
-----------------

.. autoclass:: scrapy.core.downloader.handlers.http2.H2DownloadHandler

| Supported scheme: ``https``.
| Lazy: yes.

This handler supports ``https://host/path`` URLs and uses the HTTP/2 protocol
for them.

It's implemented using :mod:`twisted.web.client` and the ``h2`` library.

For this handler to work you need to install the ``Twisted[http2]`` extra
dependency.

If you want to use this handler you need to replace the default one for the
``https`` scheme:

.. code-block:: python

    DOWNLOAD_HANDLERS = {
        "https": "scrapy.core.downloader.handlers.http2.H2DownloadHandler",
    }

.. warning::

    This handler is experimental, and not yet recommended for production
    environments. Future Scrapy versions may introduce related changes without
    a deprecation period or warning.

.. note::

    Known limitations of the HTTP/2 implementation in this handler include:

    -   No support for HTTP/2 Cleartext (h2c), since no major browser supports
        HTTP/2 unencrypted (see the `http2 faq`_).

    -   No setting to specify a maximum `frame size`_ larger than the default
        value, 16384. Connections to servers that send a larger frame will
        fail.

    -   No support for `server pushes`_, which are ignored.

    -   No support for the :signal:`bytes_received` and
        :signal:`headers_received` signals.

.. _frame size: https://datatracker.ietf.org/doc/html/rfc7540#section-4.2
.. _http2 faq: https://http2.github.io/faq/#does-http2-require-encryption
.. _server pushes: https://datatracker.ietf.org/doc/html/rfc7540#section-8.2

.. note::
    This handler is not supported when :setting:`TWISTED_REACTOR_ENABLED` is ``False``.

HTTP11DownloadHandler
---------------------

.. autoclass:: scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler

| Supported schemes: ``http``, ``https``.
| Lazy: no.

This handler supports ``http://host/path`` and ``https://host/path`` URLs and
uses the HTTP/1.1 protocol for them.

It's implemented using :mod:`twisted.web.client`.

.. note::
    This handler is not supported when :setting:`TWISTED_REACTOR_ENABLED` is ``False``.

HttpxDownloadHandler
--------------------

.. autoclass:: scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler

| Supported schemes: ``http``, ``https``.
| Lazy: no.

This handler supports ``http://host/path`` and ``https://host/path`` URLs and
uses the HTTP/1.1 protocol for them.

It's implemented using the ``httpx`` library, which must be installed.

If you want to use this handler you need to replace the default ones for the
``http`` and ``https`` schemes:

.. code-block:: python

    DOWNLOAD_HANDLERS = {
        "http": "scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler",
        "https": "scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler",
    }

.. warning::

    This handler is experimental, and not yet recommended for production
    environments. Future Scrapy versions may introduce related changes without
    a deprecation period or warning or even remove it altogether.

.. note::

    As this handler is based on a different HTTP client implementation compared
    to :class:`~.HTTP11DownloadHandler`, it's expected that its behavior on
    some websites may be different. Additionally, these are the Scrapy features
    that are explicitly not supported when using it:

    - Proxy support (the :reqmeta:`proxy` meta key).

    - Per-request bind address support (the :reqmeta:`bindaddress` meta key).
      The global :setting:`DOWNLOAD_BIND_ADDRESS` setting is supported but the
      port number, if specified, will be ignored.

    - The :setting:`DOWNLOADER_CLIENT_TLS_CIPHERS` and
      :setting:`DOWNLOADER_CLIENT_TLS_METHOD` settings.

    - Settings specific to the Twisted networking or HTTP implementation, like
      :setting:`DNS_RESOLVER`.

    - Using :ref:`non-asyncio reactors <disable-asyncio>` (``httpx`` requires
      ``asyncio``).

S3DownloadHandler
-----------------

.. autoclass:: scrapy.core.downloader.handlers.s3.S3DownloadHandler

| Supported scheme: ``s3``.
| Lazy: yes.

This handler supports ``s3://bucket/path`` S3 URIs.

It's implemented using the ``botocore`` library, which must be installed.


.. _topics-downloader-middleware:

=====================
Downloader Middleware
=====================

The downloader middleware is a framework of hooks into Scrapy's
request/response processing.  It's a light, low-level system for globally
altering Scrapy's requests and responses.

.. _topics-downloader-middleware-setting:

Activating a downloader middleware
==================================

To activate a downloader middleware component, add it to the
:setting:`DOWNLOADER_MIDDLEWARES` setting, which is a dict whose keys are the
middleware class paths and their values are the middleware orders.

Here's an example:

.. code-block:: python

    DOWNLOADER_MIDDLEWARES = {
        "myproject.middlewares.CustomDownloaderMiddleware": 543,
    }

The :setting:`DOWNLOADER_MIDDLEWARES` setting is merged with the
:setting:`DOWNLOADER_MIDDLEWARES_BASE` setting defined in Scrapy (and not meant
to be overridden) and then sorted by order to get the final sorted list of
enabled middlewares: the first middleware is the one closer to the engine and
the last is the one closer to the downloader. In other words,
the :meth:`~scrapy.downloadermiddlewares.DownloaderMiddleware.process_request`
method of each middleware will be invoked in increasing
middleware order (100, 200, 300, ...) and the :meth:`~scrapy.downloadermiddlewares.DownloaderMiddleware.process_response` method
of each middleware will be invoked in decreasing order.

To decide which order to assign to your middleware see the
:setting:`DOWNLOADER_MIDDLEWARES_BASE` setting and pick a value according to
where you want to insert the middleware. The order does matter because each
middleware performs a different action and your middleware could depend on some
previous (or subsequent) middleware being applied.

If you want to disable a built-in middleware (the ones defined in
:setting:`DOWNLOADER_MIDDLEWARES_BASE` and enabled by default) you must define it
in your project's :setting:`DOWNLOADER_MIDDLEWARES` setting and assign ``None``
as its value.  For example, if you want to disable the user-agent middleware:

.. code-block:: python

    DOWNLOADER_MIDDLEWARES = {
        "myproject.middlewares.CustomDownloaderMiddleware": 543,
        "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    }

Finally, keep in mind that some middlewares may need to be enabled through a
particular setting. See each middleware documentation for more info.

.. _topics-downloader-middleware-custom:

Writing your own downloader middleware
======================================

Each downloader middleware is a :ref:`component <topics-components>` that
defines one or more of these methods:

.. module:: scrapy.downloadermiddlewares

.. class:: DownloaderMiddleware

   .. note::  Any of the downloader middleware methods may be defined as a
        coroutine function (``async def``).

   .. method:: process_request(request)

      This method is called for each request that goes through the download
      middleware.

      :meth:`process_request` should either: return ``None``, return a
      :class:`~scrapy.http.Response` object, return a :class:`~scrapy.Request`
      object, or raise :exc:`~scrapy.exceptions.IgnoreRequest`.

      If it returns ``None``, Scrapy will continue processing this request, executing all
      other middlewares until, finally, the appropriate download handler is called,
      the request is performed and its response downloaded.

      If it returns a :class:`~scrapy.http.Response` object, Scrapy won't bother
      calling *any* other :meth:`process_request` or :meth:`process_exception` methods,
      or the appropriate download function; it'll return that response. The :meth:`process_response`
      methods of installed middleware are always called on every response.

      If it returns a :class:`~scrapy.Request` object, Scrapy will stop calling
      :meth:`process_request` methods and reschedule the returned request. Once the newly returned
      request is performed, the appropriate middleware chain will be called on
      the downloaded response.

      If it raises an :exc:`~scrapy.exceptions.IgnoreRequest` exception, the
      :meth:`process_exception` methods of installed downloader middleware will be called.
      If none of them handle the exception, the errback function of the request
      (``Request.errback``) is called. If no code handles the raised exception, it is
      ignored and not logged (unlike other exceptions).

      :param request: the request being processed
      :type request: :class:`~scrapy.Request` object

   .. method:: process_response(request, response)

      :meth:`process_response` should either: return a :class:`~scrapy.http.Response`
      object, return a :class:`~scrapy.Request` object or
      raise a :exc:`~scrapy.exceptions.IgnoreRequest` exception.

      If it returns a :class:`~scrapy.http.Response` (it could be the same given
      response, or a brand-new one), that response will continue to be processed
      with the :meth:`process_response` of the next middleware in the chain.

      If it returns a :class:`~scrapy.Request` object, the middleware chain is
      halted and the returned request is rescheduled to be downloaded in the future.
      This is the same behavior as if a request is returned from :meth:`process_request`.

      If it raises an :exc:`~scrapy.exceptions.IgnoreRequest` exception, the errback
      function of the request (``Request.errback``) is called. If no code handles the raised
      exception, it is ignored and not logged (unlike other exceptions).

      :param request: the request that originated the response
      :type request: :class:`~scrapy.Request` object

      :param response: the response being processed
      :type response: :class:`~scrapy.http.Response` object

   .. method:: process_exception(request, exception)

      Scrapy calls :meth:`process_exception` when a :ref:`download handler
      <topics-download-handlers>` or a :meth:`process_request` (from a
      downloader middleware) raises an exception (including an
      :exc:`~scrapy.exceptions.IgnoreRequest` exception).

      :meth:`process_exception` should return: either ``None``,
      a :class:`~scrapy.http.Response` object, or a :class:`~scrapy.Request` object.

      If it returns ``None``, Scrapy will continue processing this exception,
      executing any other :meth:`process_exception` methods of installed middleware,
      until no middleware is left and the default exception handling kicks in.

      If it returns a :class:`~scrapy.http.Response` object, the :meth:`process_response`
      method chain of installed middleware is started, and Scrapy won't bother calling
      any other :meth:`process_exception` methods of middleware.

      If it returns a :class:`~scrapy.Request` object, the returned request is
      rescheduled to be downloaded in the future. This stops the execution of
      :meth:`process_exception` methods of the middleware the same as returning a
      response would.

      :param request: the request that generated the exception
      :type request: :class:`~scrapy.Request` object

      :param exception: the raised exception
      :type exception: an ``Exception`` object
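
As an illustration, here is a minimal sketch of a middleware that follows the
API above. The class name and header are hypothetical; returning ``None`` from
``process_request`` and the response from ``process_response`` keeps the
default processing chain intact:

.. code-block:: python

    from scrapy import Request
    from scrapy.http import Response


    class CustomHeaderMiddleware:
        def process_request(self, request: Request):
            # Returning None continues processing this request normally.
            request.headers.setdefault("X-Example", "1")
            return None

        def process_response(self, request: Request, response: Response):
            # Must return a Response or a Request; here we pass the response on.
            return response

It would be activated through the :setting:`DOWNLOADER_MIDDLEWARES` setting,
as described in :ref:`topics-downloader-middleware-setting`.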

.. _topics-downloader-middleware-ref:

Built-in downloader middleware reference
========================================

This page describes all downloader middleware components that come with
Scrapy. For information on how to use them and how to write your own downloader
middleware, see the :ref:`downloader middleware usage guide
<topics-downloader-middleware>`.

For a list of the components enabled by default (and their orders) see the
:setting:`DOWNLOADER_MIDDLEWARES_BASE` setting.

.. _cookies-mw:

CookiesMiddleware
-----------------

.. module:: scrapy.downloadermiddlewares.cookies
   :synopsis: Cookies Downloader Middleware

.. class:: CookiesMiddleware

   This middleware enables working with sites that require cookies, such as
   those that use sessions. It keeps track of cookies sent by web servers, and
   sends them back on subsequent requests (from that spider), just like web
   browsers do.

   .. caution:: When non-UTF8 encoded byte sequences are passed to a
      :class:`~scrapy.Request`, the ``CookiesMiddleware`` will log
      a warning. Refer to :ref:`topics-logging-advanced-customization`
      to customize the logging behaviour.

   .. caution:: Cookies set via the ``Cookie`` header are not considered by the
      :ref:`cookies-mw`. If you need to set cookies for a request, use the
      :class:`Request.cookies <scrapy.Request>` parameter. This is a known
      current limitation that is being worked on.

The following settings can be used to configure the cookie middleware:

* :setting:`COOKIES_ENABLED`
* :setting:`COOKIES_DEBUG`

.. reqmeta:: cookiejar

Multiple cookie sessions per spider
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There is support for keeping multiple cookie sessions per spider by using the
:reqmeta:`cookiejar` Request meta key. By default it uses a single cookie jar
(session), but you can pass an identifier to use different ones.

For example:

.. skip: next
.. code-block:: python

    for i, url in enumerate(urls):
        yield scrapy.Request(url, meta={"cookiejar": i}, callback=self.parse_page)

Keep in mind that the :reqmeta:`cookiejar` meta key is not "sticky". You need to keep
passing it along on subsequent requests. For example:

.. code-block:: python

    def parse_page(self, response):
        # do some processing
        return scrapy.Request(
            "http://www.example.com/otherpage",
            meta={"cookiejar": response.meta["cookiejar"]},
            callback=self.parse_other_page,
        )

.. setting:: COOKIES_ENABLED

COOKIES_ENABLED
~~~~~~~~~~~~~~~

Default: ``True``

Whether to enable the cookies middleware. If disabled, no cookies will be sent
to web servers.

Note that, regardless of the value of the :setting:`COOKIES_ENABLED` setting,
if :reqmeta:`Request.meta['dont_merge_cookies'] <dont_merge_cookies>`
evaluates to ``True``, the request cookies will **not** be sent to the
web server and cookies received in :class:`~scrapy.http.Response` will
**not** be merged with the existing cookies.

For more detailed information see the ``cookies`` parameter in
:class:`~scrapy.Request`.

.. setting:: COOKIES_DEBUG

COOKIES_DEBUG
~~~~~~~~~~~~~

Default: ``False``

If enabled, Scrapy will log all cookies sent in requests (i.e. ``Cookie``
header) and all cookies received in responses (i.e. ``Set-Cookie`` header).

Here's an example of a log with :setting:`COOKIES_DEBUG` enabled::

    2011-04-06 14:35:10-0300 [scrapy.core.engine] INFO: Spider opened
    2011-04-06 14:35:10-0300 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET http://www.diningcity.com/netherlands/index.html>
            Cookie: clientlanguage_nl=en_EN
    2011-04-06 14:35:14-0300 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <200 http://www.diningcity.com/netherlands/index.html>
            Set-Cookie: JSESSIONID=B~FA4DC0C496C8762AE4F1A620EAB34F38; Path=/
            Set-Cookie: ip_isocode=US
            Set-Cookie: clientlanguage_nl=en_EN; Expires=Thu, 07-Apr-2011 21:21:34 GMT; Path=/
    2011-04-06 14:49:50-0300 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.diningcity.com/netherlands/index.html> (referer: None)
    [...]

DefaultHeadersMiddleware
------------------------

.. module:: scrapy.downloadermiddlewares.defaultheaders
   :synopsis: Default Headers Downloader Middleware

.. class:: DefaultHeadersMiddleware

    This middleware sets all default requests headers specified in the
    :setting:`DEFAULT_REQUEST_HEADERS` setting.

DownloadTimeoutMiddleware
-------------------------

.. module:: scrapy.downloadermiddlewares.downloadtimeout
   :synopsis: Download timeout middleware

.. class:: DownloadTimeoutMiddleware

    This middleware sets the download timeout for requests specified in the
    :setting:`DOWNLOAD_TIMEOUT` setting.

.. note::

    You can also set download timeout per-request using the
    :reqmeta:`download_timeout` :attr:`.Request.meta` key; this is supported
    even when DownloadTimeoutMiddleware is disabled.
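
For example (the URL is a placeholder):

.. code-block:: python

    yield scrapy.Request("https://example.com", meta={"download_timeout": 30})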

HttpAuthMiddleware
------------------

.. module:: scrapy.downloadermiddlewares.httpauth
   :synopsis: HTTP Auth downloader middleware

.. class:: HttpAuthMiddleware

    This middleware authenticates all requests generated from certain spiders
    using `Basic access authentication`_ (aka. HTTP auth).

    To enable HTTP authentication for a spider, set the ``http_user`` and
    ``http_pass`` spider attributes to the authentication data and the
    ``http_auth_domain`` spider attribute to the domain which requires this
    authentication (its subdomains will be also handled in the same way).
    You can set ``http_auth_domain`` to ``None`` to enable the
    authentication for all requests but you risk leaking your authentication
    credentials to unrelated domains.

    .. warning::
        In previous Scrapy versions HttpAuthMiddleware sent the authentication
        data with all requests, which is a security problem if the spider
        makes requests to several different domains. Currently if the
        ``http_auth_domain`` attribute is not set, the middleware will use the
        domain of the first request, which will work for some spiders but not
        for others. In the future the middleware will produce an error instead.

    Example:

    .. code-block:: python

        from scrapy.spiders import CrawlSpider

        class SomeIntranetSiteSpider(CrawlSpider):
            http_user = "someuser"
            http_pass = "somepass"
            http_auth_domain = "intranet.example.com"
            name = "intranet.example.com"

            # .. rest of the spider code omitted ...

.. _Basic access authentication: https://en.wikipedia.org/wiki/Basic_access_authentication

HttpCacheMiddleware
-------------------

.. module:: scrapy.downloadermiddlewares.httpcache
   :synopsis: HTTP Cache downloader middleware

.. class:: HttpCacheMiddleware

    This middleware provides low-level cache to all HTTP requests and responses.
    It has to be combined with a cache storage backend as well as a cache policy.

    Scrapy ships with the following HTTP cache storage backends:

        * :ref:`httpcache-storage-fs`
        * :ref:`httpcache-storage-dbm`

    You can change the HTTP cache storage backend with the :setting:`HTTPCACHE_STORAGE`
    setting. Or you can also :ref:`implement your own storage backend. <httpcache-storage-custom>`

    Scrapy ships with two HTTP cache policies:

        * :ref:`httpcache-policy-rfc2616`
        * :ref:`httpcache-policy-dummy`

    You can change the HTTP cache policy with the :setting:`HTTPCACHE_POLICY`
    setting. Or you can also implement your own policy.

    .. reqmeta:: dont_cache

    You can also avoid caching a response with any policy by setting the
    :reqmeta:`dont_cache` meta key to ``True``.

.. module:: scrapy.extensions.httpcache
   :noindex:

.. _httpcache-policy-dummy:

Dummy policy (default)
~~~~~~~~~~~~~~~~~~~~~~

.. class:: DummyPolicy

    This policy has no awareness of any HTTP Cache-Control directives.
    Every request and its corresponding response are cached.  When the same
    request is seen again, the response is returned without transferring
    anything from the Internet.

    The Dummy policy is useful for testing spiders faster (without having
    to wait for downloads every time) and for trying your spider offline,
    when an Internet connection is not available. The goal is to be able to
    "replay" a spider run *exactly as it ran before*.

.. _httpcache-policy-rfc2616:

RFC2616 policy
~~~~~~~~~~~~~~

.. class:: RFC2616Policy

    This policy provides an RFC2616-compliant HTTP cache, i.e. with HTTP
    Cache-Control awareness, aimed at production and used in continuous
    runs to avoid downloading unmodified data (to save bandwidth and speed up
    crawls).

    What is implemented:

    * Do not attempt to store responses/requests with ``no-store`` cache-control directive set
    * Do not serve responses from cache if ``no-cache`` cache-control directive is set even for fresh responses
    * Compute freshness lifetime from ``max-age`` cache-control directive
    * Compute freshness lifetime from ``Expires`` response header
    * Compute freshness lifetime from ``Last-Modified`` response header (heuristic used by Firefox)
    * Compute current age from ``Age`` response header
    * Compute current age from ``Date`` header
    * Revalidate stale responses based on ``Last-Modified`` response header
    * Revalidate stale responses based on ``ETag`` response header
    * Set ``Date`` header for any received response missing it
    * Support ``max-stale`` cache-control directive in requests

    This allows spiders to be configured with the full RFC2616 cache policy,
    but avoid revalidation on a request-by-request basis, while remaining
    conformant with the HTTP spec.

    Example:

    Add ``Cache-Control: max-stale=600`` to Request headers to accept responses that
    have exceeded their expiration time by no more than 600 seconds.

    See also: RFC2616, 14.9.3

    What is missing:

    * ``Pragma: no-cache`` support https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9.1
    * ``Vary`` header support https://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.6
    * Invalidation after updates or deletes https://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.10
    * ... probably others ..

.. _httpcache-storage-fs:

Filesystem storage backend (default)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. class:: FilesystemCacheStorage

    File system storage backend is available for the HTTP cache middleware.

    Each request/response pair is stored in a different directory containing
    the following files:

    *   ``request_body`` - the plain request body

    *   ``request_headers`` - the request headers (in raw HTTP format)

    *   ``response_body`` - the plain response body

    *   ``response_headers`` - the response headers (in raw HTTP format)

    *   ``meta`` - some metadata of this cache resource in Python ``repr()``
        format (grep-friendly format)

    *   ``pickled_meta`` - the same metadata in ``meta`` but pickled for more
        efficient deserialization

    The directory name is made from the request fingerprint (see
    ``scrapy.utils.request.fingerprint``), and one level of subdirectories is
    used to avoid creating too many files into the same directory (which is
    inefficient in many file systems). An example directory could be::

        /path/to/cache/dir/example.com/72/72811f648e718090f041317756c03adb0ada46c7

.. _httpcache-storage-dbm:

DBM storage backend
~~~~~~~~~~~~~~~~~~~

.. class:: DbmCacheStorage

    A DBM_ storage backend is also available for the HTTP cache middleware.

    By default, it uses the :mod:`dbm` module, but you can change it with the
    :setting:`HTTPCACHE_DBM_MODULE` setting.

.. _httpcache-storage-custom:

Writing your own storage backend
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can implement a cache storage backend by creating a Python class that
defines the methods described below.

.. module:: scrapy.extensions.httpcache

.. class:: CacheStorage

    .. method:: open_spider(spider)

      This method gets called after a spider has been opened for crawling. It handles
      the :signal:`spider_opened` signal.

      :param spider: the spider which has been opened
      :type spider: :class:`~scrapy.Spider` object

    .. method:: close_spider(spider)

      This method gets called after a spider has been closed. It handles
      the :signal:`spider_closed` signal.

      :param spider: the spider which has been closed
      :type spider: :class:`~scrapy.Spider` object

    .. method:: retrieve_response(spider, request)

      Return response if present in cache, or ``None`` otherwise.

      :param spider: the spider which generated the request
      :type spider: :class:`~scrapy.Spider` object

      :param request: the request to find cached response for
      :type request: :class:`~scrapy.Request` object

    .. method:: store_response(spider, request, response)

      Store the given response in the cache.

      :param spider: the spider for which the response is intended
      :type spider: :class:`~scrapy.Spider` object

      :param request: the corresponding request the spider generated
      :type request: :class:`~scrapy.Request` object

      :param response: the response to store in the cache
      :type response: :class:`~scrapy.http.Response` object

In order to use your storage backend, set:

* :setting:`HTTPCACHE_STORAGE` to the Python import path of your custom storage class.
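
For illustration, here is a minimal in-memory backend implementing those
methods. This is a sketch, not production code: it assumes the storage class
is instantiated with the settings object, and it naively keys the cache on the
bare URL instead of the request fingerprint. You would then point
:setting:`HTTPCACHE_STORAGE` at its import path.

.. code-block:: python

    class InMemoryCacheStorage:
        def __init__(self, settings=None):
            # Maps a request URL to its cached Response object.
            self._cache = {}

        def open_spider(self, spider):
            pass

        def close_spider(self, spider):
            self._cache.clear()

        def retrieve_response(self, spider, request):
            # Return the cached response, or None on a cache miss.
            return self._cache.get(request.url)

        def store_response(self, spider, request, response):
            # A real backend should key on the request fingerprint
            # (scrapy.utils.request.fingerprint), not the bare URL.
            self._cache[request.url] = response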

HTTPCache middleware settings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The :class:`HttpCacheMiddleware` can be configured through the following
settings:

.. setting:: HTTPCACHE_ENABLED

HTTPCACHE_ENABLED
^^^^^^^^^^^^^^^^^

Default: ``False``

Whether the HTTP cache will be enabled.

.. setting:: HTTPCACHE_EXPIRATION_SECS

HTTPCACHE_EXPIRATION_SECS
^^^^^^^^^^^^^^^^^^^^^^^^^

Default: ``0``

Expiration time for cached requests, in seconds.

Cached requests older than this time will be re-downloaded. If zero, cached
requests will never expire.

.. setting:: HTTPCACHE_DIR

HTTPCACHE_DIR
^^^^^^^^^^^^^

Default: ``'httpcache'``

The directory to use for storing the (low-level) HTTP cache. If empty, the HTTP
cache will be disabled. If a relative path is given, it is taken relative to the
project data dir. For more info see: :ref:`topics-project-structure`.

.. setting:: HTTPCACHE_IGNORE_HTTP_CODES

HTTPCACHE_IGNORE_HTTP_CODES
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Default: ``[]``

Don't cache response with these HTTP codes.

.. setting:: HTTPCACHE_IGNORE_MISSING

HTTPCACHE_IGNORE_MISSING
^^^^^^^^^^^^^^^^^^^^^^^^

Default: ``False``

If enabled, requests not found in the cache will be ignored instead of downloaded.

.. setting:: HTTPCACHE_IGNORE_SCHEMES

HTTPCACHE_IGNORE_SCHEMES
^^^^^^^^^^^^^^^^^^^^^^^^

Default: ``['file']``

Don't cache responses with these URI schemes.

.. setting:: HTTPCACHE_STORAGE

HTTPCACHE_STORAGE
^^^^^^^^^^^^^^^^^

Default: ``'scrapy.extensions.httpcache.FilesystemCacheStorage'``

The class which implements the cache storage backend.

.. setting:: HTTPCACHE_DBM_MODULE

HTTPCACHE_DBM_MODULE
^^^^^^^^^^^^^^^^^^^^

Default: ``'dbm'``

The database module to use in the :ref:`DBM storage backend
<httpcache-storage-dbm>`. This setting is specific to the DBM backend.

.. setting:: HTTPCACHE_POLICY

HTTPCACHE_POLICY
^^^^^^^^^^^^^^^^

Default: ``'scrapy.extensions.httpcache.DummyPolicy'``

The class which implements the cache policy.

.. setting:: HTTPCACHE_GZIP

HTTPCACHE_GZIP
^^^^^^^^^^^^^^

Default: ``False``

If enabled, will compress all cached data with gzip.
This setting is specific to the Filesystem backend.

.. setting:: HTTPCACHE_ALWAYS_STORE

HTTPCACHE_ALWAYS_STORE
^^^^^^^^^^^^^^^^^^^^^^

Default: ``False``

If enabled, will cache pages unconditionally.

A spider may wish to have all responses available in the cache, for
future use with ``Cache-Control: max-stale``, for instance. The
DummyPolicy caches all responses but never revalidates them, and
sometimes a more nuanced policy is desirable.

This setting still respects ``Cache-Control: no-store`` directives in responses.
If you don't want that, filter ``no-store`` out of the Cache-Control headers in
responses you feed to the cache middleware.

.. setting:: HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS

HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Default: ``[]``

List of Cache-Control directives in responses to be ignored.

Sites often set "no-store", "no-cache", "must-revalidate", etc., but get
upset at the traffic a spider can generate if it actually respects those
directives. This setting allows you to selectively ignore Cache-Control
directives that are known to be unimportant for the sites being crawled.

We assume that the spider will not issue Cache-Control directives
in requests unless it actually needs them, so directives in requests are
not filtered.

HttpCompressionMiddleware
-------------------------

.. module:: scrapy.downloadermiddlewares.httpcompression
   :synopsis: Http Compression Middleware

.. class:: HttpCompressionMiddleware

   This middleware allows compressed (gzip, deflate) traffic to be
   sent/received from web sites.

   This middleware also supports decoding `brotli-compressed`_ as well as
   `zstd-compressed`_ responses, provided that `brotli`_ or `zstandard`_ is
   installed, respectively.

.. _brotli-compressed: https://www.ietf.org/rfc/rfc7932.txt
.. _brotli: https://pypi.org/project/Brotli/
.. _zstd-compressed: https://www.ietf.org/rfc/rfc8478.txt
.. _zstandard: https://pypi.org/project/zstandard/

HttpCompressionMiddleware Settings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. setting:: COMPRESSION_ENABLED

COMPRESSION_ENABLED
^^^^^^^^^^^^^^^^^^^

Default: ``True``

Whether the Compression middleware will be enabled.

HttpProxyMiddleware
-------------------

.. module:: scrapy.downloadermiddlewares.httpproxy
   :synopsis: Http Proxy Middleware

.. reqmeta:: proxy

.. class:: HttpProxyMiddleware

   This middleware sets the HTTP proxy to use for requests, by setting the
   :reqmeta:`proxy` meta value for :class:`~scrapy.Request` objects.

   Like the Python standard library module :mod:`urllib.request`, it obeys
   the following environment variables:

   * ``http_proxy``
   * ``https_proxy``
   * ``no_proxy``

   You can also set the meta key :reqmeta:`proxy` per-request, to a value like
   ``http://some_proxy_server:port`` or ``http://username:password@some_proxy_server:port``.
   Keep in mind this value will take precedence over the ``http_proxy``/``https_proxy``
   environment variables, and it will also ignore the ``no_proxy`` environment variable.
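
   For example (the proxy address and credentials are placeholders):

   .. code-block:: python

       yield scrapy.Request(
           "https://example.com",
           meta={"proxy": "http://username:password@some_proxy_server:port"},
       )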

.. note::

    Handling of this meta key needs to be implemented inside the :ref:`download
    handler <topics-download-handlers>`, so it's not guaranteed to be supported
    by all 3rd-party handlers. It's currently unsupported by
    :class:`~scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler`.

HttpProxyMiddleware settings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. setting:: HTTPPROXY_ENABLED
.. setting:: HTTPPROXY_AUTH_ENCODING

HTTPPROXY_ENABLED
^^^^^^^^^^^^^^^^^

Default: ``True``

Whether or not to enable the :class:`HttpProxyMiddleware`.

HTTPPROXY_AUTH_ENCODING
^^^^^^^^^^^^^^^^^^^^^^^

Default: ``"latin-1"``

The default encoding for proxy authentication on :class:`HttpProxyMiddleware`.

OffsiteMiddleware
-----------------

.. module:: scrapy.downloadermiddlewares.offsite
   :synopsis: Offsite Middleware

.. class:: OffsiteMiddleware

   .. versionadded:: 2.11.2

   Filters out Requests for URLs outside the domains covered by the spider.

   This middleware filters out every request whose host names aren't in the
   spider's :attr:`~scrapy.Spider.allowed_domains` attribute.
   All subdomains of any domain in the list are also allowed.
   E.g. the rule ``www.example.org`` will also allow ``bob.www.example.org``
   but not ``www2.example.org`` nor ``example.org``.

   When your spider returns a request for a domain not belonging to those
   covered by the spider, this middleware will log a debug message similar to
   this one::

      DEBUG: Filtered offsite request to 'offsite.example': <GET http://offsite.example/some/page.html>

   To avoid filling the log with too much noise, it will only print one of
   these messages for each new domain filtered. So, for example, if another
   request for ``offsite.example`` is filtered, no log message will be
   printed. But if a request for ``other.example`` is filtered, a message
   will be printed (but only for the first request filtered).

   If the spider doesn't define an
   :attr:`~scrapy.Spider.allowed_domains` attribute, or the
   attribute is empty, the offsite middleware will allow all requests.

   .. reqmeta:: allow_offsite

   If the request has the :attr:`~scrapy.Request.dont_filter` attribute set to
   ``True`` or :attr:`Request.meta` has ``allow_offsite`` set to ``True``, then
   the OffsiteMiddleware will allow the request even if its domain is not listed
   in allowed domains.
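
   For example, to allow a single off-domain request (the URL is a
   placeholder):

   .. code-block:: python

       yield scrapy.Request("https://other.example/page", meta={"allow_offsite": True})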

RedirectMiddleware
------------------

.. module:: scrapy.downloadermiddlewares.redirect
   :synopsis: Redirection Middleware

.. class:: RedirectMiddleware

   This middleware handles redirection of requests based on response status.

.. reqmeta:: redirect_urls

The URLs that the request goes through (while being redirected) can be found
in the ``redirect_urls`` :attr:`Request.meta <scrapy.Request.meta>` key.

.. reqmeta:: redirect_reasons

The reason behind each redirect in :reqmeta:`redirect_urls` can be found in the
``redirect_reasons`` :attr:`Request.meta <scrapy.Request.meta>` key. For
example: ``[301, 302, 307, 'meta refresh']``.

The format of a reason depends on the middleware that handled the corresponding
redirect. For example, :class:`RedirectMiddleware` indicates the triggering
response status code as an integer, while :class:`MetaRefreshMiddleware`
always uses the ``'meta refresh'`` string as reason.

The :class:`RedirectMiddleware` can be configured through the following
settings (see the settings documentation for more info):

* :setting:`REDIRECT_ENABLED`
* :setting:`REDIRECT_MAX_TIMES`

.. reqmeta:: dont_redirect

If :attr:`Request.meta <scrapy.Request.meta>` has the ``dont_redirect``
key set to ``True``, the request will be ignored by this middleware.

If you want to handle some redirect status codes in your spider, you can
specify these in the ``handle_httpstatus_list`` spider attribute.

For example, if you want the redirect middleware to ignore 301 and 302
responses (and pass them through to your spider) you can do this:

.. code-block:: python

    from scrapy.spiders import CrawlSpider


    class MySpider(CrawlSpider):
        handle_httpstatus_list = [301, 302]

The ``handle_httpstatus_list`` key of :attr:`Request.meta
<scrapy.Request.meta>` can also be used to specify which response codes to
allow on a per-request basis. You can also set the meta key
``handle_httpstatus_all`` to ``True`` if you want to allow any response code
for a request.
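
For example (the URL is a placeholder):

.. code-block:: python

    yield scrapy.Request(
        "https://example.com", meta={"handle_httpstatus_list": [301, 302]}
    )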

RedirectMiddleware settings
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. setting:: REDIRECT_ENABLED

REDIRECT_ENABLED
^^^^^^^^^^^^^^^^

Default: ``True``

Whether the Redirect middleware will be enabled.

.. setting:: REDIRECT_MAX_TIMES

REDIRECT_MAX_TIMES
^^^^^^^^^^^^^^^^^^

Default: ``20``

The maximum number of redirections that will be followed for a single request.
If maximum redirections are exceeded, the request is aborted and ignored.

MetaRefreshMiddleware
---------------------

.. class:: MetaRefreshMiddleware

   This middleware handles redirection of requests based on the meta-refresh HTML tag.

The :class:`MetaRefreshMiddleware` can be configured through the following
settings (see the settings documentation for more info):

* :setting:`METAREFRESH_ENABLED`
* :setting:`METAREFRESH_IGNORE_TAGS`
* :setting:`METAREFRESH_MAXDELAY`

This middleware obeys the :setting:`REDIRECT_MAX_TIMES` setting and the :reqmeta:`dont_redirect`,
:reqmeta:`redirect_urls` and :reqmeta:`redirect_reasons` request meta keys, as described
for :class:`RedirectMiddleware`.

MetaRefreshMiddleware settings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. setting:: METAREFRESH_ENABLED

METAREFRESH_ENABLED
^^^^^^^^^^^^^^^^^^^

Default: ``True``

Whether the Meta Refresh middleware will be enabled.

.. setting:: METAREFRESH_IGNORE_TAGS

METAREFRESH_IGNORE_TAGS
^^^^^^^^^^^^^^^^^^^^^^^

Default: ``["noscript"]``

Meta tags within these tags are ignored.

.. versionchanged:: 2.11.2
   The default value of :setting:`METAREFRESH_IGNORE_TAGS` changed from
   ``[]`` to ``["noscript"]``.

.. setting:: METAREFRESH_MAXDELAY

METAREFRESH_MAXDELAY
^^^^^^^^^^^^^^^^^^^^

Default: ``100``

The maximum meta-refresh delay (in seconds) to follow the redirection.
Some sites use meta-refresh for redirecting to a session-expired page, so we
restrict automatic redirection to the maximum delay.

RetryMiddleware
---------------

.. module:: scrapy.downloadermiddlewares.retry
   :synopsis: Retry Middleware

.. class:: RetryMiddleware

   A middleware to retry failed requests that are potentially caused by
   temporary problems such as a connection timeout or HTTP 500 error.

Failed pages are collected during the scraping process and rescheduled at the
end, once the spider has finished crawling all regular (non-failed) pages.

The :class:`RetryMiddleware` can be configured through the following
settings (see the settings documentation for more info):

* :setting:`RETRY_ENABLED`
* :setting:`RETRY_TIMES`
* :setting:`RETRY_HTTP_CODES`
* :setting:`RETRY_EXCEPTIONS`

.. reqmeta:: dont_retry

If :attr:`Request.meta <scrapy.Request.meta>` has ``dont_retry`` key
set to True, the request will be ignored by this middleware.

To retry requests from a spider callback, you can use the
:func:`get_retry_request` function:

.. autofunction:: get_retry_request

RetryMiddleware Settings
~~~~~~~~~~~~~~~~~~~~~~~~

.. setting:: RETRY_ENABLED

RETRY_ENABLED
^^^^^^^^^^^^^

Default: ``True``

Whether the Retry middleware will be enabled.

.. setting:: RETRY_TIMES

RETRY_TIMES
^^^^^^^^^^^

Default: ``2``

Maximum number of times to retry, in addition to the first download.

The maximum number of retries can also be specified per request using the
:reqmeta:`max_retry_times` key of :attr:`Request.meta <scrapy.Request.meta>`.
When set, the :reqmeta:`max_retry_times` meta key takes precedence over
the :setting:`RETRY_TIMES` setting.
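
For example, to allow up to 5 retries for a given request (the URL is a
placeholder):

.. code-block:: python

    yield scrapy.Request("https://example.com", meta={"max_retry_times": 5})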

.. setting:: RETRY_HTTP_CODES

RETRY_HTTP_CODES
^^^^^^^^^^^^^^^^

Default: ``[500, 502, 503, 504, 522, 524, 408, 429]``

Which HTTP response codes to retry. Other errors (DNS lookup issues,
connections lost, etc) are always retried.

In some cases you may want to add 400 to :setting:`RETRY_HTTP_CODES` because
it is a common code used to indicate server overload. It is not included by
default because HTTP specs say so.

.. setting:: RETRY_EXCEPTIONS

RETRY_EXCEPTIONS
^^^^^^^^^^^^^^^^

Default::

    [
        'scrapy.exceptions.CannotResolveHostError',
        'scrapy.exceptions.DownloadConnectionRefusedError',
        'scrapy.exceptions.DownloadFailedError',
        'scrapy.exceptions.DownloadTimeoutError',
        'scrapy.exceptions.ResponseDataLossError',
        'twisted.internet.error.ConnectionDone',
        'twisted.internet.error.ConnectError',
        'twisted.internet.error.ConnectionLost',
        IOError,
        'scrapy.core.downloader.handlers.http11.TunnelError',
    ]

List of exceptions to retry.

Each list entry may be an exception type or its import path as a string.

An exception will not be caught when the exception type is not in
:setting:`RETRY_EXCEPTIONS` or when the maximum number of retries for a request
has been exceeded (see :setting:`RETRY_TIMES`). To learn about uncaught
exception propagation, see
:meth:`~scrapy.downloadermiddlewares.DownloaderMiddleware.process_exception`.

.. setting:: RETRY_PRIORITY_ADJUST

RETRY_PRIORITY_ADJUST
^^^^^^^^^^^^^^^^^^^^^

Default: ``-1``

Adjust retry request priority relative to original request:

- a positive priority adjust means higher priority.
- **a negative priority adjust (default) means lower priority.**

.. _topics-dlmw-robots:

RobotsTxtMiddleware
-------------------

.. module:: scrapy.downloadermiddlewares.robotstxt
   :synopsis: robots.txt middleware

.. class:: RobotsTxtMiddleware

    This middleware filters out requests forbidden by the robots.txt exclusion
    standard.

    To make sure Scrapy respects robots.txt, ensure the middleware is enabled
    and the :setting:`ROBOTSTXT_OBEY` setting is enabled.

    The :setting:`ROBOTSTXT_USER_AGENT` setting can be used to specify the
    user agent string to use for matching in the robots.txt_ file. If it
    is ``None``, the User-Agent header you are sending with the request or the
    :setting:`USER_AGENT` setting (in that order) will be used for determining
    the user agent to use in the robots.txt_ file.

    This middleware has to be combined with a robots.txt_ parser.

    Scrapy ships with support for the following robots.txt_ parsers:

    * :ref:`Protego <protego-parser>` (default)
    * :ref:`RobotFileParser <python-robotfileparser>`
    * :ref:`Robotexclusionrulesparser <rerp-parser>`

    You can change the robots.txt_ parser with the :setting:`ROBOTSTXT_PARSER`
    setting. Or you can also :ref:`implement support for a new parser <support-for-new-robots-parser>`.

.. reqmeta:: dont_obey_robotstxt

If :attr:`Request.meta <scrapy.Request.meta>` has the
``dont_obey_robotstxt`` key set to ``True``,
the request will be ignored by this middleware even if
:setting:`ROBOTSTXT_OBEY` is enabled.

Parsers vary in several aspects:

* Language of implementation

* Supported specification

* Support for wildcard matching

* Usage of `length based rule <https://developers.google.com/crawling/docs/robots-txt/robots-txt-spec#order-of-precedence-for-rules>`_:
  in particular for ``Allow`` and ``Disallow`` directives, where the most
  specific rule based on the length of the path trumps the less specific
  (shorter) rule

Performance comparison of different parsers is available at `the following link
<https://github.com/scrapy/scrapy/issues/3969>`_.

.. _protego-parser:

Protego parser
~~~~~~~~~~~~~~

Based on `Protego <https://github.com/scrapy/protego>`_:

* implemented in Python

* is compliant with `Google's Robots.txt Specification
  <https://developers.google.com/crawling/docs/robots-txt/robots-txt-spec>`_

* supports wildcard matching

* uses the length based rule

Scrapy uses this parser by default.

.. _python-robotfileparser:

RobotFileParser
~~~~~~~~~~~~~~~

Based on :class:`~urllib.robotparser.RobotFileParser`:

* is Python's built-in robots.txt_ parser

* is compliant with `Martijn Koster's 1996 draft specification
  <https://www.robotstxt.org/norobots-rfc.txt>`_

* lacks support for wildcard matching

* doesn't use the length based rule

It is faster than Protego and backward-compatible with versions of Scrapy before 1.8.0.

In order to use this parser, set:

* :setting:`ROBOTSTXT_PARSER` to ``scrapy.robotstxt.PythonRobotParser``

.. _rerp-parser:

Robotexclusionrulesparser
~~~~~~~~~~~~~~~~~~~~~~~~~

Based on `Robotexclusionrulesparser <https://pypi.org/project/robotexclusionrulesparser/>`_:

* implemented in Python

* is compliant with `Martijn Koster's 1996 draft specification
  <https://www.robotstxt.org/norobots-rfc.txt>`_

* supports wildcard matching

* doesn't use the length based rule

In order to use this parser:

* Install ``Robotexclusionrulesparser`` by running
  ``pip install robotexclusionrulesparser``

* Set :setting:`ROBOTSTXT_PARSER` setting to
  ``scrapy.robotstxt.RerpRobotParser``

.. _support-for-new-robots-parser:

Implementing support for a new parser
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can implement support for a new robots.txt_ parser by subclassing
the abstract base class :class:`~scrapy.robotstxt.RobotParser` and
implementing the methods described below.

.. module:: scrapy.robotstxt
   :synopsis: robots.txt parser interface and implementations

.. autoclass:: RobotParser
   :members:

.. _robots.txt: https://www.robotstxt.org/

DownloaderStats
---------------

.. module:: scrapy.downloadermiddlewares.stats
   :synopsis: Downloader Stats Middleware

.. class:: DownloaderStats

   Middleware that stores stats of all requests, responses and exceptions that
   pass through it.

   To use this middleware you must enable the :setting:`DOWNLOADER_STATS`
   setting.

UserAgentMiddleware
-------------------

.. module:: scrapy.downloadermiddlewares.useragent
   :synopsis: User Agent Middleware

.. class:: UserAgentMiddleware

   Middleware that sets the ``User-Agent`` header.

   The header value is taken from the :setting:`USER_AGENT` setting.

.. _DBM: https://en.wikipedia.org/wiki/Dbm


.. _topics-dynamic-content:

====================================
Selecting dynamically-loaded content
====================================

Some webpages show the desired data when you load them in a web browser.
However, when you download them using Scrapy, you cannot reach the desired data
using :ref:`selectors <topics-selectors>`.

When this happens, the recommended approach is to
:ref:`find the data source <topics-finding-data-source>` and extract the data
from it.

If you fail to do that, and you can nonetheless access the desired data through
the :ref:`DOM <topics-livedom>` from your web browser, see
:ref:`topics-headless-browsing`.

.. _topics-finding-data-source:

Finding the data source
=======================

To extract the desired data, you must first find its source location.

If the data is in a non-text-based format, such as an image or a PDF document,
use the :ref:`network tool <topics-network-tool>` of your web browser to find
the corresponding request, and :ref:`reproduce it
<topics-reproducing-requests>`.

If your web browser lets you select the desired data as text, the data may be
defined in embedded JavaScript code, or loaded from an external resource in a
text-based format.

In that case, you can use a tool like wgrep_ to find the URL of that resource.

If the data turns out to come from the original URL itself, you must
:ref:`inspect the source code of the webpage <topics-inspecting-source>` to
determine where the data is located.

If the data comes from a different URL, you will need to :ref:`reproduce the
corresponding request <topics-reproducing-requests>`.

.. _topics-inspecting-source:

Inspecting the source code of a webpage
=======================================

Sometimes you need to inspect the source code of a webpage (not the
:ref:`DOM <topics-livedom>`) to determine where some desired data is located.

Use Scrapy’s :command:`fetch` command to download the webpage contents as seen
by Scrapy::

    scrapy fetch --nolog https://example.com > response.html

If the desired data is in embedded JavaScript code within a ``<script/>``
element, see :ref:`topics-parsing-javascript`.

If you cannot find the desired data, first make sure it’s not just Scrapy:
download the webpage with an HTTP client like curl_ or wget_ and see if the
information can be found in the response they get.

If they get a response with the desired data, modify your Scrapy
:class:`~scrapy.Request` to match that of the other HTTP client. For
example, try using the same user-agent string (:setting:`USER_AGENT`) or the
same :attr:`~scrapy.Request.headers`.
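
For example, a minimal sketch that copies the user-agent string of another
HTTP client into a Scrapy request (the URL and header value are
placeholders):

.. code-block:: python

    import scrapy

    class MatchingSpider(scrapy.Spider):
        name = "matching"

        def start_requests(self):
            yield scrapy.Request(
                "https://example.com",
                # Use the same User-Agent header as the working curl request.
                headers={"User-Agent": "curl/8.5.0"},
            )

        def parse(self, response):
            self.logger.info("First bytes: %r", response.body[:80])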

If they also get a response without the desired data, you’ll need to take
steps to make your request more similar to that of the web browser. See
:ref:`topics-reproducing-requests`.

.. _topics-reproducing-requests:

Reproducing requests
====================

Sometimes we need to reproduce a request the way our web browser performs it.

Use the :ref:`network tool <topics-network-tool>` of your web browser to see
how your web browser performs the desired request, and try to reproduce that
request with Scrapy.

It might be enough to yield a :class:`~scrapy.Request` with the same HTTP
method and URL. However, you may also need to reproduce the body, headers and
form parameters (see :class:`~scrapy.FormRequest`) of that request.

Since all major browsers allow exporting requests in curl_ format, Scrapy
provides the :meth:`~scrapy.Request.from_curl` method to generate an
equivalent :class:`~scrapy.Request` from a cURL command. For more
information, see :ref:`request from curl <requests-from-curl>` in the
network tool section.
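
For example, a cURL command copied from the browser's network tool can be
turned into a request as follows (the command itself is a placeholder):

.. code-block:: python

    from scrapy import Request

    request = Request.from_curl(
        "curl 'https://example.org/api/items' -H 'Accept: application/json'"
    )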

Once you get the expected response, you can :ref:`extract the desired data from
it <topics-handling-response-formats>`.

You can reproduce any request with Scrapy. However, sometimes reproducing all
necessary requests may not be an efficient use of developer time. If that is
your case, and crawling speed is not a major concern for you, you can
alternatively consider :ref:`using a headless browser
<topics-headless-browsing>`.

If you get the expected response `sometimes`, but not always, the issue is
probably not your request, but the target server. The target server might be
buggy, overloaded, or :ref:`banning <bans>` some of your requests.

Note that to translate a cURL command into a Scrapy request,
you may use `curl2scrapy <https://michael-shub.github.io/curl2scrapy/>`_.

.. _topics-handling-response-formats:

Handling different response formats
===================================

.. skip: start

Once you have a response with the desired data, how you extract the desired
data from it depends on the type of response:

-   If the response is HTML, XML or JSON, use :ref:`selectors
    <topics-selectors>` as usual.

-   If the response is JSON, use :func:`response.json()
    <scrapy.http.TextResponse.json>` to load the desired data:

    .. code-block:: python

        data = response.json()

    If the desired data is inside HTML or XML code embedded within JSON data,
    you can load that HTML or XML code into a
    :class:`~scrapy.Selector` and then
    :ref:`use it <topics-selectors>` as usual:

    .. code-block:: python

        selector = Selector(text=data["html"])

-   If the response is JavaScript, or HTML with a ``<script/>`` element
    containing the desired data, see :ref:`topics-parsing-javascript`.

-   If the response is CSS, use a :doc:`regular expression <library/re>` to
    extract the desired data from
    :attr:`response.text <scrapy.http.TextResponse.text>`.

.. _topics-parsing-images:

-   If the response is an image or another format based on images (e.g. PDF),
    read the response as bytes from
    :attr:`response.body <scrapy.http.Response.body>` and use an OCR
    solution to extract the desired data as text.

    For example, you can use pytesseract_. To read a table from a PDF,
    `tabula-py`_ may be a better choice.

-   If the response is SVG, or HTML with embedded SVG containing the desired
    data, you may be able to extract the desired data using
    :ref:`selectors <topics-selectors>`, since SVG is based on XML.

    Otherwise, you might need to convert the SVG code into a raster image, and
    :ref:`handle that raster image <topics-parsing-images>`.

.. skip: end

.. _topics-parsing-javascript:

Parsing JavaScript code
=======================

.. skip: start

If the desired data is hardcoded in JavaScript, you first need to get the
JavaScript code:

-   If the JavaScript code is in a JavaScript file, simply read
    :attr:`response.text <scrapy.http.TextResponse.text>`.

-   If the JavaScript code is within a ``<script/>`` element of an HTML page,
    use :ref:`selectors <topics-selectors>` to extract the text within that
    ``<script/>`` element.

Once you have a string with the JavaScript code, you can extract the desired
data from it:

-   You might be able to use a :doc:`regular expression <library/re>` to
    extract the desired data in JSON format, which you can then parse with
    :func:`json.loads`.

    For example, if the JavaScript code contains a separate line like
    ``var data = {"field": "value"};`` you can extract that data as follows:

    .. code-block:: pycon

        >>> import json
        >>> pattern = r"\bvar\s+data\s*=\s*(\{.*?\})\s*;\s*\n"
        >>> json_data = response.css("script::text").re_first(pattern)
        >>> json.loads(json_data)
        {'field': 'value'}

-   chompjs_ provides an API to parse JavaScript objects into a :class:`dict`.

    For example, if the JavaScript code contains
    ``var data = {field: "value", secondField: "second value"};``
    you can extract that data as follows:

    .. code-block:: pycon

        >>> import chompjs
        >>> javascript = response.css("script::text").get()
        >>> data = chompjs.parse_js_object(javascript)
        >>> data
        {'field': 'value', 'secondField': 'second value'}

-   Otherwise, use js2xml_ to convert the JavaScript code into an XML document
    that you can parse using :ref:`selectors <topics-selectors>`.

    For example, if the JavaScript code contains
    ``var data = {field: "value"};`` you can extract that data as follows:

    .. code-block:: pycon

        >>> import js2xml
        >>> import lxml.etree
        >>> from parsel import Selector
        >>> javascript = response.css("script::text").get()
        >>> xml = lxml.etree.tostring(js2xml.parse(javascript), encoding="unicode")
        >>> selector = Selector(text=xml)
        >>> selector.css('var[name="data"]').get()
        '<var name="data"><object><property name="field"><string>value</string></property></object></var>'

.. skip: end

.. _topics-headless-browsing:

Using a headless browser
========================

On webpages that fetch data from additional requests, reproducing those
requests that contain the desired data is the preferred approach. The effort is
often worth the result: structured, complete data with minimum parsing time and
network transfer.

However, sometimes it can be really hard to reproduce certain requests. Or you
may need something that no request can give you, such as a screenshot of a
webpage as seen in a web browser. In this case using a `headless browser`_ will
help.

A headless browser is a special web browser that provides an API for
automation. By installing the :ref:`asyncio reactor <install-asyncio>`,
it is possible to integrate ``asyncio``-based libraries which handle headless browsers.

One such library is `playwright-python`_ (an official Python port of `playwright`_).
The following is a simple snippet to illustrate its usage within a Scrapy spider:

.. skip: next
.. code-block:: python

    import scrapy
    from playwright.async_api import async_playwright

    class PlaywrightSpider(scrapy.Spider):
        name = "playwright"
        start_urls = ["data:,"]  # avoid using the default Scrapy downloader

        async def parse(self, response):
            async with async_playwright() as pw:
                browser = await pw.chromium.launch()
                page = await browser.new_page()
                await page.goto("https://example.org")
                title = await page.title()
                return {"title": title}

However, using `playwright-python`_ directly as in the above example
circumvents most of the Scrapy components (middlewares, dupefilter, etc).
We recommend using `scrapy-playwright`_ for a better integration.

.. _CSS: https://en.wikipedia.org/wiki/Cascading_Style_Sheets
.. _chompjs: https://github.com/Nykakin/chompjs
.. _curl: https://curl.se/
.. _headless browser: https://en.wikipedia.org/wiki/Headless_browser
.. _js2xml: https://github.com/scrapinghub/js2xml
.. _playwright-python: https://github.com/microsoft/playwright-python
.. _playwright: https://github.com/microsoft/playwright
.. _pytesseract: https://github.com/madmaze/pytesseract
.. _scrapy-playwright: https://github.com/scrapy-plugins/scrapy-playwright
.. _tabula-py: https://github.com/chezou/tabula-py
.. _wget: https://www.gnu.org/software/wget/
.. _wgrep: https://github.com/stav/wgrep


.. _topics-exceptions:

==========
Exceptions
==========

.. module:: scrapy.exceptions
   :synopsis: Scrapy exceptions

.. _topics-exceptions-ref:

Built-in Exceptions reference
=============================

Here's a list of all exceptions included in Scrapy and their usage.

CloseSpider
-----------

.. exception:: CloseSpider(reason='cancelled')

    This exception can be raised from a spider callback to request the spider to be
    closed/stopped. Supported arguments:

    :param reason: the reason for closing
    :type reason: str

For example:

.. code-block:: python

    def parse_page(self, response):
        if b"Bandwidth exceeded" in response.body:
            raise CloseSpider("bandwidth_exceeded")

DontCloseSpider
---------------

.. exception:: DontCloseSpider

This exception can be raised in a :signal:`spider_idle` signal handler to
prevent the spider from being closed.

DropItem
--------

.. exception:: DropItem

The exception that must be raised by item pipeline stages to stop processing an
Item. For more information see :ref:`topics-item-pipeline`.
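
For example, a minimal sketch of a pipeline that drops items missing a
hypothetical ``price`` field:

.. code-block:: python

    from itemadapter import ItemAdapter
    from scrapy.exceptions import DropItem

    class RequirePricePipeline:
        def process_item(self, item, spider):
            adapter = ItemAdapter(item)
            if not adapter.get("price"):
                raise DropItem(f"Missing price in {item!r}")
            return item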

IgnoreRequest
-------------

.. exception:: IgnoreRequest

This exception can be raised by the Scheduler or any downloader middleware to
indicate that the request should be ignored.

NotConfigured
-------------

.. exception:: NotConfigured

This exception can be raised by some components to indicate that they will
remain disabled. Those components include:

-   Extensions
-   Item pipelines
-   Downloader middlewares
-   Spider middlewares

The exception must be raised in the component's ``__init__`` method.
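
For example, a minimal sketch of an extension that disables itself unless a
hypothetical ``MYEXT_ENABLED`` setting is true:

.. code-block:: python

    from scrapy.exceptions import NotConfigured

    class MyExtension:
        def __init__(self, settings):
            if not settings.getbool("MYEXT_ENABLED"):
                raise NotConfigured

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.settings)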

NotSupported
------------

.. exception:: NotSupported

This exception is raised to indicate an unsupported feature.

StopDownload
-------------

.. exception:: StopDownload(fail=True)

Raised from a :class:`~scrapy.signals.bytes_received` or :class:`~scrapy.signals.headers_received`
signal handler to indicate that no further bytes should be downloaded for a response.

The ``fail`` boolean parameter controls which method will handle the resulting
response:

* If ``fail=True`` (default), the request errback is called. The response object is
  available as the ``response`` attribute of the ``StopDownload`` exception,
  which is in turn stored as the ``value`` attribute of the received
  :class:`~twisted.python.failure.Failure` object. This means that in an errback
  defined as ``def errback(self, failure)``, the response can be accessed through
  ``failure.value.response``.

* If ``fail=False``, the request callback is called instead.

In both cases, the response could have its body truncated: the body contains
all bytes received up until the exception is raised, including the bytes
received in the signal handler that raises the exception. Also, the response
object is marked with ``"download_stopped"`` in its :attr:`~scrapy.http.Response.flags`
attribute.

.. note:: ``fail`` is a keyword-only parameter, i.e. raising
    ``StopDownload(False)`` or ``StopDownload(True)`` will raise
    a :class:`TypeError`.

See the documentation for the :class:`~scrapy.signals.bytes_received` and
:class:`~scrapy.signals.headers_received` signals
and the :ref:`topics-stop-response-download` topic for additional information and examples.
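
For example, a minimal sketch along the lines of the examples referenced
above, stopping the download after the first received chunk (the spider name
and URL are placeholders):

.. code-block:: python

    import scrapy
    from scrapy import signals
    from scrapy.exceptions import StopDownload

    class HeadersOnlySpider(scrapy.Spider):
        name = "headers_only"
        start_urls = ["https://example.com"]

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            crawler.signals.connect(
                spider.on_bytes_received, signal=signals.bytes_received
            )
            return spider

        def on_bytes_received(self, data, request, spider):
            # fail=False routes the truncated response to the callback.
            raise StopDownload(fail=False)

        def parse(self, response):
            yield {"flags": response.flags, "body_length": len(response.body)}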


.. _topics-exporters:

==============
Item Exporters
==============

.. module:: scrapy.exporters
   :synopsis: Item Exporters

Once you have scraped your items, you often want to persist or export those
items, to use the data in some other application. That is, after all, the whole
purpose of the scraping process.

For this purpose Scrapy provides a collection of Item Exporters for different
output formats, such as XML, CSV or JSON.

Using Item Exporters
====================

If you are in a hurry, and just want to use an Item Exporter to output scraped
data see the :ref:`topics-feed-exports`. Otherwise, if you want to know how
Item Exporters work or need more custom functionality (not covered by the
default exports), continue reading below.

To use an Item Exporter, you must instantiate it with its required
arguments. Each Item Exporter requires different arguments, so check each
exporter's documentation in :ref:`topics-exporters-reference`. After you have
instantiated your exporter, you have to:

1. call the method :meth:`~BaseItemExporter.start_exporting` in order to
   signal the beginning of the exporting process

2. call the :meth:`~BaseItemExporter.export_item` method for each item you want
   to export

3. and finally call the :meth:`~BaseItemExporter.finish_exporting` to signal
   the end of the exporting process
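
For example, a minimal sketch exporting the two items shown later in this
page to an in-memory buffer:

.. code-block:: python

    from io import BytesIO
    from scrapy.exporters import JsonItemExporter

    buffer = BytesIO()
    exporter = JsonItemExporter(buffer)
    exporter.start_exporting()
    exporter.export_item({"name": "Color TV", "price": "1200"})
    exporter.export_item({"name": "DVD player", "price": "200"})
    exporter.finish_exporting()
    print(buffer.getvalue().decode("utf-8"))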

Here you can see an :doc:`Item Pipeline <item-pipeline>` which uses multiple
Item Exporters to group scraped items to different files according to the
value of one of their fields:

.. code-block:: python

    from itemadapter import ItemAdapter
    from scrapy.exporters import XmlItemExporter

    class PerYearXmlExportPipeline:
        """Distribute items across multiple XML files according to their 'year' field"""

        def open_spider(self, spider):
            self.year_to_exporter = {}

        def close_spider(self, spider):
            for exporter, xml_file in self.year_to_exporter.values():
                exporter.finish_exporting()
                xml_file.close()

        def _exporter_for_item(self, item):
            adapter = ItemAdapter(item)
            year = adapter["year"]
            if year not in self.year_to_exporter:
                xml_file = open(f"{year}.xml", "wb")
                exporter = XmlItemExporter(xml_file)
                exporter.start_exporting()
                self.year_to_exporter[year] = (exporter, xml_file)
            return self.year_to_exporter[year][0]

        def process_item(self, item, spider):
            exporter = self._exporter_for_item(item)
            exporter.export_item(item)
            return item

.. _topics-exporters-field-serialization:

Serialization of item fields
============================

By default, the field values are passed unmodified to the underlying
serialization library, and the decision of how to serialize them is delegated
to each particular serialization library.

However, you can customize how each field value is serialized *before it is
passed to the serialization library*.

There are two ways to customize how a field will be serialized, which are
described next.

.. _topics-exporters-serializers:

1. Declaring a serializer in the field
--------------------------------------

If you use :class:`~scrapy.Item` you can declare a serializer in the
:ref:`field metadata <topics-items-fields>`. The serializer must be
a callable which receives a value and returns its serialized form.

Example:

.. code-block:: python

    import scrapy

    def serialize_price(value):
        return f"$ {str(value)}"

    class Product(scrapy.Item):
        name = scrapy.Field()
        price = scrapy.Field(serializer=serialize_price)

2. Overriding the serialize_field() method
------------------------------------------

You can also override the :meth:`~BaseItemExporter.serialize_field` method to
customize how your field value will be exported.

Make sure you call the base class :meth:`~BaseItemExporter.serialize_field` method
after your custom code.

Example:

.. code-block:: python

      from scrapy.exporters import XmlItemExporter

      class ProductXmlExporter(XmlItemExporter):
          def serialize_field(self, field, name, value):
              if name == "price":
                  return f"$ {str(value)}"
              return super().serialize_field(field, name, value)

.. _topics-exporters-reference:

Built-in Item Exporters reference
=================================

Here is a list of the Item Exporters bundled with Scrapy. Some of them contain
output examples, which assume you're exporting these two items:

.. skip: next
.. code-block:: python

    Item(name="Color TV", price="1200")
    Item(name="DVD player", price="200")

BaseItemExporter
----------------

.. class:: BaseItemExporter(fields_to_export=None, export_empty_fields=False, encoding='utf-8', indent=0, dont_fail=False)

   This is the (abstract) base class for all Item Exporters. It provides
   support for common features used by all (concrete) Item Exporters, such as
   defining what fields to export, whether to export empty fields, or which
   encoding to use.

   These features can be configured through the ``__init__`` method arguments which
   populate their respective instance attributes: :attr:`fields_to_export`,
   :attr:`export_empty_fields`, :attr:`encoding`, :attr:`indent`.

   .. method:: export_item(item)

      Exports the given item. This method must be implemented in subclasses.

   .. method:: serialize_field(field, name, value)

      Return the serialized value for the given field. You can override this
      method (in your custom Item Exporters) if you want to control how a
      particular field or value will be serialized/exported.

      By default, this method looks for a serializer :ref:`declared in the item
      field <topics-exporters-serializers>` and returns the result of applying
      that serializer to the value. If no serializer is found, it returns the
      value unchanged.

      :param field: the field being serialized. If the source :ref:`item object
          <item-types>` does not define field metadata, *field* is an empty
          :class:`dict`.
      :type field: :class:`~scrapy.Field` object or a :class:`dict` instance

      :param name: the name of the field being serialized
      :type name: str

      :param value: the value being serialized

   .. method:: start_exporting()

      Signal the beginning of the exporting process. Some exporters may use
      this to generate some required header (for example, the
      :class:`XmlItemExporter`). You must call this method before exporting any
      items.

   .. method:: finish_exporting()

      Signal the end of the exporting process. Some exporters may use this to
      generate some required footer (for example, the
      :class:`XmlItemExporter`). You must always call this method after you
      have no more items to export.

   .. attribute:: fields_to_export

      Fields to export, their order [1]_ and their output names.

      Possible values are:

      -   ``None`` (all fields [2]_, default)

      -   A list of fields::

              ['field1', 'field2']

      -   A dict where keys are fields and values are output names::

              {'field1': 'Field 1', 'field2': 'Field 2'}

      .. [1] Not all exporters respect the specified field order.
      .. [2] When using :ref:`item objects <item-types>` that do not expose
             all their possible fields, exporters that do not support exporting
             a different subset of fields per item will only export the fields
             found in the first item exported.

   .. attribute:: export_empty_fields

      Whether to include empty/unpopulated item fields in the exported data.
      Defaults to ``False``. Some exporters (like :class:`CsvItemExporter`)
      ignore this attribute and always export all empty fields.

      This option is ignored for dict items.

   .. attribute:: encoding

      The output character encoding.

   .. attribute:: indent

      Amount of spaces used to indent the output on each level. Defaults to ``0``.

      * ``indent=None`` selects the most compact representation,
        all items in the same line with no indentation
      * ``indent<=0`` each item on its own line, no indentation
      * ``indent>0`` each item on its own line, indented with the provided numeric value

PythonItemExporter
------------------

.. autoclass:: PythonItemExporter

.. highlight:: none

XmlItemExporter
---------------

.. class:: XmlItemExporter(file, item_element='item', root_element='items', **kwargs)

   Exports items in XML format to the specified file object.

   :param file: the file-like object to use for exporting the data. Its ``write`` method should
                accept ``bytes`` (a disk file opened in binary mode, a ``io.BytesIO`` object, etc)

   :param root_element: The name of root element in the exported XML.
   :type root_element: str

   :param item_element: The name of each item element in the exported XML.
   :type item_element: str

   The additional keyword arguments of this ``__init__`` method are passed to the
   :class:`BaseItemExporter` ``__init__`` method.

   A typical output of this exporter would be::

       <?xml version="1.0" encoding="utf-8"?>
       <items>
         <item>
           <name>Color TV</name>
           <price>1200</price>
         </item>
         <item>
           <name>DVD player</name>
           <price>200</price>
         </item>
       </items>

   Unless overridden in the :meth:`serialize_field` method, multi-valued fields are
   exported by serializing each value inside a ``<value>`` element. This is for
   convenience, as multi-valued fields are very common.

   For example, the item::

        Item(name=['John', 'Doe'], age='23')

   Would be serialized as::

       <?xml version="1.0" encoding="utf-8"?>
       <items>
         <item>
           <name>
             <value>John</value>
             <value>Doe</value>
           </name>
           <age>23</age>
         </item>
       </items>

CsvItemExporter
---------------

.. class:: CsvItemExporter(file, include_headers_line=True, join_multivalued=',', errors=None, **kwargs)

   Exports items in CSV format to the given file-like object. If the
   :attr:`fields_to_export` attribute is set, it will be used to define the
   CSV columns, their order and their column names. The
   :attr:`export_empty_fields` attribute has no effect on this exporter.

   :param file: the file-like object to use for exporting the data. Its ``write`` method should
                accept ``bytes`` (a disk file opened in binary mode, a ``io.BytesIO`` object, etc)

   :param include_headers_line: If enabled, makes the exporter output a header
      line with the field names taken from
      :attr:`BaseItemExporter.fields_to_export` or the first exported item fields.
   :type include_headers_line: bool

   :param join_multivalued: The char (or chars) that will be used for joining
      multi-valued fields, if found.
   :type join_multivalued: str

   :param errors: The optional string that specifies how encoding and decoding
      errors are to be handled. For more information see
      :class:`io.TextIOWrapper`.
   :type errors: str

   The additional keyword arguments of this ``__init__`` method are passed to the
   :class:`BaseItemExporter` ``__init__`` method, and the leftover arguments to the
   :func:`csv.writer` function, so you can use any :func:`csv.writer` function
   argument to customize this exporter.

   A typical output of this exporter would be::

      name,price
      Color TV,1200
      DVD player,200

PickleItemExporter
------------------

.. class:: PickleItemExporter(file, protocol=0, **kwargs)

   Exports items in pickle format to the given file-like object.

   :param file: the file-like object to use for exporting the data. Its ``write`` method should
                accept ``bytes`` (a disk file opened in binary mode, a ``io.BytesIO`` object, etc)

   :param protocol: The pickle protocol to use.
   :type protocol: int

   For more information, see :mod:`pickle`.

   The additional keyword arguments of this ``__init__`` method are passed to the
   :class:`BaseItemExporter` ``__init__`` method.

   Pickle isn't a human-readable format, so no output examples are provided.

PprintItemExporter
------------------

.. class:: PprintItemExporter(file, **kwargs)

   Exports items in pretty print format to the specified file object.

   :param file: the file-like object to use for exporting the data. Its ``write`` method should
                accept ``bytes`` (a disk file opened in binary mode, a ``io.BytesIO`` object, etc)

   The additional keyword arguments of this ``__init__`` method are passed to the
   :class:`BaseItemExporter` ``__init__`` method.

   A typical output of this exporter would be::

        {'name': 'Color TV', 'price': '1200'}
        {'name': 'DVD player', 'price': '200'}

   Longer lines (when present) are pretty-formatted.

JsonItemExporter
----------------

.. class:: JsonItemExporter(file, **kwargs)

   Exports items in JSON format to the specified file-like object, writing all
   objects as a list of objects. The additional ``__init__`` method arguments are
   passed to the :class:`BaseItemExporter` ``__init__`` method, and the leftover
   arguments to the :class:`~json.JSONEncoder` ``__init__`` method, so you can use any
   :class:`~json.JSONEncoder` ``__init__`` method argument to customize this exporter.

   :param file: the file-like object to use for exporting the data. Its ``write`` method should
                accept ``bytes`` (a disk file opened in binary mode, a ``io.BytesIO`` object, etc)

   A typical output of this exporter would be::

        [{"name": "Color TV", "price": "1200"},
        {"name": "DVD player", "price": "200"}]

   .. _json-with-large-data:

   .. warning:: JSON is a very simple and flexible serialization format, but
      it doesn't scale well for large amounts of data, since incremental
      (a.k.a. stream-mode) parsing is not well supported (if at all) among
      JSON parsers (in any language), and most of them just parse the entire
      object in memory. If you want the power and simplicity of JSON with a
      more stream-friendly format, consider using
      :class:`JsonLinesItemExporter` instead, or splitting the output in
      multiple chunks.

JsonLinesItemExporter
---------------------

.. class:: JsonLinesItemExporter(file, **kwargs)

   Exports items in JSON format to the specified file-like object, writing one
   JSON-encoded item per line. The additional ``__init__`` method arguments are passed
   to the :class:`BaseItemExporter` ``__init__`` method, and the leftover arguments to
   the :class:`~json.JSONEncoder` ``__init__`` method, so you can use any
   :class:`~json.JSONEncoder` ``__init__`` method argument to customize this exporter.

   :param file: the file-like object to use for exporting the data. Its ``write`` method should
                accept ``bytes`` (a disk file opened in binary mode, a ``io.BytesIO`` object, etc)

   A typical output of this exporter would be::

        {"name": "Color TV", "price": "1200"}
        {"name": "DVD player", "price": "200"}

   Unlike the one produced by :class:`JsonItemExporter`, the format produced by
   this exporter is well suited for serializing large amounts of data.

MarshalItemExporter
-------------------

.. autoclass:: MarshalItemExporter


.. _topics-extensions:

==========
Extensions
==========

Extensions are :ref:`components <topics-components>` that allow inserting your
own custom functionality into Scrapy.

Unlike other components, extensions do not have a specific role in Scrapy. They
are “wildcard” components that can be used for anything that does not fit the
role of any other type of component.

Loading and activating extensions
=================================

Extensions are loaded at startup by creating a single instance of the extension
class per spider being run.

To enable an extension, add it to the :setting:`EXTENSIONS` setting. For
example:

.. code-block:: python

    EXTENSIONS = {
        "scrapy.extensions.corestats.CoreStats": 500,
        "scrapy.extensions.telnet.TelnetConsole": 500,
    }

:setting:`EXTENSIONS` is merged with :setting:`EXTENSIONS_BASE` (not meant to
be overridden), and the priorities in the resulting value determine the
*loading* order.

As extensions typically do not depend on each other, their loading order is
irrelevant in most cases. This is why the :setting:`EXTENSIONS_BASE` setting
defines all extensions with the same order (``0``). However, you may need to
carefully use priorities if you add an extension that depends on other
extensions being already loaded.

Writing your own extension
==========================

Each extension is a :ref:`component <topics-components>`.

Typically, extensions connect to :ref:`signals <topics-signals>` and perform
tasks triggered by them.

Sample extension
----------------

Here we will implement a simple extension to illustrate the concepts described
in the previous section. This extension will log a message every time:

* a spider is opened
* a spider is closed
* a specific number of items are scraped

The extension will be enabled through the ``MYEXT_ENABLED`` setting and the
number of items will be specified through the ``MYEXT_ITEMCOUNT`` setting.

Here is the code of such an extension:

.. code-block:: python

    import logging
    from scrapy import signals
    from scrapy.exceptions import NotConfigured

    logger = logging.getLogger(__name__)

    class SpiderOpenCloseLogging:
        def __init__(self, item_count):
            self.item_count = item_count
            self.items_scraped = 0

        @classmethod
        def from_crawler(cls, crawler):
            # first check if the extension should be enabled and raise
            # NotConfigured otherwise
            if not crawler.settings.getbool("MYEXT_ENABLED"):
                raise NotConfigured

            # get the number of items from settings
            item_count = crawler.settings.getint("MYEXT_ITEMCOUNT", 1000)

            # instantiate the extension object
            ext = cls(item_count)

            # connect the extension object to signals
            crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
            crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
            crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)

            # return the extension object
            return ext

        def spider_opened(self, spider):
            logger.info("opened spider %s", spider.name)

        def spider_closed(self, spider):
            logger.info("closed spider %s", spider.name)

        def item_scraped(self, item, spider):
            self.items_scraped += 1
            if self.items_scraped % self.item_count == 0:
                logger.info("scraped %d items", self.items_scraped)

.. _topics-extensions-ref:

Built-in extensions reference
=============================

General purpose extensions
--------------------------

Log Stats extension
~~~~~~~~~~~~~~~~~~~

.. module:: scrapy.extensions.logstats
   :synopsis: Basic stats logging

.. class:: LogStats

Log basic stats like crawled pages and scraped items.

Core Stats extension
~~~~~~~~~~~~~~~~~~~~

.. module:: scrapy.extensions.corestats
   :synopsis: Core stats collection

.. class:: CoreStats

Enable the collection of core statistics, provided the stats collection is
enabled (see :ref:`topics-stats`).

Log Count extension
~~~~~~~~~~~~~~~~~~~

.. module:: scrapy.extensions.logcount
   :synopsis: Basic stats logging

.. autoclass:: LogCount

.. _topics-extensions-ref-telnetconsole:

Telnet console extension
~~~~~~~~~~~~~~~~~~~~~~~~

.. module:: scrapy.extensions.telnet
   :synopsis: Telnet console

.. class:: TelnetConsole

Provides a telnet console for getting into a Python interpreter inside the
currently running Scrapy process, which can be very useful for debugging.

The telnet console must be enabled by the :setting:`TELNETCONSOLE_ENABLED`
setting, and the server will listen in the port specified in
:setting:`TELNETCONSOLE_PORT`.

.. _topics-extensions-ref-memusage:

Memory usage extension
~~~~~~~~~~~~~~~~~~~~~~

.. module:: scrapy.extensions.memusage
   :synopsis: Memory usage extension

.. class:: MemoryUsage

.. note:: This extension does not work on Windows.

Monitors the memory used by the Scrapy process that runs the spider and:

1. sends a :signal:`memusage_warning_reached` signal when it exceeds
   :setting:`MEMUSAGE_WARNING_MB`
2. closes the spider with the ``memusage_exceeded`` reason when it exceeds
   :setting:`MEMUSAGE_LIMIT_MB`

This extension is enabled by the :setting:`MEMUSAGE_ENABLED` setting and
can be configured with the following settings:

* :setting:`MEMUSAGE_LIMIT_MB`
* :setting:`MEMUSAGE_WARNING_MB`
* :setting:`MEMUSAGE_CHECK_INTERVAL_SECONDS`

Memory debugger extension
~~~~~~~~~~~~~~~~~~~~~~~~~

.. module:: scrapy.extensions.memdebug
   :synopsis: Memory debugger extension

.. class:: MemoryDebugger

An extension for debugging memory usage. It collects information about:

* objects uncollected by the Python garbage collector
* objects left alive that shouldn't be. For more info, see :ref:`topics-leaks-trackrefs`

To enable this extension, turn on the :setting:`MEMDEBUG_ENABLED` setting. The
info will be stored in the stats.

.. _topics-extensions-ref-spiderstate:

Spider state extension
~~~~~~~~~~~~~~~~~~~~~~

.. module:: scrapy.extensions.spiderstate
   :synopsis: Spider state extension

.. class:: SpiderState

Manages spider state data by loading it before a crawl and saving it after.

Give a value to the :setting:`JOBDIR` setting to enable this extension.
When enabled, this extension manages the :attr:`~scrapy.Spider.state`
attribute of your :class:`~scrapy.Spider` instance:

-   When your spider closes (:signal:`spider_closed`), the contents of its
    :attr:`~scrapy.Spider.state` attribute are serialized into a file named
    ``spider.state`` in the :setting:`JOBDIR` folder.
-   When your spider opens (:signal:`spider_opened`), if a previously-generated
    ``spider.state`` file exists in the :setting:`JOBDIR` folder, it is loaded
    into the :attr:`~scrapy.Spider.state` attribute.

For an example, see :ref:`topics-keeping-persistent-state-between-batches`.

Close spider extension
~~~~~~~~~~~~~~~~~~~~~~

.. module:: scrapy.extensions.closespider
   :synopsis: Close spider extension

.. class:: CloseSpider

Closes a spider automatically when some conditions are met, using a specific
closing reason for each condition.

The conditions for closing a spider can be configured through the following
settings:

* :setting:`CLOSESPIDER_TIMEOUT`
* :setting:`CLOSESPIDER_TIMEOUT_NO_ITEM`
* :setting:`CLOSESPIDER_ITEMCOUNT`
* :setting:`CLOSESPIDER_PAGECOUNT`
* :setting:`CLOSESPIDER_ERRORCOUNT`

.. note::

   When a certain closing condition is met, requests which are
   currently in the downloader queue (up to :setting:`CONCURRENT_REQUESTS`
   requests) are still processed.

.. setting:: CLOSESPIDER_TIMEOUT

CLOSESPIDER_TIMEOUT
"""""""""""""""""""

Default: ``0``

An integer which specifies a number of seconds. If the spider remains open for
more than that number of seconds, it will be automatically closed with the
reason ``closespider_timeout``. If zero (or not set), spiders won't be closed
by timeout.

.. setting:: CLOSESPIDER_TIMEOUT_NO_ITEM

CLOSESPIDER_TIMEOUT_NO_ITEM
"""""""""""""""""""""""""""

Default: ``0``

An integer which specifies a number of seconds. If the spider has not produced
any items in the last number of seconds, it will be closed with the reason
``closespider_timeout_no_item``. If zero (or not set), spiders won't be closed
for not producing items.

.. setting:: CLOSESPIDER_ITEMCOUNT

CLOSESPIDER_ITEMCOUNT
"""""""""""""""""""""

Default: ``0``

An integer which specifies a number of items. If the spider scrapes more than
that amount and those items are passed by the item pipeline, the
spider will be closed with the reason ``closespider_itemcount``.
If zero (or not set), spiders won't be closed by number of passed items.

.. setting:: CLOSESPIDER_PAGECOUNT

CLOSESPIDER_PAGECOUNT
"""""""""""""""""""""

Default: ``0``

An integer which specifies the maximum number of responses to crawl. If the spider
crawls more than that, the spider will be closed with the reason
``closespider_pagecount``. If zero (or not set), spiders won't be closed by
number of crawled responses.

.. setting:: CLOSESPIDER_PAGECOUNT_NO_ITEM

CLOSESPIDER_PAGECOUNT_NO_ITEM
"""""""""""""""""""""""""""""

Default: ``0``

An integer which specifies the maximum number of consecutive responses to crawl
without items scraped. If the spider crawls more consecutive responses than that
and no items are scraped in the meantime, the spider will be closed with the
reason ``closespider_pagecount_no_item``. If zero (or not set), spiders won't be
closed by number of crawled responses with no items.

.. setting:: CLOSESPIDER_ERRORCOUNT

CLOSESPIDER_ERRORCOUNT
""""""""""""""""""""""

Default: ``0``

An integer which specifies the maximum number of errors to receive before
closing the spider. If the spider generates more than that number of errors,
it will be closed with the reason ``closespider_errorcount``. If zero (or not
set), spiders won't be closed by number of errors.

.. module:: scrapy.extensions.periodic_log
   :synopsis: Periodic stats logging

Periodic log extension
~~~~~~~~~~~~~~~~~~~~~~

.. class:: PeriodicLog

This extension periodically logs rich stat data as a JSON object::

    2023-08-04 02:30:57 [scrapy.extensions.logstats] INFO: Crawled 976 pages (at 162 pages/min), scraped 925 items (at 161 items/min)
    2023-08-04 02:30:57 [scrapy.extensions.periodic_log] INFO: {
        "delta": {
            "downloader/request_bytes": 55582,
            "downloader/request_count": 162,
            "downloader/request_method_count/GET": 162,
            "downloader/response_bytes": 618133,
            "downloader/response_count": 162,
            "downloader/response_status_count/200": 162,
            "item_scraped_count": 161
        },
        "stats": {
            "downloader/request_bytes": 338243,
            "downloader/request_count": 992,
            "downloader/request_method_count/GET": 992,
            "downloader/response_bytes": 3836736,
            "downloader/response_count": 976,
            "downloader/response_status_count/200": 976,
            "item_scraped_count": 925,
            "log_count/INFO": 21,
            "log_count/WARNING": 1,
            "scheduler/dequeued": 992,
            "scheduler/dequeued/memory": 992,
            "scheduler/enqueued": 1050,
            "scheduler/enqueued/memory": 1050
        },
        "time": {
            "elapsed": 360.008903,
            "log_interval": 60.0,
            "log_interval_real": 60.006694,
            "start_time": "2023-08-03 23:24:57",
            "utcnow": "2023-08-03 23:30:57"
        }
    }

This extension logs the following configurable sections:

-   ``"delta"`` shows how some numeric stats have changed since the last stats
    log message.

    The :setting:`PERIODIC_LOG_DELTA` setting determines the target stats. They
    must have ``int`` or ``float`` values.

-   ``"stats"`` shows the current value of some stats.

    The :setting:`PERIODIC_LOG_STATS` setting determines the target stats.

-   ``"time"`` shows detailed timing data.

    The :setting:`PERIODIC_LOG_TIMING_ENABLED` setting determines whether or
    not to show this section.

This extension logs data at the start, then on a fixed time interval
configurable through the :setting:`LOGSTATS_INTERVAL` setting, and finally
right before the crawl ends.

Example extension configuration:

.. code-block:: python

    custom_settings = {
        "LOG_LEVEL": "INFO",
        "PERIODIC_LOG_STATS": {
            "include": ["downloader/", "scheduler/", "log_count/", "item_scraped_count/"],
        },
        "PERIODIC_LOG_DELTA": {"include": ["downloader/"]},
        "PERIODIC_LOG_TIMING_ENABLED": True,
        "EXTENSIONS": {
            "scrapy.extensions.periodic_log.PeriodicLog": 0,
        },
    }

.. setting:: PERIODIC_LOG_DELTA

PERIODIC_LOG_DELTA
""""""""""""""""""

Default: ``None``

* ``"PERIODIC_LOG_DELTA": True`` - show deltas for all ``int`` and ``float`` stat values.
* ``"PERIODIC_LOG_DELTA": {"include": ["downloader/", "scheduler/"]}`` - show deltas for stats with names containing any configured substring.
* ``"PERIODIC_LOG_DELTA": {"exclude": ["downloader/"]}`` - show deltas for all stats with names not containing any configured substring.

.. setting:: PERIODIC_LOG_STATS

PERIODIC_LOG_STATS
""""""""""""""""""

Default: ``None``

* ``"PERIODIC_LOG_STATS": True`` - show the current value of all stats.
* ``"PERIODIC_LOG_STATS": {"include": ["downloader/", "scheduler/"]}`` - show current values for stats with names containing any configured substring.
* ``"PERIODIC_LOG_STATS": {"exclude": ["downloader/"]}`` - show current values for all stats with names not containing any configured substring.

.. setting:: PERIODIC_LOG_TIMING_ENABLED

PERIODIC_LOG_TIMING_ENABLED
"""""""""""""""""""""""""""

Default: ``False``

``True`` enables logging of timing data (i.e. the ``"time"`` section).

Debugging extensions
--------------------

Stack trace dump extension
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. module:: scrapy.extensions.debug
   :synopsis: Extensions for debugging Scrapy

.. class:: StackTraceDump

Dumps information about the running process when a `SIGQUIT`_ or `SIGUSR2`_
signal is received. The information dumped is the following:

1. engine status (using ``scrapy.utils.engine.get_engine_status()``)
2. live references (see :ref:`topics-leaks-trackrefs`)
3. stack trace of all threads

After the stack trace and engine status is dumped, the Scrapy process continues
running normally.

This extension only works on POSIX-compliant platforms (i.e. not Windows),
because the `SIGQUIT`_ and `SIGUSR2`_ signals are not available on Windows.

There are at least two ways to send Scrapy the `SIGQUIT`_ signal:

1. By pressing ``Ctrl-\`` while a Scrapy process is running (Linux only)
2. By running this command (assuming ``<pid>`` is the process id of the Scrapy
   process)::

    kill -QUIT <pid>

.. _SIGUSR2: https://en.wikipedia.org/wiki/SIGUSR1_and_SIGUSR2
.. _SIGQUIT: https://en.wikipedia.org/wiki/SIGQUIT

Debugger extension
~~~~~~~~~~~~~~~~~~

.. class:: Debugger

Invokes a :doc:`Python debugger <library/pdb>` inside a running Scrapy process when a `SIGUSR2`_
signal is received. After the debugger is exited, the Scrapy process continues
running normally.

This extension only works on POSIX-compliant platforms (i.e. not Windows).


.. _topics-feed-exports:

============
Feed exports
============

One of the most frequently required features when implementing scrapers is
being able to store the scraped data properly and, quite often, that means
generating an "export file" with the scraped data (commonly called "export
feed") to be consumed by other systems.

Scrapy provides this functionality out of the box with the Feed Exports, which
allows you to generate feeds with the scraped items, using multiple
serialization formats and storage backends.

This page provides detailed documentation for all feed export features. If you
are looking for a step-by-step guide, check out `Zyte’s export guides`_.

.. _Zyte’s export guides: https://docs.zyte.com/web-scraping/guides/export/index.html#exporting-scraped-data

.. _topics-feed-format:

Serialization formats
=====================

For serializing the scraped data, the feed exports use the :ref:`Item exporters
<topics-exporters>`. These formats are supported out of the box:

-   :ref:`topics-feed-format-json`
-   :ref:`topics-feed-format-jsonlines`
-   :ref:`topics-feed-format-csv`
-   :ref:`topics-feed-format-xml`

But you can also extend the supported formats through the
:setting:`FEED_EXPORTERS` setting.

.. _topics-feed-format-json:

JSON
----

-   Value for the ``format`` key in the :setting:`FEEDS` setting: ``json``

-   Exporter used: :class:`~scrapy.exporters.JsonItemExporter`

-   See :ref:`this warning <json-with-large-data>` if you're using JSON with
    large feeds.

.. _topics-feed-format-jsonlines:

JSON lines
----------

-   Value for the ``format`` key in the :setting:`FEEDS` setting: ``jsonlines``
-   Exporter used: :class:`~scrapy.exporters.JsonLinesItemExporter`

.. _topics-feed-format-csv:

CSV
---

-   Value for the ``format`` key in the :setting:`FEEDS` setting: ``csv``

-   Exporter used: :class:`~scrapy.exporters.CsvItemExporter`

-   To specify columns to export, their order and their column names, use
    :setting:`FEED_EXPORT_FIELDS`. Other feed exporters can also use this
    option, but it is important for CSV because unlike many other export
    formats CSV uses a fixed header.

.. _topics-feed-format-xml:

XML
---

-   Value for the ``format`` key in the :setting:`FEEDS` setting: ``xml``
-   Exporter used: :class:`~scrapy.exporters.XmlItemExporter`

.. _topics-feed-format-pickle:

Pickle
------

-   Value for the ``format`` key in the :setting:`FEEDS` setting: ``pickle``
-   Exporter used: :class:`~scrapy.exporters.PickleItemExporter`

.. _topics-feed-format-marshal:

Marshal
-------

-   Value for the ``format`` key in the :setting:`FEEDS` setting: ``marshal``
-   Exporter used: :class:`~scrapy.exporters.MarshalItemExporter`

.. _topics-feed-storage:

Storages
========

When using the feed exports you define where to store the feed using one or
multiple URIs_ (through the :setting:`FEEDS` setting). The feed exports
support multiple storage backend types which are defined by the URI scheme.

The storage backends supported out of the box are:

-   :ref:`topics-feed-storage-fs`
-   :ref:`topics-feed-storage-ftp`
-   :ref:`topics-feed-storage-s3` (requires boto3_)
-   :ref:`topics-feed-storage-gcs` (requires `google-cloud-storage`_)
-   :ref:`topics-feed-storage-stdout`

Some storage backends may be unavailable if the required external libraries are
not available. For example, the S3 backend is only available if the boto3_
library is installed.

.. _topics-feed-uri-params:

Storage URI parameters
======================

The storage URI can also contain parameters that get replaced when the feed is
being created. These parameters are:

-   ``%(time)s`` - gets replaced by a timestamp when the feed is being created
-   ``%(name)s`` - gets replaced by the spider name

Any other named parameter gets replaced by the spider attribute of the same
name. For example, ``%(site_id)s`` would get replaced by the ``spider.site_id``
attribute the moment the feed is being created.

Here are some examples to illustrate:

-   Store in FTP using one directory per spider:

    -   ``ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json``

-   Store in S3 using one directory per spider:

    -   ``s3://mybucket/scraping/feeds/%(name)s/%(time)s.json``

.. note:: :ref:`Spider arguments <spiderargs>` become spider attributes, hence
          they can also be used as storage URI parameters.

.. _topics-feed-storage-backends:

Storage backends
================

.. _topics-feed-storage-fs:

Local filesystem
----------------

The feeds are stored in the local filesystem.

-   URI scheme: ``file``
-   Example URI: ``file:///tmp/export.csv``
-   Required external libraries: none

Note that for the local filesystem storage (only) you can omit the scheme if
you specify an absolute path like ``/tmp/export.csv`` (Unix systems only).
Alternatively you can also use a :class:`pathlib.Path` object.

.. _topics-feed-storage-ftp:

FTP
---

The feeds are stored on an FTP server.

-   URI scheme: ``ftp``
-   Example URI: ``ftp://user:pass@ftp.example.com/path/to/export.csv``
-   Required external libraries: none

FTP supports two different connection modes: `active or passive
<https://stackoverflow.com/a/1699163>`_. Scrapy uses the passive connection
mode by default. To use the active connection mode instead, set the
:setting:`FEED_STORAGE_FTP_ACTIVE` setting to ``True``.

The default value for the ``overwrite`` key in the :setting:`FEEDS` for this
storage backend is: ``True``.

.. caution:: The value ``True`` in ``overwrite`` will cause you to lose the
     previous version of your data.

This storage backend uses :ref:`delayed file delivery <delayed-file-delivery>`.

.. _topics-feed-storage-s3:

S3
--

The feeds are stored on `Amazon S3`_.

-   URI scheme: ``s3``

-   Example URIs:

    -   ``s3://mybucket/path/to/export.csv``

    -   ``s3://aws_key:aws_secret@mybucket/path/to/export.csv``

-   Required external libraries: `boto3`_ >= 1.20.0

The AWS credentials can be passed as user/password in the URI, or they can be
passed through the following settings:

-   :setting:`AWS_ACCESS_KEY_ID`
-   :setting:`AWS_SECRET_ACCESS_KEY`
-   :setting:`AWS_SESSION_TOKEN` (only needed for `temporary security credentials`_)

.. _temporary security credentials: https://docs.aws.amazon.com/IAM/latest/UserGuide/security-creds.html

You can also define a custom ACL, custom endpoint, and region name for exported
feeds using these settings:

-   :setting:`FEED_STORAGE_S3_ACL`
-   :setting:`AWS_ENDPOINT_URL`
-   :setting:`AWS_REGION_NAME`

The default value for the ``overwrite`` key in the :setting:`FEEDS` for this
storage backend is: ``True``.

.. caution:: The value ``True`` in ``overwrite`` will cause you to lose the
     previous version of your data.

This storage backend uses :ref:`delayed file delivery <delayed-file-delivery>`.

.. _topics-feed-storage-gcs:

Google Cloud Storage (GCS)
--------------------------

The feeds are stored on `Google Cloud Storage`_.

-   URI scheme: ``gs``

-   Example URIs:

    -   ``gs://mybucket/path/to/export.csv``

-   Required external libraries: `google-cloud-storage`_.

For more information about authentication, please refer to the `Google Cloud documentation <https://cloud.google.com/docs/authentication>`_.

You can set a *Project ID* and *Access Control List (ACL)* through the following settings:

-   :setting:`FEED_STORAGE_GCS_ACL`
-   :setting:`GCS_PROJECT_ID`

The default value for the ``overwrite`` key in the :setting:`FEEDS` for this
storage backend is: ``True``.

.. caution:: The value ``True`` in ``overwrite`` will cause you to lose the
     previous version of your data.

This storage backend uses :ref:`delayed file delivery <delayed-file-delivery>`.

.. _google-cloud-storage: https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-python

.. _topics-feed-storage-stdout:

Standard output
---------------

The feeds are written to the standard output of the Scrapy process.

-   URI scheme: ``stdout``
-   Example URI: ``stdout:``
-   Required external libraries: none

.. _delayed-file-delivery:

Delayed file delivery
---------------------

As indicated above, some of the described storage backends use delayed file
delivery.

These storage backends do not upload items to the feed URI as those items are
scraped. Instead, Scrapy writes items into a temporary local file, and only
once all the file contents have been written (i.e. at the end of the crawl) is
that file uploaded to the feed URI.

If you want item delivery to start earlier when using one of these storage
backends, use :setting:`FEED_EXPORT_BATCH_ITEM_COUNT` to split the output items
in multiple files, with the specified maximum item count per file. That way, as
soon as a file reaches the maximum item count, that file is delivered to the
feed URI, allowing item delivery to start way before the end of the crawl.

.. _item-filter:

Item filtering
==============

You can filter items that you want to allow for a particular feed by using the
``item_classes`` option in :ref:`feeds options <feed-options>`. Only items of
the specified types will be added to the feed.

The ``item_classes`` option is implemented by the :class:`~scrapy.extensions.feedexport.ItemFilter`
class, which is the default value of the ``item_filter`` :ref:`feed option <feed-options>`.

You can create your own custom filtering class by implementing the
``accepts`` method of :class:`~scrapy.extensions.feedexport.ItemFilter` and
accepting ``feed_options`` as a constructor argument.

For instance:

.. code-block:: python

    class MyCustomFilter:
        def __init__(self, feed_options):
            self.feed_options = feed_options

        def accepts(self, item):
            if "field1" in item and item["field1"] == "expected_data":
                return True
            return False

You can assign your custom filtering class to the ``item_filter`` :ref:`option of a feed <feed-options>`.
See :setting:`FEEDS` for examples.

ItemFilter
----------

.. autoclass:: scrapy.extensions.feedexport.ItemFilter
   :members:

.. _post-processing:

Post-Processing
===============

Scrapy provides an option to activate plugins to post-process feeds before they are exported
to feed storages. In addition to using :ref:`builtin plugins <builtin-plugins>`, you
can create your own :ref:`plugins <custom-plugins>`.

These plugins can be activated through the ``postprocessing`` option of a feed.
The option must be passed a list of post-processing plugins in the order you want
the feed to be processed. These plugins can be declared either as an import string
or with the imported class of the plugin. Parameters to plugins can be passed
through the feed options. See :ref:`feed options <feed-options>` for examples.

.. _builtin-plugins:

Built-in Plugins
----------------

.. autoclass:: scrapy.extensions.postprocessing.GzipPlugin

.. autoclass:: scrapy.extensions.postprocessing.LZMAPlugin

.. autoclass:: scrapy.extensions.postprocessing.Bz2Plugin

.. _custom-plugins:

Custom Plugins
--------------

Each plugin is a class that must implement the following methods:

.. method:: __init__(self, file, feed_options)

    Initialize the plugin.

    :param file: file-like object having at least the ``write``, ``tell`` and ``close`` methods implemented

    :param feed_options: feed-specific :ref:`options <feed-options>`
    :type feed_options: :class:`dict`

.. method:: write(self, data)

   Process and write ``data`` (:class:`bytes` or :class:`memoryview`) into the
   plugin's target file. It must return the number of bytes written.

.. method:: close(self)

    Clean up the plugin.

    For example, you might want to close a file wrapper that you might have
    used to compress data written into the file received in the ``__init__``
    method.

    .. warning:: Do not close the file from the ``__init__`` method.

To pass a parameter to your plugin, use :ref:`feed options <feed-options>`. You
can then access those parameters from the ``__init__`` method of your plugin.
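
Putting it all together, a custom plugin could look like this (a minimal
sketch; ``uppercase`` is a hypothetical, user-defined feed option, and real
plugins would typically wrap the file to compress or otherwise transform the
data):

.. code-block:: python

    class UppercasePlugin:
        def __init__(self, file, feed_options):
            self.file = file
            self.feed_options = feed_options
            # Read a custom parameter passed through the feed options.
            self.enabled = self.feed_options.get("uppercase", False)

        def write(self, data):
            if self.enabled:
                data = bytes(data).upper()
            # Return the number of bytes written, as required.
            return self.file.write(data)

        def close(self):
            # Nothing to clean up here. A plugin that wraps the file in,
            # e.g., a compressing file object would close that wrapper
            # here instead; the file must not be closed in __init__.
            pass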

Settings
========

These are the settings used for configuring the feed exports:

-   :setting:`FEEDS` (mandatory)
-   :setting:`FEED_EXPORT_ENCODING`
-   :setting:`FEED_STORE_EMPTY`
-   :setting:`FEED_EXPORT_FIELDS`
-   :setting:`FEED_EXPORT_INDENT`
-   :setting:`FEED_STORAGES`
-   :setting:`FEED_STORAGE_FTP_ACTIVE`
-   :setting:`FEED_STORAGE_S3_ACL`
-   :setting:`FEED_EXPORTERS`
-   :setting:`FEED_EXPORT_BATCH_ITEM_COUNT`

.. currentmodule:: scrapy.extensions.feedexport

.. setting:: FEEDS

FEEDS
-----

Default: ``{}``

A dictionary in which every key is a feed URI (or a :class:`pathlib.Path`
object) and each value is a nested dictionary containing configuration
parameters for the specific feed.

This setting is required for enabling the feed export feature.

See :ref:`topics-feed-storage-backends` for supported URI schemes.

For instance::

    {
        'items.json': {
            'format': 'json',
            'encoding': 'utf8',
            'store_empty': False,
            'item_classes': [MyItemClass1, 'myproject.items.MyItemClass2'],
            'fields': None,
            'indent': 4,
            'item_export_kwargs': {
               'export_empty_fields': True,
            },
        },
        '/home/user/documents/items.xml': {
            'format': 'xml',
            'fields': ['name', 'price'],
            'item_filter': MyCustomFilter1,
            'encoding': 'latin1',
            'indent': 8,
        },
        pathlib.Path('items.csv.gz'): {
            'format': 'csv',
            'fields': ['price', 'name'],
            'item_filter': 'myproject.filters.MyCustomFilter2',
            'postprocessing': [MyPlugin1, 'scrapy.extensions.postprocessing.GzipPlugin'],
            'gzip_compresslevel': 5,
        },
    }

.. _feed-options:

The following is a list of the accepted keys and the setting that is used
as a fallback value if that key is not provided for a specific feed definition:

-   ``format``: the :ref:`serialization format <topics-feed-format>`.

    This setting is mandatory; there is no fallback value.

-   ``batch_item_count``: falls back to
    :setting:`FEED_EXPORT_BATCH_ITEM_COUNT`.

-   ``encoding``: falls back to :setting:`FEED_EXPORT_ENCODING`.

-   ``fields``: falls back to :setting:`FEED_EXPORT_FIELDS`.

-   ``item_classes``: list of :ref:`item classes <topics-items>` to export.

    If undefined or empty, all items are exported.

-   ``item_filter``: a :ref:`filter class <item-filter>` to filter items to export.

    :class:`~scrapy.extensions.feedexport.ItemFilter` is used by default.

-   ``indent``: falls back to :setting:`FEED_EXPORT_INDENT`.

-   ``item_export_kwargs``: :class:`dict` with keyword arguments for the corresponding :ref:`item exporter class <topics-exporters>`.

-   ``overwrite``: whether to overwrite the file if it already exists
    (``True``) or append to its content (``False``).

    The default value depends on the :ref:`storage backend
    <topics-feed-storage-backends>`:

    -   :ref:`topics-feed-storage-fs`: ``False``

    -   :ref:`topics-feed-storage-ftp`: ``True``

        .. note:: Some FTP servers may not support appending to files (the
                  ``APPE`` FTP command).

    -   :ref:`topics-feed-storage-s3`: ``True`` (appending is not supported)

    -   :ref:`topics-feed-storage-gcs`: ``True`` (appending is not supported)

    -   :ref:`topics-feed-storage-stdout`: ``False`` (overwriting is not supported)

-   ``store_empty``: falls back to :setting:`FEED_STORE_EMPTY`.

-   ``uri_params``: falls back to :setting:`FEED_URI_PARAMS`.

-   ``postprocessing``: list of :ref:`plugins <post-processing>` to use for post-processing.

    The plugins will be used in the order of the list passed.

.. setting:: FEED_EXPORT_ENCODING

FEED_EXPORT_ENCODING
--------------------

Default: ``"utf-8"`` (:ref:`fallback <default-settings>`: ``None``)

The encoding to be used for the feed.

If set to ``None``, it uses UTF-8 for everything except JSON output, which uses
safe numeric encoding (``\uXXXX`` sequences) for historic reasons.

Use ``"utf-8"`` if you want UTF-8 for JSON too.

.. versionchanged:: 2.8
   The :command:`startproject` command now sets this setting to
   ``"utf-8"`` in the generated ``settings.py`` file.

.. setting:: FEED_EXPORT_FIELDS

FEED_EXPORT_FIELDS
------------------

Default: ``None``

Use the ``FEED_EXPORT_FIELDS`` setting to define the fields to export, their
order and their output names. See :attr:`BaseItemExporter.fields_to_export
<scrapy.exporters.BaseItemExporter.fields_to_export>` for more information.
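
For example, to export only the ``name`` and ``price`` fields, in that order
(a minimal sketch; the field names are placeholders), you could set:

.. code-block:: python

    FEED_EXPORT_FIELDS = ["name", "price"]

The setting also accepts a dictionary that maps field names to the names to
use in the output, e.g. ``{"name": "Name", "price": "Price"}``.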

.. setting:: FEED_EXPORT_INDENT

FEED_EXPORT_INDENT
------------------

Default: ``0``

Amount of spaces used to indent the output on each level. If ``FEED_EXPORT_INDENT``
is a non-negative integer, then array elements and object members will be pretty-printed
with that indent level. An indent level of ``0`` (the default), or negative,
will put each item on a new line. ``None`` selects the most compact representation.

Currently implemented only by :class:`~scrapy.exporters.JsonItemExporter`
and :class:`~scrapy.exporters.XmlItemExporter`, i.e. when you are exporting
to ``.json`` or ``.xml``.

.. setting:: FEED_STORE_EMPTY

FEED_STORE_EMPTY
----------------

Default: ``True``

Whether to export empty feeds (i.e. feeds with no items).
If ``False``, and there are no items to export, no new files are created and
existing files are not modified, even if the :ref:`overwrite feed option
<feed-options>` is enabled.

.. setting:: FEED_STORAGES

FEED_STORAGES
-------------

Default: ``{}``

A dict containing additional feed storage backends supported by your project.
The keys are URI schemes and the values are paths to storage classes.
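
For example, to register a custom storage backend for a new URI scheme (a
sketch; the ``sftp`` scheme and the class path are hypothetical):

.. code-block:: python

    FEED_STORAGES = {
        "sftp": "myproject.storages.SFTPFeedStorage",
    }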

.. setting:: FEED_STORAGE_FTP_ACTIVE

FEED_STORAGE_FTP_ACTIVE
-----------------------

Default: ``False``

Whether to use the active connection mode when exporting feeds to an FTP server
(``True``) or use the passive connection mode instead (``False``, default).

For information about FTP connection modes, see `What is the difference between
active and passive FTP? <https://stackoverflow.com/a/1699163>`_.

.. setting:: FEED_STORAGE_S3_ACL

FEED_STORAGE_S3_ACL
-------------------

Default: ``''`` (empty string)

A string containing a custom ACL for feeds exported to Amazon S3 by your project.

For a complete list of available values, access the `Canned ACL`_ section on Amazon S3 docs.
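
For instance, to make the exported feed files publicly readable (a sketch;
``public-read`` is one of the canned ACLs listed in the Amazon S3 docs):

.. code-block:: python

    FEED_STORAGE_S3_ACL = "public-read"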

.. setting:: FEED_STORAGES_BASE

FEED_STORAGES_BASE
------------------

Default:

.. code-block:: python

    {
        "": "scrapy.extensions.feedexport.FileFeedStorage",
        "file": "scrapy.extensions.feedexport.FileFeedStorage",
        "stdout": "scrapy.extensions.feedexport.StdoutFeedStorage",
        "s3": "scrapy.extensions.feedexport.S3FeedStorage",
        "ftp": "scrapy.extensions.feedexport.FTPFeedStorage",
    }

A dict containing the built-in feed storage backends supported by Scrapy. You
can disable any of these backends by assigning ``None`` to their URI scheme in
:setting:`FEED_STORAGES`. E.g., to disable the built-in FTP storage backend
(without replacement), place this in your ``settings.py``:

.. code-block:: python

    FEED_STORAGES = {
        "ftp": None,
    }

.. setting:: FEED_EXPORTERS

FEED_EXPORTERS
--------------

Default: ``{}``

A dict containing additional exporters supported by your project. The keys are
serialization formats and the values are paths to :ref:`Item exporter
<topics-exporters>` classes.
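
For example, to register a custom exporter for a new serialization format (a
sketch; the format name and the class path are hypothetical):

.. code-block:: python

    FEED_EXPORTERS = {
        "myformat": "myproject.exporters.MyFormatItemExporter",
    }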

.. setting:: FEED_EXPORTERS_BASE

FEED_EXPORTERS_BASE
-------------------

Default:

.. code-block:: python

    {
        "json": "scrapy.exporters.JsonItemExporter",
        "jsonlines": "scrapy.exporters.JsonLinesItemExporter",
        "jsonl": "scrapy.exporters.JsonLinesItemExporter",
        "jl": "scrapy.exporters.JsonLinesItemExporter",
        "csv": "scrapy.exporters.CsvItemExporter",
        "xml": "scrapy.exporters.XmlItemExporter",
        "marshal": "scrapy.exporters.MarshalItemExporter",
        "pickle": "scrapy.exporters.PickleItemExporter",
    }

A dict containing the built-in feed exporters supported by Scrapy. You can
disable any of these exporters by assigning ``None`` to their serialization
format in :setting:`FEED_EXPORTERS`. E.g., to disable the built-in CSV exporter
(without replacement), place this in your ``settings.py``:

.. code-block:: python

    FEED_EXPORTERS = {
        "csv": None,
    }

.. setting:: FEED_EXPORT_BATCH_ITEM_COUNT

FEED_EXPORT_BATCH_ITEM_COUNT
----------------------------

Default: ``0``

If assigned an integer number higher than ``0``, Scrapy generates multiple output files
storing up to the specified number of items in each output file.

When generating multiple output files, you must use at least one of the following
placeholders in the feed URI to indicate how the different output file names are
generated:

* ``%(batch_time)s`` - gets replaced by a timestamp when the feed is being created
  (e.g. ``2020-03-28T14-45-08.237134``)

* ``%(batch_id)d`` - gets replaced by the 1-based sequence number of the batch.

  Use :ref:`printf-style string formatting <python:old-string-formatting>` to
  alter the number format. For example, to make the batch ID a 5-digit
  number by introducing leading zeroes as needed, use ``%(batch_id)05d``
  (e.g. ``3`` becomes ``00003``, ``123`` becomes ``00123``).

For instance, if your settings include:

.. code-block:: python

    FEED_EXPORT_BATCH_ITEM_COUNT = 100

And your :command:`crawl` command line is::

    scrapy crawl spidername -o "dirname/%(batch_id)d-filename%(batch_time)s.json"

The command line above can generate a directory tree like::

    ->projectname
    -->dirname
    --->1-filename2020-03-28T14-45-08.237134.json
    --->2-filename2020-03-28T14-45-09.148903.json
    --->3-filename2020-03-28T14-45-10.046092.json

The first and second files contain exactly 100 items each. The last one
contains 100 items or fewer.

.. setting:: FEED_URI_PARAMS

FEED_URI_PARAMS
---------------

Default: ``None``

A string with the import path of a function to set the parameters to apply with
:ref:`printf-style string formatting <python:old-string-formatting>` to the
feed URI.

The function signature should be as follows:

.. function:: uri_params(params, spider)

   Return a :class:`dict` of key-value pairs to apply to the feed URI using
   :ref:`printf-style string formatting <python:old-string-formatting>`.

   :param params: default key-value pairs

        Specifically:

        -   ``batch_id``: ID of the file batch. See
            :setting:`FEED_EXPORT_BATCH_ITEM_COUNT`.

            If :setting:`FEED_EXPORT_BATCH_ITEM_COUNT` is ``0``, ``batch_id``
            is always ``1``.

        -   ``batch_time``: UTC date and time, in ISO format with ``:``
            replaced with ``-``.

            See :setting:`FEED_EXPORT_BATCH_ITEM_COUNT`.

        -   ``time``: ``batch_time``, with microseconds set to ``0``.
   :type params: dict

   :param spider: source spider of the feed items
   :type spider: scrapy.Spider

   .. caution:: The function should return a new dictionary; modifying
                the received ``params`` in-place is deprecated.

For example, to include the :attr:`name <scrapy.Spider.name>` of the
source spider in the feed URI:

#.  Define the following function somewhere in your project:

    .. code-block:: python

        # myproject/utils.py
        def uri_params(params, spider):
            return {**params, "spider_name": spider.name}

#.  Point :setting:`FEED_URI_PARAMS` to that function in your settings:

    .. code-block:: python

        # myproject/settings.py
        FEED_URI_PARAMS = "myproject.utils.uri_params"

#.  Use ``%(spider_name)s`` in your feed URI::

        scrapy crawl <spider_name> -o "%(spider_name)s.jsonl"

.. _URIs: https://en.wikipedia.org/wiki/Uniform_Resource_Identifier
.. _Amazon S3: https://aws.amazon.com/s3/
.. _boto3: https://github.com/boto/boto3
.. _Canned ACL: https://docs.aws.amazon.com/AmazonS3/latest/userguide/acl-overview.html#canned-acl
.. _Google Cloud Storage: https://cloud.google.com/storage/


.. _topics-item-pipeline:

=============
Item Pipeline
=============

After an item has been scraped by a spider, it is sent to the Item Pipeline
which processes it through several components that are executed sequentially.

Each item pipeline component (sometimes referred to simply as an "item
pipeline") is a Python class that implements a simple method. It receives an
item and performs
an action over it, also deciding if the item should continue through the
pipeline or be dropped and no longer processed.

Typical uses of item pipelines are:

* cleansing HTML data
* validating scraped data (checking that the items contain certain fields)
* checking for duplicates (and dropping them)
* storing the scraped item in a database

Writing your own item pipeline
==============================

Each item pipeline is a :ref:`component <topics-components>` that must
implement the following method:

.. method:: process_item(self, item)

   This method is called for every item pipeline component.

   `item` is an :ref:`item object <item-types>`, see
   :ref:`supporting-item-types`.

   :meth:`process_item` must either return an :ref:`item object <item-types>`
   or raise a :exc:`~scrapy.exceptions.DropItem` exception.

   Dropped items are no longer processed by further pipeline components.

   :param item: the scraped item
   :type item: :ref:`item object <item-types>`

Additionally, they may also implement the following methods:

.. method:: open_spider(self)

   This method is called when the spider is opened.

.. method:: close_spider(self)

   This method is called when the spider is closed.

Any of these methods may be defined as a coroutine function (``async def``).
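
For example, a pipeline that needs to await asynchronous work in
:meth:`process_item` (a minimal sketch; the ``asyncio.sleep`` call is a
stand-in for real asynchronous work, such as an async database query):

.. code-block:: python

    import asyncio

    class AsyncPipeline:
        async def process_item(self, item):
            # Stand-in for real asynchronous work.
            await asyncio.sleep(0.1)
            return item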

Item pipeline example
=====================

Price validation and dropping items with no prices
--------------------------------------------------

Let's take a look at the following hypothetical pipeline that adjusts the
``price`` attribute for those items that do not include VAT
(``price_excludes_vat`` attribute), and drops those items which don't
contain a price:

.. code-block:: python

    from itemadapter import ItemAdapter
    from scrapy.exceptions import DropItem

    class PricePipeline:
        vat_factor = 1.15

        def process_item(self, item):
            adapter = ItemAdapter(item)
            if adapter.get("price"):
                if adapter.get("price_excludes_vat"):
                    adapter["price"] = adapter["price"] * self.vat_factor
                return item
            else:
                raise DropItem("Missing price")

Write items to a JSON lines file
--------------------------------

The following pipeline stores all scraped items (from all spiders) into a
single ``items.jsonl`` file, containing one item per line serialized in JSON
format:

.. code-block:: python

   import json

   from itemadapter import ItemAdapter

   class JsonWriterPipeline:
       def open_spider(self):
           self.file = open("items.jsonl", "w")

       def close_spider(self):
           self.file.close()

       def process_item(self, item):
           line = json.dumps(ItemAdapter(item).asdict()) + "\n"
           self.file.write(line)
           return item

.. note:: The purpose of JsonWriterPipeline is just to introduce how to write
   item pipelines. If you really want to store all scraped items into a JSON
   file you should use the :ref:`Feed exports <topics-feed-exports>`.

Write items to MongoDB
----------------------

In this example we'll write items to MongoDB_ using pymongo_. The MongoDB
address and database name are specified in Scrapy settings, and items are
written to the collection named by the ``collection_name`` class attribute.

The main point of this example is to show how to :ref:`get the crawler
<from-crawler>` and how to clean up the resources properly.

.. skip: next
.. code-block:: python

    import pymongo
    from itemadapter import ItemAdapter

    class MongoPipeline:
        collection_name = "scrapy_items"

        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                mongo_uri=crawler.settings.get("MONGO_URI"),
                mongo_db=crawler.settings.get("MONGO_DATABASE", "items"),
            )

        def open_spider(self):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def close_spider(self):
            self.client.close()

        def process_item(self, item):
            self.db[self.collection_name].insert_one(ItemAdapter(item).asdict())
            return item

.. _MongoDB: https://www.mongodb.com/
.. _pymongo: https://pymongo.readthedocs.io/en/stable/

.. _ScreenshotPipeline:

Take screenshot of item
-----------------------

This example demonstrates how to use :doc:`coroutine syntax <coroutines>` in
the :meth:`process_item` method.

This item pipeline makes a request to a locally-running instance of Splash_ to
render a screenshot of the item URL. After the request response is downloaded,
the item pipeline saves the screenshot to a file and adds the filename to the
item.

.. code-block:: python

    import hashlib
    from pathlib import Path
    from urllib.parse import quote

    import scrapy
    from itemadapter import ItemAdapter
    from scrapy.http.request import NO_CALLBACK

    class ScreenshotPipeline:
        """Pipeline that uses Splash to render screenshot of
        every Scrapy item."""

        SPLASH_URL = "http://localhost:8050/render.png?url={}"

        def __init__(self, crawler):
            self.crawler = crawler

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)

        async def process_item(self, item):
            adapter = ItemAdapter(item)
            encoded_item_url = quote(adapter["url"])
            screenshot_url = self.SPLASH_URL.format(encoded_item_url)
            request = scrapy.Request(screenshot_url, callback=NO_CALLBACK)
            response = await self.crawler.engine.download_async(request)

            if response.status != 200:
                # Error happened, return item.
                return item

            # Save screenshot to file, filename will be hash of url.
            url = adapter["url"]
            url_hash = hashlib.md5(url.encode("utf8")).hexdigest()
            filename = f"{url_hash}.png"
            Path(filename).write_bytes(response.body)

            # Store filename in item.
            adapter["screenshot_filename"] = filename
            return item

.. _Splash: https://splash.readthedocs.io/en/stable/

Duplicates filter
-----------------

A filter that looks for duplicate items, and drops those items that were
already processed. Let's say that our items have a unique id, but our spider
returns multiple items with the same id:

.. code-block:: python

    from itemadapter import ItemAdapter
    from scrapy.exceptions import DropItem

    class DuplicatesPipeline:
        def __init__(self):
            self.ids_seen = set()

        def process_item(self, item):
            adapter = ItemAdapter(item)
            if adapter["id"] in self.ids_seen:
                raise DropItem(f"Item ID already seen: {adapter['id']}")
            else:
                self.ids_seen.add(adapter["id"])
                return item

Activating an Item Pipeline component
=====================================

To activate an Item Pipeline component you must add its class to the
:setting:`ITEM_PIPELINES` setting, like in the following example:

.. code-block:: python

   ITEM_PIPELINES = {
       "myproject.pipelines.PricePipeline": 300,
       "myproject.pipelines.JsonWriterPipeline": 800,
   }

The integer values you assign to classes in this setting determine the
order in which they run: items go through from lower valued to higher
valued classes. It's customary to define these numbers in the 0-1000 range.


.. _topics-items:

=====
Items
=====

.. module:: scrapy.item
   :synopsis: Item and Field classes

The main goal in scraping is to extract structured data from unstructured
sources, typically, web pages. :ref:`Spiders <topics-spiders>` may return the
extracted data as `items`, Python objects that define key-value pairs.

Scrapy supports :ref:`multiple types of items <item-types>`. When you create an
item, you may use whichever type of item you want. When you write code that
receives an item, your code should :ref:`work for any item type
<supporting-item-types>`.

.. _item-types:

Item Types
==========

Scrapy supports the following types of items, via the `itemadapter`_ library:
:ref:`dictionaries <dict-items>`, :ref:`Item objects <item-objects>`,
:ref:`dataclass objects <dataclass-items>`, and :ref:`attrs objects <attrs-items>`.

.. _itemadapter: https://github.com/scrapy/itemadapter

.. _dict-items:

Dictionaries
------------

As an item type, :class:`dict` is convenient and familiar.

.. _item-objects:

Item objects
------------

:class:`Item` provides a :class:`dict`-like API plus additional features that
make it the most feature-complete item type:

.. autoclass:: scrapy.Item
   :members: copy, deepcopy, fields
   :undoc-members:

:class:`Item` objects replicate the standard :class:`dict` API, including
its ``__init__`` method.

:class:`Item` allows the defining of field names, so that:

-   :class:`KeyError` is raised when using undefined field names (i.e.
    prevents typos going unnoticed)

-   :ref:`Item exporters <topics-exporters>` can export all fields by
    default even if the first scraped object does not have values for all
    of them

:class:`Item` also allows the defining of field metadata, which can be used to
:ref:`customize serialization <topics-exporters-field-serialization>`.

:mod:`trackref` tracks :class:`Item` objects to help find memory leaks
(see :ref:`topics-leaks-trackrefs`).

Example:

.. code-block:: python

    from scrapy.item import Item, Field

    class CustomItem(Item):
        one_field = Field()
        another_field = Field()

.. _dataclass-items:

Dataclass objects
-----------------

:func:`~dataclasses.dataclass` allows the defining of item classes with field names,
so that :ref:`item exporters <topics-exporters>` can export all fields by
default even if the first scraped object does not have values for all of them.

Additionally, ``dataclass`` items also allow you to:

* define the type and default value of each defined field.

* define custom field metadata through :func:`dataclasses.field`, which can be used to
  :ref:`customize serialization <topics-exporters-field-serialization>`.

Example:

.. code-block:: python

    from dataclasses import dataclass

    @dataclass
    class CustomItem:
        one_field: str
        another_field: int

.. note:: Field types are not enforced at run time.

.. _attrs-items:

attr.s objects
--------------

:func:`attr.s` allows the defining of item classes with field names,
so that :ref:`item exporters <topics-exporters>` can export all fields by
default even if the first scraped object does not have values for all of them.

Additionally, ``attr.s`` items also allow you to:

* define the type and default value of each defined field.

* define custom field :ref:`metadata <attrs:metadata>`, which can be used to
  :ref:`customize serialization <topics-exporters-field-serialization>`.

In order to use this type, the :doc:`attrs package <attrs:index>` needs to be installed.

Example:

.. code-block:: python

    import attr

    @attr.s
    class CustomItem:
        one_field = attr.ib()
        another_field = attr.ib()

.. _pydantic-items:

Pydantic models
---------------

`Pydantic <https://docs.pydantic.dev/>`_ models allow the defining of item
classes with field names, so that :ref:`item exporters <topics-exporters>` can
export all fields by default even if the first scraped object does not have
values for all of them.

Additionally, ``pydantic`` items also allow you to:

* define the type and default value of each defined field with run-time type
  validation.

* define custom field metadata through `pydantic.Field
  <https://docs.pydantic.dev/latest/concepts/fields/>`_, which can be used to
  :ref:`customize serialization <topics-exporters-field-serialization>`.

* benefit from automatic data validation and conversion based on type
  annotations.

In order to use this type, the `pydantic package <https://docs.pydantic.dev/>`_
needs to be installed.

Example:

.. code-block:: python

    from pydantic import BaseModel, Field

    class CustomItem(BaseModel):
        one_field: str = Field(default="", description="First field")
        another_field: int = Field(default=0, description="Second field")

.. note:: Unlike other item types, Pydantic models enforce field types at
    run time and will raise validation errors for invalid data types.

Working with Item objects
=========================

.. _topics-items-declaring:

Declaring Item subclasses
-------------------------

Item subclasses are declared using a simple class definition syntax and
:class:`Field` objects. Here is an example:

.. code-block:: python

    import scrapy

    class Product(scrapy.Item):
        name = scrapy.Field()
        price = scrapy.Field()
        stock = scrapy.Field()
        tags = scrapy.Field()
        last_updated = scrapy.Field(serializer=str)

.. note:: Those familiar with `Django`_ will notice that Scrapy Items are
   declared similar to `Django Models`_, except that Scrapy Items are much
   simpler as there is no concept of different field types.

.. _Django: https://www.djangoproject.com/
.. _Django Models: https://docs.djangoproject.com/en/dev/topics/db/models/

.. _topics-items-fields:

Declaring fields
----------------

:class:`Field` objects are used to specify metadata for each field. For
example, the serializer function for the ``last_updated`` field illustrated in
the example above.

You can specify any kind of metadata for each field. There is no restriction on
the values accepted by :class:`Field` objects. For this same
reason, there is no reference list of all available metadata keys. Each key
defined in :class:`Field` objects could be used by a different component, and
only those components know about it. You can also define and use any other
:class:`Field` key in your project too, for your own needs. The main goal of
:class:`Field` objects is to provide a way to define all field metadata in one
place. Typically, those components whose behaviour depends on each field use
certain field keys to configure that behaviour. You must refer to their
documentation to see which metadata keys are used by each component.

It's important to note that the :class:`Field` objects used to declare the item
do not stay assigned as class attributes. Instead, they can be accessed through
the :attr:`~scrapy.Item.fields` attribute.

.. autoclass:: scrapy.Field

    The :class:`Field` class is just an alias to the built-in :class:`dict` class and
    doesn't provide any extra functionality or attributes. In other words,
    :class:`Field` objects are plain-old Python dicts. A separate class is used
    to support the :ref:`item declaration syntax <topics-items-declaring>`
    based on class attributes.

.. note:: Field metadata can also be declared for ``dataclass`` and ``attrs``
    items. Please refer to the documentation for `dataclasses.field`_ and
    `attr.ib`_ for additional information.

    .. _dataclasses.field: https://docs.python.org/3/library/dataclasses.html#dataclasses.field
    .. _attr.ib: https://www.attrs.org/en/stable/api-attr.html#attr.ib

Working with Item objects
-------------------------

.. skip: start

Here are some examples of common tasks performed with items, using the
``Product`` item :ref:`declared above  <topics-items-declaring>`. You will
notice the API is very similar to the :class:`dict` API.

Creating items
''''''''''''''

.. code-block:: pycon

    >>> product = Product(name="Desktop PC", price=1000)
    >>> print(product)
    Product(name='Desktop PC', price=1000)

Getting field values
''''''''''''''''''''

.. code-block:: pycon

    >>> product["name"]
    Desktop PC
    >>> product.get("name")
    Desktop PC

    >>> product["price"]
    1000

    >>> product["last_updated"]
    Traceback (most recent call last):
        ...
    KeyError: 'last_updated'

    >>> product.get("last_updated", "not set")
    'not set'

    >>> product["lala"]  # getting unknown field
    Traceback (most recent call last):
        ...
    KeyError: 'lala'

    >>> product.get("lala", "unknown field")
    'unknown field'

    >>> "name" in product  # is name field populated?
    True

    >>> "last_updated" in product  # is last_updated populated?
    False

    >>> "last_updated" in product.fields  # is last_updated a declared field?
    True

    >>> "lala" in product.fields  # is lala a declared field?
    False

Setting field values
''''''''''''''''''''

.. code-block:: pycon

    >>> product["last_updated"] = "today"
    >>> product["last_updated"]
    'today'

    >>> product["lala"] = "test"  # setting unknown field
    Traceback (most recent call last):
        ...
    KeyError: 'Product does not support field: lala'

Accessing all populated values
''''''''''''''''''''''''''''''

To access all populated values, just use the typical :class:`dict` API:

.. code-block:: pycon

    >>> product.keys()
    ['price', 'name']

    >>> product.items()
    [('price', 1000), ('name', 'Desktop PC')]

.. _copying-items:

Copying items
'''''''''''''

To copy an item, you must first decide whether you want a shallow copy or a
deep copy.

If your item contains :term:`mutable` values like lists or dictionaries,
a shallow copy will keep references to the same mutable values across all
different copies.

For example, if you have an item with a list of tags, and you create a shallow
copy of that item, both the original item and the copy have the same list of
tags. Adding a tag to the list of one of the items will add the tag to the
other item as well.

If that is not the desired behavior, use a deep copy instead.

See :mod:`copy` for more information.

To create a shallow copy of an item, you can either call
:meth:`~scrapy.Item.copy` on an existing item
(``product2 = product.copy()``) or instantiate your item class from an existing
item (``product2 = Product(product)``).

To create a deep copy, call :meth:`~scrapy.Item.deepcopy` instead
(``product2 = product.deepcopy()``).
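
For example, continuing with the ``Product`` item declared above (``tags`` is
one of its declared fields), a shallow copy shares mutable values while a deep
copy does not:

.. code-block:: pycon

    >>> product["tags"] = ["laptop"]
    >>> product2 = product.copy()  # shallow copy
    >>> product2["tags"].append("refurbished")
    >>> product["tags"]  # the original item sees the change
    ['laptop', 'refurbished']
    >>> product3 = product.deepcopy()  # deep copy
    >>> product3["tags"].append("sale")
    >>> product["tags"]  # the original item is unaffected this time
    ['laptop', 'refurbished']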

Other common tasks
''''''''''''''''''

Creating dicts from items:

.. code-block:: pycon

    >>> dict(product)  # create a dict from all populated values
    {'price': 1000, 'name': 'Desktop PC'}

Creating items from dicts:

.. code-block:: pycon

    >>> Product({"name": "Laptop PC", "price": 1500})
    Product(price=1500, name='Laptop PC')

    >>> Product({"name": "Laptop PC", "lala": 1500})  # warning: unknown field in dict
    Traceback (most recent call last):
        ...
    KeyError: 'Product does not support field: lala'

Extending Item subclasses
-------------------------

You can extend Items (to add more fields or to change some metadata for some
fields) by declaring a subclass of your original Item.

For example:

.. code-block:: python

    class DiscountedProduct(Product):
        discount_percent = scrapy.Field(serializer=str)
        discount_expiration_date = scrapy.Field()

You can also extend field metadata by using the previous field metadata and
appending more values, or changing existing values, like this:

.. code-block:: python

    class SpecificProduct(Product):
        name = scrapy.Field(Product.fields["name"], serializer=my_serializer)

That adds (or replaces) the ``serializer`` metadata key for the ``name`` field,
keeping all the previously existing metadata values.

.. skip: end

.. _supporting-item-types:

Supporting All Item Types
=========================

In code that receives an item, such as methods of :ref:`item pipelines
<topics-item-pipeline>` or :ref:`spider middlewares
<topics-spider-middleware>`, it is a good practice to use the
:class:`~itemadapter.ItemAdapter` class to write code that works for any
supported item type.
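
For example, a pipeline written with :class:`~itemadapter.ItemAdapter` works
the same for dictionaries, :class:`~scrapy.Item` objects, dataclass items and
attrs items (a minimal sketch; the ``name`` field is a placeholder):

.. code-block:: python

    from itemadapter import ItemAdapter

    class StripNamePipeline:
        def process_item(self, item):
            # ItemAdapter exposes a common dict-like interface for all
            # supported item types.
            adapter = ItemAdapter(item)
            if adapter.get("name"):
                adapter["name"] = adapter["name"].strip()
            return item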

Other classes related to items
==============================

.. autoclass:: ItemMeta


.. _topics-jobs:

=================================
Jobs: pausing and resuming crawls
=================================

Sometimes, for big sites, it's desirable to pause crawls and be able to resume
them later.

Scrapy supports this functionality out of the box by providing the following
facilities:

* a scheduler that persists scheduled requests on disk

* a duplicates filter that persists visited requests on disk

* an extension that keeps some spider state (key/value pairs) persistent
  between batches

.. _job-dir:

Job directory
=============

To enable persistence support, define a *job directory* through the
:setting:`JOBDIR` setting.

The job directory will store all required data to keep the state of a *single*
job (i.e. a spider run), so that if stopped cleanly, it can be resumed later.

.. warning:: This directory must *not* be shared by different spiders, or even
    different jobs of the same spider.

.. warning:: Treat the job directory with the same security care as your
    Scrapy project source code. Do not point ``JOBDIR`` to a path that
    untrusted parties can write to.

See also :ref:`job-dir-contents`.

How to use it
=============

To start a spider with persistence support enabled, run it like this::

    scrapy crawl somespider -s JOBDIR=crawls/somespider-1

Then, you can stop the spider safely at any time (by pressing Ctrl-C or sending
a signal), and resume it later by issuing the same command::

    scrapy crawl somespider -s JOBDIR=crawls/somespider-1

.. _topics-keeping-persistent-state-between-batches:

Keeping persistent state between batches
========================================

Sometimes you'll want to keep some persistent spider state between pause/resume
batches. You can use the ``spider.state`` attribute for that, which should be a
dict. There's :ref:`a built-in extension <topics-extensions-ref-spiderstate>`
that takes care of serializing, storing and loading that attribute from the job
directory, when the spider starts and stops.

Here's an example of a callback that uses the spider state (other spider code
is omitted for brevity):

.. code-block:: python

    def parse_item(self, response):
        # parse item here
        self.state["items_count"] = self.state.get("items_count", 0) + 1

Persistence gotchas
===================

There are a few things to keep in mind if you want to be able to use the Scrapy
persistence support:

Pause limitations
-----------------

Job pausing and resuming is only supported when the spider is paused by
stopping it cleanly. Forced, sudden or otherwise unclean shutdown can lead to
data corruption in the job directory, which may prevent the spider from
resuming correctly.

Cookies expiration
------------------

Cookies may expire. So, if you don't resume your spider quickly the requests
scheduled may no longer work. This won't be an issue if your spider doesn't rely
on cookies.

.. _request-serialization:

Request serialization
---------------------

For persistence to work, :class:`~scrapy.Request` objects must be
serializable with :mod:`pickle`, except for the ``callback`` and ``errback``
values passed to their ``__init__`` method, which must be methods of the
running :class:`~scrapy.Spider` class.

If you wish to log the requests that couldn't be serialized, you can set the
:setting:`SCHEDULER_DEBUG` setting to ``True`` in your project's settings file.
It is ``False`` by default.

.. _job-dir-contents:

Job directory contents
======================

The contents of a job directory depend on the components used during the job.
Components known to write in the job directory include the :ref:`scheduler
<topics-scheduler>` and the :class:`~scrapy.extensions.spiderstate.SpiderState`
extension. See the reference documentation of the corresponding components for
details.

For example, with default settings, the job directory may look like this:

.. code-block:: none

    ├── requests.queue
    |   ├── active.json
    |   └── {hostname}-{hash}
    |       └── {priority}{s?}
    |           ├── q{00000}
    |           └── info.json
    ├── requests.seen
    └── spider.state

Where:

-   :class:`~scrapy.core.scheduler.Scheduler` creates the ``requests.queue/``
    directory and the ``active.json`` file, the latter containing the state
    data returned by :meth:`DownloaderAwarePriorityQueue.close()
    <scrapy.pqueues.DownloaderAwarePriorityQueue.close>` the last time the job
    was paused.

-   :class:`~scrapy.pqueues.DownloaderAwarePriorityQueue` creates the
    ``{hostname}-{hash}`` directories.

-   :class:`~scrapy.pqueues.ScrapyPriorityQueue` creates the ``{priority}{s?}``
    directories.

-   :class:`scrapy.squeues.PickleLifoDiskQueue`, a subclass of
    :class:`queuelib.LifoDiskQueue` that uses :mod:`pickle` to serialize
    :class:`dict` representations of :class:`scrapy.Request` objects, creates
    the ``info.json`` and ``q{00000}`` files.

-   :class:`~scrapy.dupefilters.RFPDupeFilter` creates the ``requests.seen``
    file.

-   :class:`~scrapy.extensions.spiderstate.SpiderState` creates the
    ``spider.state`` file.


.. _topics-leaks:

======================
Debugging memory leaks
======================

In Scrapy, objects such as requests, responses and items have a finite
lifetime: they are created, used for a while, and finally destroyed.

From all those objects, the Request is probably the one with the longest
lifetime, as it stays waiting in the Scheduler queue until it's time to process
it. For more info see :ref:`topics-architecture`.

As these Scrapy objects have a (rather long) lifetime, there is always the risk
of accumulating them in memory without releasing them properly and thus causing
what is known as a "memory leak".

To help debugging memory leaks, Scrapy provides a built-in mechanism for
tracking objects references called :ref:`trackref <topics-leaks-trackrefs>`,
and you can also use a third-party library called :ref:`muppy
<topics-leaks-muppy>` for more advanced memory debugging (see below for more
info). Both mechanisms must be used from the :ref:`Telnet Console
<topics-telnetconsole>`.

Common causes of memory leaks
=============================

It happens quite often (sometimes by accident, sometimes on purpose) that the
Scrapy developer passes objects referenced in Requests (for example, using the
:attr:`~scrapy.Request.cb_kwargs` or :attr:`~scrapy.Request.meta`
attributes or the request callback function), which effectively ties the
lifetime of those referenced objects to the lifetime of the Request. This is,
by far, the most common cause of memory leaks in Scrapy projects, and a quite
difficult one to debug for newcomers.

In big projects, the spiders are typically written by different people and some
of those spiders could be "leaking" and thus affecting the other
(well-written) spiders when they run concurrently, which, in turn,
affects the whole crawling process.

The leak could also come from a custom middleware, pipeline or extension that
you have written, if you are not releasing the (previously allocated) resources
properly. For example, allocating resources on :signal:`spider_opened`
but not releasing them on :signal:`spider_closed` may cause problems if
you're running :ref:`multiple spiders per process <run-multiple-spiders>`.

Too Many Requests?
------------------

By default Scrapy keeps the request queue in memory; it includes
:class:`~scrapy.Request` objects and all objects
referenced in Request attributes (e.g. in :attr:`~scrapy.Request.cb_kwargs`
and :attr:`~scrapy.Request.meta`).
While not necessarily a leak, this can take a lot of memory. Enabling
a :ref:`persistent job queue <topics-jobs>` can help keep memory usage
under control.

.. _topics-leaks-trackrefs:

Debugging memory leaks with ``trackref``
========================================

.. skip: start

:mod:`trackref` is a module provided by Scrapy to debug the most common cases of
memory leaks. It basically tracks the references to all live Request,
Response, Item, Spider and Selector objects.

You can enter the telnet console and inspect how many objects (of the classes
mentioned above) are currently alive using the ``prefs()`` function, which is
an alias to the :func:`~scrapy.utils.trackref.print_live_refs` function::

    telnet localhost 6023

.. code-block:: pycon

    >>> prefs()
    Live References

    ExampleSpider                       1   oldest: 15s ago
    HtmlResponse                       10   oldest: 1s ago
    Selector                            2   oldest: 0s ago
    FormRequest                       878   oldest: 7s ago

As you can see, that report also shows the "age" of the oldest object in each
class. If you're running multiple spiders per process, chances are you can
figure out which spider is leaking by looking at the oldest request or response.
You can get the oldest object of each class using the
:func:`~scrapy.utils.trackref.get_oldest` function (from the telnet console).

Which objects are tracked?
--------------------------

The objects tracked by ``trackref`` are all from these classes (and all their
subclasses):

* :class:`scrapy.Request`
* :class:`scrapy.http.Response`
* :class:`scrapy.Item`
* :class:`scrapy.Selector`
* :class:`scrapy.Spider`

A real example
--------------

Let's see a concrete example of a hypothetical case of memory leaks.
Suppose we have some spider with a line similar to this one::

    return Request(
        f"http://www.somenastyspider.com/product.php?pid={product_id}",
        callback=self.parse,
        cb_kwargs={"referer": response},
    )

That line is passing a response reference inside a request, which effectively
ties the response's lifetime to the request's, and that would definitely cause
memory leaks.

Let's see how we can discover the cause (without knowing it
a priori, of course) by using the ``trackref`` tool.

After the crawler is running for a few minutes and we notice its memory usage
has grown a lot, we can enter its telnet console and check the live
references:

.. code-block:: pycon

    >>> prefs()
    Live References

    SomenastySpider                     1   oldest: 15s ago
    HtmlResponse                     3890   oldest: 265s ago
    Selector                            2   oldest: 0s ago
    Request                          3878   oldest: 250s ago

The fact that there are so many live responses (and that they're so old) is
definitely suspicious, as responses should have a relatively short lifetime
compared to Requests. The number of responses is similar to the number
of requests, so it looks like they are tied in some way. We can now go
and check the code of the spider to discover the nasty line that is
generating the leaks (passing response references inside requests).

Sometimes extra information about live objects can be helpful.
Let's check the oldest response:

.. code-block:: pycon

    >>> from scrapy.utils.trackref import get_oldest
    >>> r = get_oldest("HtmlResponse")
    >>> r.url
    'http://www.somenastyspider.com/product.php?pid=123'

If you want to iterate over all objects, instead of getting the oldest one, you
can use the :func:`scrapy.utils.trackref.iter_all` function:

.. code-block:: pycon

    >>> from scrapy.utils.trackref import iter_all
    >>> [r.url for r in iter_all("HtmlResponse")]
    ['http://www.somenastyspider.com/product.php?pid=123',
    'http://www.somenastyspider.com/product.php?pid=584',
    ...]

Too many spiders?
-----------------

If your project has too many spiders executed in parallel,
the output of :func:`prefs` can be difficult to read.
For this reason, that function has an ``ignore`` argument which can be used to
ignore a particular class (and all its subclasses). For
example, this won't show any live references to spiders:

.. code-block:: pycon

    >>> from scrapy.spiders import Spider
    >>> prefs(ignore=Spider)

.. module:: scrapy.utils.trackref
   :synopsis: Track references of live objects

scrapy.utils.trackref module
----------------------------

Here are the functions available in the :mod:`~scrapy.utils.trackref` module.

.. class:: object_ref

    Inherit from this class if you want to track live
    instances with the ``trackref`` module.

.. function:: print_live_refs(class_name, ignore=NoneType)

    Print a report of live references, grouped by class name.

    :param ignore: if given, all objects from the specified class (or tuple of
        classes) will be ignored.
    :type ignore: type or tuple

.. function:: get_oldest(class_name)

    Return the oldest object alive with the given class name, or ``None`` if
    none is found. Use :func:`print_live_refs` first to get a list of all
    tracked live objects per class name.

.. function:: iter_all(class_name)

    Return an iterator over all objects alive with the given class name, or
    ``None`` if none is found. Use :func:`print_live_refs` first to get a list
    of all tracked live objects per class name.

.. skip: end

.. _topics-leaks-muppy:

Debugging memory leaks with muppy
=================================

``trackref`` provides a very convenient mechanism for tracking down memory
leaks, but it only keeps track of the objects that are more likely to cause
memory leaks. However, there are other cases where the memory leaks could come
from other (more or less obscure) objects. If this is your case, and you can't
find your leaks using ``trackref``, you still have another resource: the muppy
library.

You can use muppy from `Pympler`_.

.. _Pympler: https://pypi.org/project/Pympler/

If you use ``pip``, you can install muppy with the following command::

    pip install Pympler

Here's an example to view all Python objects available in
the heap using muppy:

.. skip: start
.. code-block:: pycon

    >>> from pympler import muppy
    >>> all_objects = muppy.get_objects()
    >>> len(all_objects)
    28667
    >>> from pympler import summary
    >>> suml = summary.summarize(all_objects)
    >>> summary.print_(suml)
                                   types |   # objects |   total size
    ==================================== | =========== | ============
                             <class 'str |        9822 |      1.10 MB
                            <class 'dict |        1658 |    856.62 KB
                            <class 'type |         436 |    443.60 KB
                            <class 'code |        2974 |    419.56 KB
              <class '_io.BufferedWriter |           2 |    256.34 KB
                             <class 'set |         420 |    159.88 KB
              <class '_io.BufferedReader |           1 |    128.17 KB
              <class 'wrapper_descriptor |        1130 |     88.28 KB
                           <class 'tuple |        1304 |     86.57 KB
                         <class 'weakref |        1013 |     79.14 KB
      <class 'builtin_function_or_method |         958 |     67.36 KB
               <class 'method_descriptor |         865 |     60.82 KB
                     <class 'abc.ABCMeta |          62 |     59.96 KB
                            <class 'list |         446 |     58.52 KB
                             <class 'int |        1425 |     43.20 KB

.. skip: end

For more info about muppy, refer to the `muppy documentation`_.

.. _muppy documentation: https://pythonhosted.org/Pympler/muppy.html

.. _topics-leaks-without-leaks:

Leaks without leaks
===================

Sometimes, you may notice that the memory usage of your Scrapy process will
only increase, but never decrease. Unfortunately, this could happen even
though neither Scrapy nor your project are leaking memory. This is due to a
(not so well) known problem of Python, which may not return released memory to
the operating system in some cases. For more information on this issue see:

* `Python Memory Management <https://www.evanjones.ca/python-memory.html>`_
* `Python Memory Management Part 2 <https://www.evanjones.ca/python-memory-part2.html>`_
* `Python Memory Management Part 3 <https://www.evanjones.ca/python-memory-part3.html>`_

The improvements proposed by Evan Jones, which are detailed in `this paper`_,
were merged into Python 2.5, but this only reduces the problem; it doesn't fix
it completely. To quote the paper:

    *Unfortunately, this patch can only free an arena if there are no more
    objects allocated in it anymore. This means that fragmentation is a large
    issue. An application could have many megabytes of free memory, scattered
    throughout all the arenas, but it will be unable to free any of it. This is
    a problem experienced by all memory allocators. The only way to solve it is
    to move to a compacting garbage collector, which is able to move objects in
    memory. This would require significant changes to the Python interpreter.*

.. _this paper: https://www.evanjones.ca/memoryallocator/

To keep memory consumption reasonable you can split the job into several
smaller jobs or enable :ref:`persistent job queue <topics-jobs>`
and stop/start the spider from time to time.


.. _topics-link-extractors:

===============
Link Extractors
===============

A link extractor is an object that extracts links from responses.

The ``__init__`` method of
:class:`~scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor` takes settings that
determine which links may be extracted. :class:`LxmlLinkExtractor.extract_links
<scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor.extract_links>` returns a
list of matching :class:`~scrapy.link.Link` objects from a
:class:`~scrapy.http.Response` object.

Link extractors are used in :class:`~scrapy.spiders.CrawlSpider` spiders
through a set of :class:`~scrapy.spiders.Rule` objects.

You can also use link extractors in regular spiders. For example, you can
assign a :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
instance to a class variable in your spider and use it from your spider
callbacks:

.. code-block:: python

    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            yield Request(link.url, callback=self.parse)
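
For context, a complete spider using this pattern could look like the
following (a minimal sketch; the spider name and start URL are placeholders):

.. code-block:: python

    import scrapy
    from scrapy.linkextractors import LinkExtractor

    class FollowAllSpider(scrapy.Spider):
        name = "followall"
        start_urls = ["https://example.com"]
        # Instantiated once, as a class variable.
        link_extractor = LinkExtractor()

        def parse(self, response):
            for link in self.link_extractor.extract_links(response):
                yield scrapy.Request(link.url, callback=self.parse)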

.. _topics-link-extractors-ref:

Link extractor reference
========================

.. module:: scrapy.linkextractors
   :synopsis: Link extractors classes

The link extractor class is
:class:`scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor`. For convenience it
can also be imported as ``scrapy.linkextractors.LinkExtractor``::

    from scrapy.linkextractors import LinkExtractor

LxmlLinkExtractor
-----------------

.. module:: scrapy.linkextractors.lxmlhtml
   :synopsis: lxml's HTMLParser-based link extractors

.. class:: LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href',), canonicalize=False, unique=True, process_value=None, strip=True)

    LxmlLinkExtractor is the recommended link extractor with handy filtering
    options. It is implemented using lxml's robust HTMLParser.

    :param allow: a single regular expression (or list of regular expressions)
        that the (absolute) urls must match in order to be extracted. If not
        given (or empty), it will match all links.
    :type allow: str or list

    :param deny: a single regular expression (or list of regular expressions)
        that the (absolute) urls must match in order to be excluded (i.e. not
        extracted). It has precedence over the ``allow`` parameter. If not
        given (or empty) it won't exclude any links.
    :type deny: str or list

    :param allow_domains: a single value or a list of strings containing
        domains which will be considered for extracting the links
    :type allow_domains: str or list

    :param deny_domains: a single value or a list of strings containing
        domains which won't be considered for extracting the links
    :type deny_domains: str or list

    :param deny_extensions: a single value or list of strings containing
        extensions that should be ignored when extracting links.
        If not given, it will default to
        :data:`scrapy.linkextractors.IGNORED_EXTENSIONS`.

    :type deny_extensions: list

    :param restrict_xpaths: an XPath (or list of XPaths) which defines
        regions inside the response where links should be extracted from.
        If given, only the text selected by those XPaths will be scanned for
        links.
    :type restrict_xpaths: str or list

    :param restrict_css: a CSS selector (or list of selectors) which defines
        regions inside the response where links should be extracted from.
        Has the same behaviour as ``restrict_xpaths``.
    :type restrict_css: str or list

    :param restrict_text: a single regular expression (or list of regular expressions)
        that the link's text must match in order to be extracted. If not
        given (or empty), it will match all links. If a list of regular expressions is
        given, the link will be extracted if it matches at least one.
    :type restrict_text: str or list

    :param tags: a tag or a list of tags to consider when extracting links.
        Defaults to ``('a', 'area')``.
    :type tags: str or list

    :param attrs: an attribute or list of attributes which should be considered when looking
        for links to extract (only for those tags specified in the ``tags``
        parameter). Defaults to ``('href',)``
    :type attrs: list

    :param canonicalize: canonicalize each extracted url (using
        w3lib.url.canonicalize_url). Defaults to ``False``.
        Note that canonicalize_url is meant for duplicate checking;
        it can change the URL visible at server side, so the response can be
        different for requests with canonicalized and raw URLs. If you're
        using LinkExtractor to follow links it is more robust to
        keep the default ``canonicalize=False``.
    :type canonicalize: bool

    :param unique: whether duplicate filtering should be applied to extracted
        links.
    :type unique: bool

    :param process_value: a function which receives each value extracted from
        the tag and attributes scanned and can modify the value and return a
        new one, or return ``None`` to ignore the link altogether. If not
        given, ``process_value`` defaults to ``lambda x: x``.

        .. highlight:: html

        For example, to extract links from this code::

            <a href="javascript:goToPage('../other/page.html'); return false">Link text</a>

        .. highlight:: python

        You can use the following function in ``process_value``:

        .. code-block:: python

            def process_value(value):
                m = re.search(r"javascript:goToPage\('(.*?)'", value)
                if m:
                    return m.group(1)

    :type process_value: collections.abc.Callable

    :param strip: whether to strip whitespaces from extracted attributes.
        According to HTML5 standard, leading and trailing whitespaces
        must be stripped from ``href`` attributes of ``<a>``, ``<area>``
        and many other elements, ``src`` attribute of ``<img>``, ``<iframe>``
        elements, etc., so LinkExtractor strips space chars by default.
        Set ``strip=False`` to turn it off (e.g. if you're extracting urls
        from elements or attributes which allow leading/trailing whitespaces).
    :type strip: bool

    .. automethod:: extract_links

Link
----

.. module:: scrapy.link
   :synopsis: Link from link extractors

.. autoclass:: Link


.. _topics-loaders:

============
Item Loaders
============

.. module:: scrapy.loader
   :synopsis: Item Loader class

Item Loaders provide a convenient mechanism for populating scraped :ref:`items
<topics-items>`. Even though items can be populated directly, Item Loaders provide a
much more convenient API for populating them from a scraping process, by automating
some common tasks like parsing the raw extracted data before assigning it.

In other words, :ref:`items <topics-items>` provide the *container* of
scraped data, while Item Loaders provide the mechanism for *populating* that
container.

Item Loaders are designed to provide a flexible, efficient and easy mechanism
for extending and overriding different field parsing rules, either by spider,
or by source format (HTML, XML, etc) without becoming a nightmare to maintain.

.. note:: Item Loaders are an extension of the itemloaders_ library that makes it
    easier to work with Scrapy by adding support for
    :ref:`responses <topics-request-response>`.

Using Item Loaders to populate items
====================================

To use an Item Loader, you must first instantiate it. You can either
instantiate it with an :ref:`item object <topics-items>` or without one, in which
case an :ref:`item object <topics-items>` is automatically created in the
Item Loader ``__init__`` method using the :ref:`item <topics-items>` class
specified in the :attr:`ItemLoader.default_item_class` attribute.

Then, you start collecting values into the Item Loader, typically using
:ref:`Selectors <topics-selectors>`. You can add more than one value to
the same item field; the Item Loader will know how to "join" those values later
using a proper processing function.

.. note:: Collected data is internally stored as lists,
   which allows adding several values to the same field.
   If an ``item`` argument is passed when creating a loader,
   each of the item's values will be stored as-is if it's already
   an iterable, or wrapped in a list if it's a single value.

Here is a typical Item Loader usage in a :ref:`Spider <topics-spiders>`, using
the :ref:`Product item <topics-items-declaring>` declared in the :ref:`Items
chapter <topics-items>`:

.. skip: next
.. code-block:: python

    from scrapy.loader import ItemLoader
    from myproject.items import Product

    def parse(self, response):
        l = ItemLoader(item=Product(), response=response)
        l.add_xpath("name", '//div[@class="product_name"]')
        l.add_xpath("name", '//div[@class="product_title"]')
        l.add_xpath("price", '//p[@id="price"]')
        l.add_css("stock", "p#stock")
        l.add_value("last_updated", "today")  # you can also use literal values
        return l.load_item()

By quickly looking at that code, we can see the ``name`` field is being
extracted from two different XPath locations in the page:

1. ``//div[@class="product_name"]``
2. ``//div[@class="product_title"]``

In other words, data is being collected by extracting it from two XPath
locations, using the :meth:`~ItemLoader.add_xpath` method. This is the
data that will be assigned to the ``name`` field later.

Afterwards, similar calls are used for ``price`` and ``stock`` fields
(the latter using a CSS selector with the :meth:`~ItemLoader.add_css` method),
and finally the ``last_updated`` field is populated directly with a literal value
(``today``) using a different method: :meth:`~ItemLoader.add_value`.

Finally, when all data is collected, the :meth:`ItemLoader.load_item` method is
called which actually returns the item populated with the data
previously extracted and collected with the :meth:`~ItemLoader.add_xpath`,
:meth:`~ItemLoader.add_css`, and :meth:`~ItemLoader.add_value` calls.

.. _topics-loaders-dataclass:

Working with dataclass items
============================

By default, :ref:`dataclass items <dataclass-items>` require all fields to be
passed when created. This could be an issue when using dataclass items with
item loaders: unless a pre-populated item is passed to the loader, fields
will be populated incrementally using the loader's :meth:`~ItemLoader.add_xpath`,
:meth:`~ItemLoader.add_css` and :meth:`~ItemLoader.add_value` methods.

One approach to overcome this is to define items using the
:func:`~dataclasses.field` function, with a ``default`` argument:

.. code-block:: python

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class InventoryItem:
        name: Optional[str] = field(default=None)
        price: Optional[float] = field(default=None)
        stock: Optional[int] = field(default=None)
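
With such defaults in place, an Item Loader can create and populate the item
even if only some fields are collected. Here is a minimal sketch;
``TakeFirst`` is used as the output processor so that single values are not
left wrapped in lists:

.. skip: next
.. code-block:: python

    from itemloaders.processors import TakeFirst
    from scrapy.loader import ItemLoader

    class InventoryLoader(ItemLoader):
        default_item_class = InventoryItem
        default_output_processor = TakeFirst()

    loader = InventoryLoader()
    loader.add_value("name", "Plasma TV")
    loader.add_value("stock", 16)
    item = loader.load_item()  # InventoryItem(name='Plasma TV', price=None, stock=16)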

.. _topics-loaders-processors:

Input and Output processors
===========================

An Item Loader contains one input processor and one output processor for each
(item) field. The input processor processes the extracted data as soon as it's
received (through the :meth:`~ItemLoader.add_xpath`, :meth:`~ItemLoader.add_css` or
:meth:`~ItemLoader.add_value` methods) and the result of the input processor is
collected and kept inside the ItemLoader. After collecting all data, the
:meth:`ItemLoader.load_item` method is called to populate and return the
populated :ref:`item object <topics-items>`. That's when the output processor
is called with the data previously collected (and processed using the input
processor). The result of the output processor is the final value that gets
assigned to the item.

Let's see an example to illustrate how the input and output processors are
called for a particular field (the same applies for any other field):

.. skip: next
.. code-block:: python

    l = ItemLoader(Product(), some_selector)
    l.add_xpath("name", xpath1)  # (1)
    l.add_xpath("name", xpath2)  # (2)
    l.add_css("name", css)  # (3)
    l.add_value("name", "test")  # (4)
    return l.load_item()  # (5)

So what happens is:

1. Data from ``xpath1`` is extracted, and passed through the *input processor* of
   the ``name`` field. The result of the input processor is collected and kept in
   the Item Loader (but not yet assigned to the item).

2. Data from ``xpath2`` is extracted, and passed through the same *input
   processor* used in (1). The result of the input processor is appended to the
   data collected in (1) (if any).

3. This case is similar to the previous ones, except that the data is extracted
   from the ``css`` CSS selector, and passed through the same *input
   processor* used in (1) and (2). The result of the input processor is appended to the
   data collected in (1) and (2) (if any).

4. This case is also similar to the previous ones, except that the value to be
   collected is assigned directly, instead of being extracted from a XPath
   expression or a CSS selector.
   However, the value is still passed through the input processors. In this
   case, since the value is not iterable, it is converted to an iterable of a
   single element before passing it to the input processor, because input
   processors always receive iterables.

5. The data collected in steps (1), (2), (3) and (4) is passed through
   the *output processor* of the ``name`` field.
   The result of the output processor is the value assigned to the ``name``
   field in the item.

It's worth noticing that processors are just callable objects, which are called
with the data to be parsed, and return a parsed value. So you can use any
function as input or output processor. The only requirement is that they must
accept one (and only one) positional argument, which will be an iterable.
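
For example, here is a sketch of two plain functions used as processors for a
hypothetical ``name`` field. Note that the functions declared as Item Loader
class attributes are wrapped with ``staticmethod()`` so that Python does not
bind them as methods:

.. code-block:: python

    from scrapy.loader import ItemLoader

    def clean_input(values):
        # input processor: returns the processed values to be collected
        return [v.strip() for v in values]

    def join_output(values):
        # output processor: returns the final value assigned to the field
        return " ".join(values)

    class ExampleLoader(ItemLoader):
        name_in = staticmethod(clean_input)
        name_out = staticmethod(join_output)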

.. note:: Both input and output processors must receive an iterable as their
   first argument. The output of those functions can be anything. The result of
   input processors will be appended to an internal list (in the Loader)
   containing the collected values (for that field). The result of the output
   processors is the value that will be finally assigned to the item.

The other thing you need to keep in mind is that the values returned by input
processors are collected internally (in lists) and then passed to output
processors to populate the fields.

Last, but not least, itemloaders_ comes with some :ref:`commonly used
processors <itemloaders:built-in-processors>` built-in for convenience.

Declaring Item Loaders
======================

Item Loaders are declared using a class definition syntax. Here is an example:

.. code-block:: python

    from itemloaders.processors import TakeFirst, MapCompose, Join
    from scrapy.loader import ItemLoader

    class ProductLoader(ItemLoader):
        default_output_processor = TakeFirst()

        name_in = MapCompose(str.title)
        name_out = Join()

        price_in = MapCompose(str.strip)

        # ...

As you can see, input processors are declared using the ``_in`` suffix while
output processors are declared using the ``_out`` suffix. You can also
declare default input/output processors using the
:attr:`ItemLoader.default_input_processor` and
:attr:`ItemLoader.default_output_processor` attributes.
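
For instance, with the ``ProductLoader`` above, values added to ``name`` are
title-cased on input and joined on output. A sketch, assuming the ``Product``
item from the :ref:`Items chapter <topics-items>`:

.. skip: start
.. code-block:: pycon

    >>> loader = ProductLoader(item=Product())
    >>> loader.add_value("name", ["plasma", "tv"])
    >>> loader.load_item()
    {'name': 'Plasma Tv'}

.. skip: end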

.. _topics-loaders-processors-declaring:

Declaring Input and Output Processors
=====================================

As seen in the previous section, input and output processors can be declared in
the Item Loader definition, and it's very common to declare input processors
this way. However, there is one more place where you can specify the input and
output processors to use: in the :ref:`Item Field <topics-items-fields>`
metadata. Here is an example:

.. code-block:: python

    import scrapy
    from itemloaders.processors import Join, MapCompose, TakeFirst
    from w3lib.html import remove_tags

    def filter_price(value):
        if value.isdigit():
            return value

    class Product(scrapy.Item):
        name = scrapy.Field(
            input_processor=MapCompose(remove_tags),
            output_processor=Join(),
        )
        price = scrapy.Field(
            input_processor=MapCompose(remove_tags, filter_price),
            output_processor=TakeFirst(),
        )

.. skip: start
.. code-block:: pycon

    >>> from scrapy.loader import ItemLoader
    >>> il = ItemLoader(item=Product())
    >>> il.add_value("name", ["Welcome to my", "<strong>website</strong>"])
    >>> il.add_value("price", ["&euro;", "<span>1000</span>"])
    >>> il.load_item()
    {'name': 'Welcome to my website', 'price': '1000'}

.. skip: end

The precedence order, for both input and output processors, is as follows:

1. Item Loader field-specific attributes: ``field_in`` and ``field_out``
   (highest precedence)
2. Field metadata (``input_processor`` and ``output_processor`` keys)
3. Item Loader defaults: :attr:`ItemLoader.default_input_processor` and
   :attr:`ItemLoader.default_output_processor` (lowest precedence)

See also: :ref:`topics-loaders-extending`.

.. _topics-loaders-context:

Item Loader Context
===================

The Item Loader Context is a dict of arbitrary key/values which is shared
among all input and output processors in the Item Loader. It can be passed
when declaring, instantiating or using an Item Loader, and it is used to
modify the behaviour of the input/output processors.

For example, suppose you have a function ``parse_length`` which receives a text
value and extracts a length from it:

.. code-block:: python

    def parse_length(text, loader_context):
        unit = loader_context.get("unit", "m")
        # ... length parsing code goes here ...
        return parsed_length

By accepting a ``loader_context`` argument the function is explicitly telling
the Item Loader that it's able to receive an Item Loader context, so the Item
Loader passes the currently active context when calling it, and the processor
function (``parse_length`` in this case) can thus use it.

.. skip: start

There are several ways to modify Item Loader context values:

1. By modifying the currently active Item Loader context
   (:attr:`~ItemLoader.context` attribute):

   .. code-block:: python

      loader = ItemLoader(product)
      loader.context["unit"] = "cm"

2. On Item Loader instantiation (the keyword arguments of Item Loader
   ``__init__`` method are stored in the Item Loader context):

   .. code-block:: python

      loader = ItemLoader(product, unit="cm")

3. On Item Loader declaration, for those input/output processors that support
   instantiating them with an Item Loader context.
   :class:`~itemloaders.processors.MapCompose` is one of them:

   .. code-block:: python

       class ProductLoader(ItemLoader):
           length_out = MapCompose(parse_length, unit="cm")

.. skip: end

ItemLoader objects
==================

.. autoclass:: scrapy.loader.ItemLoader
    :members:
    :inherited-members:

.. _topics-loaders-nested:

Nested Loaders
==============

When parsing related values from a subsection of a document, it can be
useful to create nested loaders.  Imagine you're extracting details from
a footer of a page that looks something like:

Example::

    <footer>
        <a class="social" href="https://facebook.com/whatever">Like Us</a>
        <a class="social" href="https://twitter.com/whatever">Follow Us</a>
        <a class="email" href="mailto:whatever@example.com">Email Us</a>
    </footer>

Without nested loaders, you need to specify the full xpath (or css) for each value
that you wish to extract.

Example:

.. skip: next
.. code-block:: python

    loader = ItemLoader(item=Item())
    # load stuff not in the footer
    loader.add_xpath("social", '//footer/a[@class = "social"]/@href')
    loader.add_xpath("email", '//footer/a[@class = "email"]/@href')
    loader.load_item()

Instead, you can create a nested loader with the footer selector and add values
relative to the footer.  The functionality is the same but you avoid repeating
the footer selector.

Example:

.. skip: next
.. code-block:: python

    loader = ItemLoader(item=Item())
    # load stuff not in the footer
    footer_loader = loader.nested_xpath("//footer")
    footer_loader.add_xpath("social", 'a[@class = "social"]/@href')
    footer_loader.add_xpath("email", 'a[@class = "email"]/@href')
    # no need to call footer_loader.load_item()
    loader.load_item()

You can nest loaders arbitrarily and they work with either xpath or css selectors.
As a general guideline, use nested loaders when they make your code simpler but do
not go overboard with nesting or your parser can become difficult to read.
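
For example, here is a sketch of the same footer extraction using CSS
selectors and :meth:`~ItemLoader.nested_css` instead:

.. skip: next
.. code-block:: python

    loader = ItemLoader(item=Item())
    footer_loader = loader.nested_css("footer")
    footer_loader.add_css("social", "a.social::attr(href)")
    footer_loader.add_css("email", "a.email::attr(href)")
    loader.load_item()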

.. _topics-loaders-extending:

Reusing and extending Item Loaders
==================================

As your project grows bigger and acquires more and more spiders, maintenance
becomes a fundamental problem, especially when you have to deal with many
different parsing rules for each spider, with a lot of exceptions, while also
wanting to reuse the common processors.

Item Loaders are designed to ease the maintenance burden of parsing rules,
without losing flexibility and, at the same time, providing a convenient
mechanism for extending and overriding them. For this reason Item Loaders
support traditional Python class inheritance for dealing with differences of
specific spiders (or groups of spiders).

Suppose, for example, that some particular site encloses their product names in
three dashes (e.g. ``---Plasma TV---``) and you don't want to end up scraping
those dashes in the final product names.

Here's how you can remove those dashes by reusing and extending the default
Product Item Loader (``ProductLoader``):

.. skip: next
.. code-block:: python

    from itemloaders.processors import MapCompose
    from myproject.ItemLoaders import ProductLoader

    def strip_dashes(x):
        return x.strip("-")

    class SiteSpecificLoader(ProductLoader):
        name_in = MapCompose(strip_dashes, ProductLoader.name_in)

Another case where extending Item Loaders can be very helpful is when you have
multiple source formats, for example XML and HTML. In the XML version you may
want to remove ``CDATA`` occurrences. Here's an example of how to do it:

.. skip: next
.. code-block:: python

    from itemloaders.processors import MapCompose
    from myproject.ItemLoaders import ProductLoader
    from myproject.utils.xml import remove_cdata

    class XmlProductLoader(ProductLoader):
        name_in = MapCompose(remove_cdata, ProductLoader.name_in)

And that's how you typically extend input processors.

As for output processors, it is more common to declare them in the field metadata,
as they usually depend only on the field and not on each specific site parsing
rule (as input processors do). See also:
:ref:`topics-loaders-processors-declaring`.

There are many other possible ways to extend, inherit and override your Item
Loaders, and different Item Loaders hierarchies may fit better for different
projects. Scrapy only provides the mechanism; it doesn't impose any specific
organization of your Loaders collection - that's up to you and your project's
needs.

.. _itemloaders: https://itemloaders.readthedocs.io/en/latest/


.. _topics-logging:

=======
Logging
=======

.. note::
    :mod:`scrapy.log` has been deprecated alongside its functions in favor of
    explicit calls to the Python standard logging. Keep reading to learn more
    about the new logging system.

Scrapy uses :mod:`logging` for event logging. We'll
provide some simple examples to get you started, but for more advanced
use cases it's strongly suggested to read its documentation thoroughly.

Logging works out of the box, and can be configured to some extent with the
Scrapy settings listed in :ref:`topics-logging-settings`.

Scrapy calls :func:`scrapy.utils.log.configure_logging` to set some reasonable
defaults and handle those settings in :ref:`topics-logging-settings` when
running commands, so it's recommended to manually call it if you're running
Scrapy from scripts as described in :ref:`run-from-script`.

.. _topics-logging-levels:

Log levels
==========

Python's built-in logging defines 5 different levels to indicate the severity of a
given log message. Here are the standard ones, listed in decreasing order of severity:

1. ``logging.CRITICAL`` - for critical errors (highest severity)
2. ``logging.ERROR`` - for regular errors
3. ``logging.WARNING`` - for warning messages
4. ``logging.INFO`` - for informational messages
5. ``logging.DEBUG`` - for debugging messages (lowest severity)

How to log messages
===================

Here's a quick example of how to log a message using the ``logging.WARNING``
level:

.. code-block:: python

    import logging

    logging.warning("This is a warning")

There are shortcuts for issuing log messages on any of the 5 standard levels,
and there's also a general ``logging.log`` method which takes a given level as
argument. If needed, the last example could be rewritten as:

.. code-block:: python

    import logging

    logging.log(logging.WARNING, "This is a warning")

On top of that, you can create different "loggers" to encapsulate messages
(for example, a common practice is to create a different logger for every
module). These loggers can be configured independently, and they allow
hierarchical constructions.

The previous examples use the root logger behind the scenes, which is a top-level
logger to which all messages are propagated (unless otherwise specified). Using
``logging`` helpers is merely a shortcut for getting the root logger
explicitly, so the following is also an equivalent of the last snippets:

.. code-block:: python

    import logging

    logger = logging.getLogger()
    logger.warning("This is a warning")

You can use a different logger just by retrieving it by name with the
``logging.getLogger`` function:

.. code-block:: python

    import logging

    logger = logging.getLogger("mycustomlogger")
    logger.warning("This is a warning")

Finally, you can ensure you have a custom logger for any module you're working
on by using the ``__name__`` variable, which is populated with the current
module's path:

.. code-block:: python

    import logging

    logger = logging.getLogger(__name__)
    logger.warning("This is a warning")

.. seealso::

    Module logging, :doc:`HowTo <howto/logging>`
        Basic Logging Tutorial

    Module logging, :ref:`Loggers <logger>`
        Further documentation on loggers

.. _topics-logging-from-spiders:

Logging from Spiders
====================

Scrapy provides a :data:`~scrapy.Spider.logger` within each Spider
instance, which can be accessed and used like this:

.. code-block:: python

    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = ["https://scrapy.org"]

        def parse(self, response):
            self.logger.info("Parse function called on %s", response.url)

That logger is created using the Spider's name, but you can use any custom
Python logger you want. For example:

.. code-block:: python

    import logging
    import scrapy

    logger = logging.getLogger("mycustomlogger")

    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = ["https://scrapy.org"]

        def parse(self, response):
            logger.info("Parse function called on %s", response.url)

.. _topics-logging-configuration:

Logging configuration
=====================

Loggers on their own don't manage how messages sent through them are displayed.
For this task, different "handlers" can be attached to any logger instance and
they will redirect those messages to appropriate destinations, such as the
standard output, files, emails, etc.

By default, Scrapy sets and configures a handler for the root logger, based on
the settings below.

.. _topics-logging-settings:

Logging settings
----------------

These settings can be used to configure the logging:

* :setting:`LOG_FILE`
* :setting:`LOG_FILE_APPEND`
* :setting:`LOG_ENABLED`
* :setting:`LOG_ENCODING`
* :setting:`LOG_LEVEL`
* :setting:`LOG_FORMAT`
* :setting:`LOG_DATEFORMAT`
* :setting:`LOG_STDOUT`
* :setting:`LOG_SHORT_NAMES`

The first couple of settings define a destination for log messages. If
:setting:`LOG_FILE` is set, messages sent through the root logger will be
redirected to a file named :setting:`LOG_FILE` with encoding
:setting:`LOG_ENCODING`. If unset and :setting:`LOG_ENABLED` is ``True``, log
messages will be displayed on the standard error. If :setting:`LOG_FILE` is set
and :setting:`LOG_FILE_APPEND` is ``False``, the file will be overwritten
(discarding the output from previous runs, if any). Lastly, if
:setting:`LOG_ENABLED` is ``False``, there won't be any visible log output.

:setting:`LOG_LEVEL` determines the minimum level of severity to display;
messages with lower severity will be filtered out. It ranges through the
possible levels listed in :ref:`topics-logging-levels`.

:setting:`LOG_FORMAT` and :setting:`LOG_DATEFORMAT` specify formatting strings
used as layouts for all messages. Those strings can contain any placeholders
listed in :ref:`logging's logrecord attributes docs <logrecord-attributes>` and
:ref:`datetime's strftime and strptime directives <strftime-strptime-behavior>`
respectively.

If :setting:`LOG_SHORT_NAMES` is set, then the logs will not display the Scrapy
component that prints the log. It is unset by default, hence logs contain the
Scrapy component responsible for that log output.
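
For example, a project's ``settings.py`` could combine several of these
settings as follows (a sketch; all values are illustrative):

.. code-block:: python

    LOG_LEVEL = "INFO"
    LOG_FILE = "scrapy.log"
    LOG_FILE_APPEND = False  # start a fresh log file on every run
    LOG_FORMAT = "%(asctime)s [%(name)s] %(levelname)s: %(message)s"
    LOG_DATEFORMAT = "%Y-%m-%d %H:%M:%S"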

Command-line options
--------------------

There are command-line arguments, available for all commands, that you can use
to override some of the Scrapy settings regarding logging.

* ``--logfile FILE``
    Overrides :setting:`LOG_FILE`
* ``--loglevel/-L LEVEL``
    Overrides :setting:`LOG_LEVEL`
* ``--nolog``
    Sets :setting:`LOG_ENABLED` to ``False``

.. seealso::

    Module :mod:`logging.handlers`
        Further documentation on available handlers

.. _custom-log-formats:

Custom Log Formats
------------------

A custom log format can be set for different actions by extending
:class:`~scrapy.logformatter.LogFormatter` class and making
:setting:`LOG_FORMATTER` point to your new class.

.. autoclass:: scrapy.logformatter.LogFormatter
   :members:

.. _topics-logging-advanced-customization:

Advanced customization
----------------------

Because Scrapy uses the stdlib :mod:`logging` module, you can customize
logging using all of its features.

For example, let's say you're scraping a website which returns many
HTTP 404 and 500 responses, and you want to hide all messages like this::

    2016-12-16 22:00:06 [scrapy.spidermiddlewares.httperror] INFO: Ignoring
    response <500 https://quotes.toscrape.com/page/1-34/>: HTTP status code
    is not handled or not allowed

The first thing to note is the logger name, shown in brackets:
``[scrapy.spidermiddlewares.httperror]``. If you get just ``[scrapy]`` then
:setting:`LOG_SHORT_NAMES` is likely set to ``True``; set it to ``False`` and
re-run the crawl.

Next, we can see that the message has INFO level. To hide it
we should set the logging level for ``scrapy.spidermiddlewares.httperror``
higher than INFO; the next level after INFO is WARNING. This could be done,
for example, in the spider's ``__init__`` method:

.. code-block:: python

    import logging
    import scrapy

    class MySpider(scrapy.Spider):
        # ...
        def __init__(self, *args, **kwargs):
            logger = logging.getLogger("scrapy.spidermiddlewares.httperror")
            logger.setLevel(logging.WARNING)
            super().__init__(*args, **kwargs)

If you run this spider again then INFO messages from
``scrapy.spidermiddlewares.httperror`` logger will be gone.

You can also filter log records by :class:`~logging.LogRecord` data. For
example, you can filter log records by message content using a substring or
a regular expression. Create a :class:`logging.Filter` subclass
and equip it with a regular expression pattern to
filter out unwanted messages:

.. code-block:: python

    import logging
    import re

    class ContentFilter(logging.Filter):
        def filter(self, record):
            # use record.getMessage(), as record.message is only set once a
            # handler has formatted the record
            if re.search(r"\d{3} [Ee]rror, retrying", record.getMessage()):
                return False  # drop matching records
            return True  # keep everything else

A project-level filter may be attached to the root
handler created by Scrapy; this is a convenient way to
filter all loggers in different parts of the project
(middlewares, spider, etc.):

.. code-block:: python

    import logging
    import scrapy

    class MySpider(scrapy.Spider):
        # ...
        def __init__(self, *args, **kwargs):
            for handler in logging.root.handlers:
                handler.addFilter(ContentFilter())
            super().__init__(*args, **kwargs)

Alternatively, you may choose a specific logger
and hide it without affecting other loggers:

.. code-block:: python

    import logging
    import scrapy

    class MySpider(scrapy.Spider):
        # ...
        def __init__(self, *args, **kwargs):
            logger = logging.getLogger("my_logger")
            logger.addFilter(ContentFilter())

scrapy.utils.log module
=======================

.. module:: scrapy.utils.log
   :synopsis: Logging utils

.. autofunction:: configure_logging

    ``configure_logging`` is automatically called when using Scrapy commands
    or :class:`~scrapy.crawler.CrawlerProcess`, but not when running custom
    scripts using :class:`~scrapy.crawler.CrawlerRunner`. In that case,
    calling it is not required, but it's recommended.

    Another option when running custom scripts is to manually configure the logging.
    To do this you can use :func:`logging.basicConfig` to set a basic root handler.

    Note that :class:`~scrapy.crawler.CrawlerProcess` automatically calls ``configure_logging``,
    so it is recommended to only use :func:`logging.basicConfig` together with
    :class:`~scrapy.crawler.CrawlerRunner`.

    This is an example on how to redirect ``INFO`` or higher messages to a file:

    .. code-block:: python

        import logging

        logging.basicConfig(
            filename="log.txt", format="%(levelname)s: %(message)s", level=logging.INFO
        )

    Refer to :ref:`run-from-script` for more details about using Scrapy this
    way.


.. _topics-media-pipeline:

===========================================
Downloading and processing files and images
===========================================

.. currentmodule:: scrapy.pipelines.images

Scrapy provides reusable :doc:`item pipelines </topics/item-pipeline>` for
downloading files attached to a particular item (for example, when you scrape
products and also want to download their images locally). These pipelines share
a bit of functionality and structure (we refer to them as media pipelines), but
typically you'll either use the Files Pipeline or the Images Pipeline.

Both pipelines implement these features:

* Avoiding re-downloading media that was downloaded recently
* Specifying where to store the media (filesystem directory, FTP server, Amazon S3 bucket,
  Google Cloud Storage bucket)

The Images Pipeline has a few extra functions for processing images:

* Converting all downloaded images to a common format (JPG) and mode (RGB)
* Generating thumbnails
* Checking the images' width/height to make sure they meet a minimum constraint

The pipelines also keep an internal queue of those media URLs which are currently
being scheduled for download, and connect those responses that arrive containing
the same media to that queue. This avoids downloading the same media more than
once when it's shared by several items.

Using the Files Pipeline
========================

The typical workflow, when using the :class:`FilesPipeline`, goes like
this:

1. In a Spider, you scrape an item and put the URLs of the desired files
   into a ``file_urls`` field (see the sketch after this list).

2. The item is returned from the spider and goes to the item pipeline.

3. When the item reaches the :class:`FilesPipeline`, the URLs in the
   ``file_urls`` field are scheduled for download using the standard
   Scrapy scheduler and downloader (which means the scheduler and downloader
   middlewares are reused), but with a higher priority, processing them before other
   pages are scraped. The item remains "locked" at that particular pipeline stage
   until the files have finished downloading (or failed for some reason).

4. When the files are downloaded, another field (``files``) will be populated
   with the results. This field will contain a list of dicts with information
   about the downloaded files, such as the downloaded path, the original
   scraped url (taken from the ``file_urls`` field), the file checksum and the file status.
   The files in the list of the ``files`` field will retain the same order as
   the original ``file_urls`` field. If some file fails to download, an
   error will be logged and the file won't be present in the ``files`` field.
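
For illustration, here is a minimal sketch of a spider that fills
``file_urls`` (the URL and CSS selectors are made up):

.. code-block:: python

    import scrapy

    class FileDownloadSpider(scrapy.Spider):
        name = "filedownload"
        start_urls = ["https://example.com/downloads"]

        def parse(self, response):
            yield {
                "name": response.css("h1::text").get(),
                # the Files Pipeline downloads every URL in this field and
                # stores the results in the "files" field of the item
                "file_urls": [
                    response.urljoin(href)
                    for href in response.css("a.download::attr(href)").getall()
                ],
            }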

.. _images-pipeline:

Using the Images Pipeline
=========================

Using the :class:`ImagesPipeline` is a lot like using the :class:`FilesPipeline`,
except the default field names used are different: you use ``image_urls`` for
the image URLs of an item and it will populate an ``images`` field for the information
about the downloaded images.

The advantage of using the :class:`ImagesPipeline` for image files is that you
can configure some extra functions like generating thumbnails and filtering
the images based on their size.

The Images Pipeline requires Pillow_ 8.3.2 or greater. It is used for
thumbnailing and normalizing images to JPEG/RGB format.

.. _Pillow: https://github.com/python-pillow/Pillow

.. _topics-media-pipeline-enabling:

Enabling your Media Pipeline
============================

.. setting:: IMAGES_STORE
.. setting:: FILES_STORE

To enable your media pipeline you must first add it to your project
:setting:`ITEM_PIPELINES` setting.

For Images Pipeline, use:

.. code-block:: python

    ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}

For Files Pipeline, use:

.. code-block:: python

    ITEM_PIPELINES = {"scrapy.pipelines.files.FilesPipeline": 1}

.. note::
    You can also use both the Files and Images Pipeline at the same time.

Then, configure the target storage setting to a valid value that will be used
for storing the downloaded media. Otherwise the pipeline will remain disabled,
even if you include it in the :setting:`ITEM_PIPELINES` setting.

For the Files Pipeline, set the :setting:`FILES_STORE` setting:

.. code-block:: python

   FILES_STORE = "/path/to/valid/dir"

For the Images Pipeline, set the :setting:`IMAGES_STORE` setting:

.. code-block:: python

   IMAGES_STORE = "/path/to/valid/dir"

.. _topics-file-naming:

File Naming
===========

Default File Naming
-------------------

By default, files are stored using an `SHA-1 hash`_ of their URLs for the file names.

For example, the following image URL::

    http://www.example.com/image.jpg

Whose SHA-1 hash is::

    3afec3b4765f8f0a07b78f98c07b83f013567a0a

Will be downloaded and stored using your chosen :ref:`storage method <topics-supported-storage>` and the following file name::

   3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg

Custom File Naming
-------------------

You may wish to use a different calculated file name for saved files.
For example, classifying an image by including metadata in the file name.

Customize file names by overriding the ``file_path`` method of your
media pipeline.

For example, an image pipeline with image URL::

   http://www.example.com/product/images/large/front/0000000004166

Can be processed into a file name with a condensed hash and the perspective
``front``::

  00b08510e4_front.jpg

By overriding ``file_path`` like this:

.. code-block:: python

    import hashlib

    from scrapy.pipelines.images import ImagesPipeline

    class MyImagesPipeline(ImagesPipeline):
        def file_path(self, request, response=None, info=None, *, item=None):
            image_url_hash = hashlib.shake_256(request.url.encode()).hexdigest(5)
            image_perspective = request.url.split("/")[-2]
            image_filename = f"{image_url_hash}_{image_perspective}.jpg"
            return image_filename

.. warning::
  If your custom file name scheme relies on metadata that can vary between
  scrapes, it may lead to unexpected re-downloading of existing media using
  new file names.

  For example, if your custom file name scheme uses a product title and the
  site changes an item's product title between scrapes, Scrapy will re-download
  the same media using updated file names.

For more information about the ``file_path`` method, see :ref:`topics-media-pipeline-override`.

.. _topics-supported-storage:

Supported Storage
=================

File system storage
-------------------

File system storage will save files to the following path::

   <IMAGES_STORE>/full/<FILE_NAME>

Where:

* ``<IMAGES_STORE>`` is the directory defined in :setting:`IMAGES_STORE` setting
  for the Images Pipeline.

* ``full`` is a sub-directory to separate full images from thumbnails (if
  used). For more info see :ref:`topics-images-thumbnails`.

* ``<FILE_NAME>`` is the file name assigned to the file.  For more info see :ref:`topics-file-naming`.

.. _media-pipeline-ftp:

FTP server storage
------------------

:setting:`FILES_STORE` and :setting:`IMAGES_STORE` can point to an FTP server.
Scrapy will automatically upload the files to the server.

:setting:`FILES_STORE` and :setting:`IMAGES_STORE` should be written in one of the
following forms::

    ftp://username:password@address:port/path
    ftp://address:port/path

If ``username`` and ``password`` are not provided, they are taken from the :setting:`FTP_USER` and
:setting:`FTP_PASSWORD` settings respectively.
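
For example, with the credentials supplied via settings (a sketch; the host
and credentials are illustrative):

.. code-block:: python

    FILES_STORE = "ftp://ftp.example.com:21/scrapy/files"
    FTP_USER = "user"
    FTP_PASSWORD = "secret"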

FTP supports two different connection modes: active or passive. Scrapy uses
the passive connection mode by default. To use the active connection mode instead,
set the :setting:`FEED_STORAGE_FTP_ACTIVE` setting to ``True``.

.. _media-pipelines-s3:

Amazon S3 storage
-----------------

.. setting:: FILES_STORE_S3_ACL
.. setting:: IMAGES_STORE_S3_ACL

If botocore_ >= 1.13.45 is installed, :setting:`FILES_STORE` and
:setting:`IMAGES_STORE` can represent an Amazon S3 bucket. Scrapy will
automatically upload the files to the bucket.

For example, this is a valid :setting:`IMAGES_STORE` value:

.. code-block:: python

    IMAGES_STORE = "s3://bucket/images"

You can modify the Access Control List (ACL) policy used for the stored files,
which is defined by the :setting:`FILES_STORE_S3_ACL` and
:setting:`IMAGES_STORE_S3_ACL` settings. By default, the ACL is set to
``private``. To make the files publicly available use the ``public-read``
policy:

.. code-block:: python

    IMAGES_STORE_S3_ACL = "public-read"

For more information, see `canned ACLs`_ in the Amazon S3 Developer Guide.

You can also use other S3-like storage services, such as self-hosted `Minio`_
or `Zenko CloudServer`_. All you need to do is set the endpoint option in your
Scrapy settings:

.. code-block:: python

    AWS_ENDPOINT_URL = "http://minio.example.com:9000"

For self-hosted storage you might also need to disable SSL or skip SSL certificate verification:

.. code-block:: python

    AWS_USE_SSL = False  # or True (None by default)
    AWS_VERIFY = False  # or True (None by default)

.. _botocore: https://github.com/boto/botocore
.. _canned ACLs: https://docs.aws.amazon.com/AmazonS3/latest/userguide/acl-overview.html#canned-acl
.. _Minio: https://github.com/minio/minio
.. _Zenko CloudServer: https://www.zenko.io/cloudserver/

.. _media-pipeline-gcs:

Google Cloud Storage
---------------------

.. setting:: FILES_STORE_GCS_ACL
.. setting:: IMAGES_STORE_GCS_ACL

:setting:`FILES_STORE` and :setting:`IMAGES_STORE` can represent a Google Cloud Storage
bucket. Scrapy will automatically upload the files to the bucket (requires `google-cloud-storage`_).

.. _google-cloud-storage: https://docs.cloud.google.com/storage/docs/reference/libraries#client-libraries-install-python

For example, these are valid :setting:`IMAGES_STORE` and :setting:`GCS_PROJECT_ID` settings:

.. code-block:: python

    IMAGES_STORE = "gs://bucket/images/"
    GCS_PROJECT_ID = "project_id"

For information about authentication, see this `documentation`_.

.. _documentation: https://docs.cloud.google.com/docs/authentication

You can modify the Access Control List (ACL) policy used for the stored files,
which is defined by the :setting:`FILES_STORE_GCS_ACL` and
:setting:`IMAGES_STORE_GCS_ACL` settings. By default, the ACL is set to
``''`` (empty string) which means that Cloud Storage applies the bucket's default object ACL to the object.
To make the files publicly available use the ``publicRead``
policy:

.. code-block:: python

    IMAGES_STORE_GCS_ACL = "publicRead"

For more information, see `Predefined ACLs`_ in the Google Cloud Platform Developer Guide.

.. _Predefined ACLs: https://docs.cloud.google.com/storage/docs/access-control/lists#predefined-acl

Usage example
=============

.. setting:: FILES_URLS_FIELD
.. setting:: FILES_RESULT_FIELD
.. setting:: IMAGES_URLS_FIELD
.. setting:: IMAGES_RESULT_FIELD

In order to use a media pipeline, first :ref:`enable it
<topics-media-pipeline-enabling>`.

Then, if a spider returns an :ref:`item object <topics-items>` with the URLs
field (``file_urls`` or ``image_urls``, for the Files or Images Pipeline
respectively), the pipeline will put the results under the respective field
(``files`` or ``images``).

When using :ref:`item types <item-types>` for which fields are defined beforehand,
you must define both the URLs field and the results field. For example, when
using the images pipeline, items must define both the ``image_urls`` and
``images`` fields. For instance, using the :class:`~scrapy.Item` class:

.. code-block:: python

    import scrapy

    class MyItem(scrapy.Item):
        # ... other item fields ...
        image_urls = scrapy.Field()
        images = scrapy.Field()

If you want to use another field name for the URLs key or for the results key,
it is also possible to override it.

For the Files Pipeline, set :setting:`FILES_URLS_FIELD` and/or
:setting:`FILES_RESULT_FIELD` settings:

.. code-block:: python

    FILES_URLS_FIELD = "field_name_for_your_files_urls"
    FILES_RESULT_FIELD = "field_name_for_your_processed_files"

For the Images Pipeline, set :setting:`IMAGES_URLS_FIELD` and/or
:setting:`IMAGES_RESULT_FIELD` settings:

.. code-block:: python

    IMAGES_URLS_FIELD = "field_name_for_your_images_urls"
    IMAGES_RESULT_FIELD = "field_name_for_your_processed_images"

If you need something more complex and want to override the custom pipeline
behaviour, see :ref:`topics-media-pipeline-override`.

If you have multiple image pipelines inheriting from :class:`ImagesPipeline` and you want
to have different settings in different pipelines, you can set setting keys
prefixed with the uppercase name of your pipeline class. E.g. if your pipeline is
called ``MyPipeline`` and you want a custom :setting:`IMAGES_URLS_FIELD`, define the
setting ``MYPIPELINE_IMAGES_URLS_FIELD`` and your custom settings will be used.
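
For example (a sketch; ``MyPipeline`` is a hypothetical subclass of
:class:`ImagesPipeline`):

.. code-block:: python

    ITEM_PIPELINES = {"myproject.pipelines.MyPipeline": 1}

    MYPIPELINE_IMAGES_URLS_FIELD = "photo_urls"
    MYPIPELINE_IMAGES_RESULT_FIELD = "photos"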

Additional features
===================

.. _file-expiration:

File expiration
---------------

.. setting:: IMAGES_EXPIRES
.. setting:: FILES_EXPIRES

The media pipelines avoid downloading files that were downloaded recently. To
adjust this retention delay use the :setting:`FILES_EXPIRES` setting (or
:setting:`IMAGES_EXPIRES`, in the case of the Images Pipeline), which
specifies the delay as a number of days:

.. code-block:: python

    # 120 days of delay for files expiration
    FILES_EXPIRES = 120

    # 30 days of delay for images expiration
    IMAGES_EXPIRES = 30

The default value for both settings is 90 days.

If you have a pipeline that subclasses
:class:`~scrapy.pipelines.files.FilesPipeline` and you'd like to have a
different setting for it, you can set setting keys prefixed by the uppercase
class name. E.g. given a pipeline class called ``MyPipeline``, you can set the
setting key:

.. code-block:: python

    MYPIPELINE_FILES_EXPIRES = 180

and the ``MyPipeline`` class will have its expiration time set to 180 days.

The last modified time from the file is used to determine the age of the file in days,
which is then compared to the set expiration time to determine if the file is expired.

.. _topics-images-thumbnails:

Thumbnail generation for images
-------------------------------

The Images Pipeline can automatically create thumbnails of the downloaded
images.

.. setting:: IMAGES_THUMBS

In order to use this feature, you must set :setting:`IMAGES_THUMBS` to a dictionary
where the keys are the thumbnail names and the values are their dimensions.

For example:

.. code-block:: python

   IMAGES_THUMBS = {
       "small": (50, 50),
       "big": (270, 270),
   }

When you use this feature, the Images Pipeline will create thumbnails of
each specified size with this format::

    <IMAGES_STORE>/thumbs/<size_name>/<image_id>.jpg

Where:

* ``<size_name>`` is the one specified in the :setting:`IMAGES_THUMBS`
  dictionary keys (``small``, ``big``, etc)

* ``<image_id>`` is the `SHA-1 hash`_ of the image url

.. _SHA-1 hash: https://en.wikipedia.org/wiki/SHA_hash_functions

Example of image files stored using ``small`` and ``big`` thumbnail names::

   <IMAGES_STORE>/full/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
   <IMAGES_STORE>/thumbs/small/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
   <IMAGES_STORE>/thumbs/big/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg

The first one is the full image, as downloaded from the site.

Filtering out small images
--------------------------

.. setting:: IMAGES_MIN_HEIGHT

.. setting:: IMAGES_MIN_WIDTH

When using the Images Pipeline, you can drop images which are too small, by
specifying the minimum allowed size in the :setting:`IMAGES_MIN_HEIGHT` and
:setting:`IMAGES_MIN_WIDTH` settings.

For example::

   IMAGES_MIN_HEIGHT = 110
   IMAGES_MIN_WIDTH = 110

.. note::
    The size constraints don't affect thumbnail generation at all.

It is possible to set just one size constraint or both. When setting both of
them, only images that satisfy both minimum sizes will be saved. For the
above example, images of sizes (105 x 105) or (105 x 200) or (200 x 105) will
all be dropped because at least one dimension is shorter than the constraint.

By default, there are no size constraints, so all images are processed.

Allowing redirections
---------------------

.. setting:: MEDIA_ALLOW_REDIRECTS

By default media pipelines ignore redirects, i.e. an HTTP redirection
of a media file URL request means the media download is considered to have failed.

To handle media redirections, set this setting to ``True``::

    MEDIA_ALLOW_REDIRECTS = True

.. _topics-media-pipeline-override:

Extending the Media Pipelines
=============================

.. module:: scrapy.pipelines.files
   :synopsis: Files Pipeline

These are the methods that you can override in your custom Files Pipeline:

.. class:: FilesPipeline

   .. method:: file_path(self, request, response=None, info=None, *, item=None)

      This method is called once per downloaded item. It returns the
      download path of the file originating from the specified
      :class:`response <scrapy.http.Response>`.

      In addition to ``response``, this method receives the original
      :class:`request <scrapy.Request>`,
      :class:`info <scrapy.pipelines.media.MediaPipeline.SpiderInfo>` and
      :class:`item <scrapy.Item>`.

      You can override this method to customize the download path of each file.

      For example, if file URLs end like regular paths (e.g.
      ``https://example.com/a/b/c/foo.png``), you can use the following
      approach to download all files into the ``files`` folder with their
      original filenames (e.g. ``files/foo.png``):

      .. code-block:: python

        from pathlib import PurePosixPath
        from scrapy.utils.httpobj import urlparse_cached

        from scrapy.pipelines.files import FilesPipeline

        class MyFilesPipeline(FilesPipeline):
            def file_path(self, request, response=None, info=None, *, item=None):
                return "files/" + PurePosixPath(urlparse_cached(request).path).name

      Similarly, you can use the ``item`` to determine the file path based on some item
      property.
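
      For instance, here is a sketch that groups downloads by a hypothetical
      ``product_id`` item field:

      .. code-block:: python

        from itemadapter import ItemAdapter

        from scrapy.pipelines.files import FilesPipeline

        class ProductFilesPipeline(FilesPipeline):
            def file_path(self, request, response=None, info=None, *, item=None):
                # "product_id" is a hypothetical item field used to group files
                adapter = ItemAdapter(item)
                filename = request.url.split("/")[-1]
                return f"files/{adapter['product_id']}/{filename}"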

      By default the :meth:`file_path` method returns
      ``full/<request URL hash>.<extension>``.

   .. method:: FilesPipeline.get_media_requests(item, info)

      As seen in the workflow, the pipeline will get the URLs of the files to
      download from the item. In order to do this, you can override the
      :meth:`~get_media_requests` method and return a Request for each
      file URL:

      .. code-block:: python

         import scrapy
         from itemadapter import ItemAdapter

         def get_media_requests(self, item, info):
             adapter = ItemAdapter(item)
             for file_url in adapter["file_urls"]:
                 yield scrapy.Request(file_url)

      Those requests will be processed by the pipeline and, when they have finished
      downloading, the results will be sent to the
      :meth:`~item_completed` method, as a list of 2-element tuples.
      Each tuple will contain ``(success, file_info_or_error)`` where:

      * ``success`` is a boolean which is ``True`` if the file was downloaded
        successfully or ``False`` if it failed for some reason

      * ``file_info_or_error`` is a dict containing the following keys (if
        success is ``True``) or a :exc:`~twisted.python.failure.Failure` if
        there was a problem.

        * ``url`` - the url where the file was downloaded from. This is the url of
          the request returned from the :meth:`~get_media_requests`
          method.

        * ``path`` - the path (relative to :setting:`FILES_STORE`) where the file
          was stored

        * ``checksum`` - an `MD5 hash`_ of the file contents

        * ``status`` - the file status.

          It can be one of the following:

          * ``downloaded`` - file was downloaded.
          * ``uptodate`` - file was not downloaded, as it was downloaded recently,
            according to the file expiration policy.
          * ``cached`` - file was already scheduled for download, by another item
            sharing the same file.

      The list of tuples received by :meth:`~item_completed` is
      guaranteed to retain the same order as the requests returned from the
      :meth:`~get_media_requests` method.

      Here's a typical value of the ``results`` argument:

      .. invisible-code-block: python

          from twisted.python.failure import Failure

      .. code-block:: python

          [
              (
                  True,
                  {
                      "checksum": "2b00042f7481c7b056c4b410d28f33cf",
                      "path": "full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg",
                      "url": "http://www.example.com/files/product1.pdf",
                      "status": "downloaded",
                  },
              ),
              (False, Failure(...)),
          ]

      By default, the :meth:`get_media_requests` method returns a request for
      each URL in the item's ``file_urls`` field.

   .. method:: FilesPipeline.item_completed(results, item, info)

      The :meth:`FilesPipeline.item_completed` method is called when all file
      requests for a single item have completed (either finished downloading, or
      failed for some reason).

      The :meth:`~item_completed` method must return the
      output that will be sent to subsequent item pipeline stages, so you must
      return (or drop) the item, as you would in any pipeline.

      Here is an example of the :meth:`~item_completed` method where we
      store the downloaded file paths (passed in results) in the ``file_paths``
      item field, and we drop the item if it doesn't contain any files:

      .. code-block:: python

          from itemadapter import ItemAdapter
          from scrapy.exceptions import DropItem

          def item_completed(self, results, item, info):
              file_paths = [x["path"] for ok, x in results if ok]
              if not file_paths:
                  raise DropItem("Item contains no files")
              adapter = ItemAdapter(item)
              adapter["file_paths"] = file_paths
              return item

      By default, the :meth:`item_completed` method returns the item.

.. module:: scrapy.pipelines.images
   :synopsis: Images Pipeline

These are the methods that you can override in your custom Images Pipeline:

.. class:: ImagesPipeline

    The :class:`ImagesPipeline` is an extension of the :class:`FilesPipeline`,
    customizing the field names and adding custom behavior for images.

   .. method:: file_path(self, request, response=None, info=None, *, item=None)

      This method is called once per downloaded item. It returns the
      download path of the file originating from the specified
      :class:`response <scrapy.http.Response>`.

      In addition to ``response``, this method receives the original
      :class:`request <scrapy.Request>`,
      :class:`info <scrapy.pipelines.media.MediaPipeline.SpiderInfo>` and
      :class:`item <scrapy.Item>`.

      You can override this method to customize the download path of each file.

      For example, if file URLs end like regular paths (e.g.
      ``https://example.com/a/b/c/foo.png``), you can use the following
      approach to download all files into the ``files`` folder with their
      original filenames (e.g. ``files/foo.png``):

      .. code-block:: python

        from pathlib import PurePosixPath
        from scrapy.utils.httpobj import urlparse_cached

        from scrapy.pipelines.images import ImagesPipeline

        class MyImagesPipeline(ImagesPipeline):
            def file_path(self, request, response=None, info=None, *, item=None):
                return "files/" + PurePosixPath(urlparse_cached(request).path).name

      Similarly, you can use the ``item`` to determine the file path based on some item
      property.

      By default the :meth:`file_path` method returns
      ``full/<request URL hash>.<extension>``.

   .. method:: ImagesPipeline.thumb_path(self, request, thumb_id, response=None, info=None, *, item=None)

      This method is called once per entry in :setting:`IMAGES_THUMBS` for each
      downloaded image. It returns the thumbnail download path of the image
      originating from the specified :class:`response <scrapy.http.Response>`.

      In addition to ``response``, this method receives the original
      :class:`request <scrapy.Request>`,
      ``thumb_id``,
      :class:`info <scrapy.pipelines.media.MediaPipeline.SpiderInfo>` and
      :class:`item <scrapy.Item>`.

      You can override this method to customize the thumbnail download path of each image.
      You can use the ``item`` to determine the file path based on some item
      property.

      By default the :meth:`thumb_path` method returns
      ``thumbs/<size name>/<request URL hash>.<extension>``.

   .. method:: ImagesPipeline.get_media_requests(item, info)

      Works the same way as the :meth:`FilesPipeline.get_media_requests` method,
      but uses a different field name for image URLs.

      Must return a Request for each image URL.

   .. method:: ImagesPipeline.item_completed(results, item, info)

      The :meth:`ImagesPipeline.item_completed` method is called when all image
      requests for a single item have completed (either finished downloading, or
      failed for some reason).

      Works the same way as the :meth:`FilesPipeline.item_completed` method,
      but uses different field names for storing image download results.

      By default, the :meth:`item_completed` method returns the item.

.. _media-pipeline-example:

Custom Images pipeline example
==============================

Here is a full example of the Images Pipeline whose methods are exemplified
above:

.. code-block:: python

    import scrapy
    from itemadapter import ItemAdapter
    from scrapy.exceptions import DropItem
    from scrapy.pipelines.images import ImagesPipeline

    class MyImagesPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            for image_url in item["image_urls"]:
                yield scrapy.Request(image_url)

        def item_completed(self, results, item, info):
            image_paths = [x["path"] for ok, x in results if ok]
            if not image_paths:
                raise DropItem("Item contains no images")
            adapter = ItemAdapter(item)
            adapter["image_paths"] = image_paths
            return item

To enable your custom media pipeline component you must add its class import path to the
:setting:`ITEM_PIPELINES` setting, like in the following example:

.. code-block:: python

   ITEM_PIPELINES = {"myproject.pipelines.MyImagesPipeline": 300}

.. _MD5 hash: https://en.wikipedia.org/wiki/MD5


.. _topics-practices:

================
Common Practices
================

This section documents common practices when using Scrapy. These are things
that cover many topics and don't often fall into any other specific section.

.. skip: start

.. _run-from-script:

Run Scrapy from a script
========================

You can use the :ref:`API <topics-api>` to run Scrapy from a script, instead of
the typical way of running Scrapy via ``scrapy crawl``.

Remember that Scrapy is built on top of the Twisted
asynchronous networking library, so you need to run it inside the Twisted reactor.

The first utility you can use to run your spiders is
:class:`scrapy.crawler.AsyncCrawlerProcess` or
:class:`scrapy.crawler.CrawlerProcess`. These classes will start a Twisted
reactor for you, configuring the logging and setting shutdown handlers. These
classes are the ones used by all Scrapy commands. They have similar
functionality, differing in their asynchronous API style:
:class:`~scrapy.crawler.AsyncCrawlerProcess` returns coroutines from its
asynchronous methods while :class:`~scrapy.crawler.CrawlerProcess` returns
:class:`~twisted.internet.defer.Deferred` objects.

Here's an example showing how to run a single spider with
:class:`~scrapy.crawler.AsyncCrawlerProcess`.

.. code-block:: python

    import scrapy
    from scrapy.crawler import AsyncCrawlerProcess

    class MySpider(scrapy.Spider):
        # Your spider definition
        ...

    process = AsyncCrawlerProcess(
        settings={
            "FEEDS": {
                "items.json": {"format": "json"},
            },
        }
    )

    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished

You can define :ref:`settings <topics-settings>` within the dictionary passed
to :class:`~scrapy.crawler.AsyncCrawlerProcess`. Make sure to check the
:class:`~scrapy.crawler.AsyncCrawlerProcess`
documentation to get acquainted with its usage details.

If you are inside a Scrapy project there are some additional helpers you can
use to import those components within the project. You can automatically import
your spiders passing their name to
:class:`~scrapy.crawler.AsyncCrawlerProcess`, and use
:func:`scrapy.utils.project.get_project_settings` to get a
:class:`~scrapy.settings.Settings` instance with your project settings.

What follows is a working example of how to do that, using the `testspiders`_
project as example.

.. code-block:: python

    from scrapy.crawler import AsyncCrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = AsyncCrawlerProcess(get_project_settings())

    # 'followall' is the name of one of the spiders of the project.
    process.crawl("followall", domain="scrapy.org")
    process.start()  # the script will block here until the crawling is finished

There's another Scrapy utility that provides more control over the crawling
process: :class:`scrapy.crawler.AsyncCrawlerRunner` or
:class:`scrapy.crawler.CrawlerRunner`. These classes are thin wrappers
that encapsulate some simple helpers to run multiple crawlers, but they won't
start or interfere with existing reactors in any way. Just like
:class:`scrapy.crawler.AsyncCrawlerProcess` and
:class:`scrapy.crawler.CrawlerProcess` they differ in their asynchronous API
style.

When using these classes, the reactor should be explicitly run after scheduling
your spiders. It's recommended that you use
:class:`~scrapy.crawler.AsyncCrawlerRunner` or
:class:`~scrapy.crawler.CrawlerRunner` instead of
:class:`~scrapy.crawler.AsyncCrawlerProcess` or
:class:`~scrapy.crawler.CrawlerProcess` if your application is already using
Twisted and you want to run Scrapy in the same reactor.

If you want to stop the reactor or run any other code right after the spider
finishes you can do that after the task returned from
:meth:`AsyncCrawlerRunner.crawl() <scrapy.crawler.AsyncCrawlerRunner.crawl>`
completes (or the Deferred returned from :meth:`CrawlerRunner.crawl()
<scrapy.crawler.CrawlerRunner.crawl>` fires). In the simplest case you can also
use :func:`twisted.internet.task.react` to start and stop the reactor, though
it may be easier to just use :class:`~scrapy.crawler.AsyncCrawlerProcess` or
:class:`~scrapy.crawler.CrawlerProcess` instead.

Here's an example of using :class:`~scrapy.crawler.AsyncCrawlerRunner` together
with simple reactor management code:

.. code-block:: python

    import scrapy
    from scrapy.crawler import AsyncCrawlerRunner
    from scrapy.utils.defer import deferred_f_from_coro_f
    from scrapy.utils.log import configure_logging
    from scrapy.utils.reactor import install_reactor
    from twisted.internet.task import react

    class MySpider(scrapy.Spider):
        # Your spider definition
        ...

    async def crawl(_):
        configure_logging({"LOG_FORMAT": "%(levelname)s: %(message)s"})
        runner = AsyncCrawlerRunner()
        await runner.crawl(MySpider)  # completes when the spider finishes

    install_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")
    react(deferred_f_from_coro_f(crawl))

Same example but using :class:`~scrapy.crawler.CrawlerRunner` and a
different reactor (:class:`~scrapy.crawler.AsyncCrawlerRunner` only works
with :class:`~twisted.internet.asyncioreactor.AsyncioSelectorReactor`):

.. code-block:: python

    import scrapy
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.reactor import install_reactor
    from twisted.internet.task import react

    class MySpider(scrapy.Spider):
        custom_settings = {
            "TWISTED_REACTOR": "twisted.internet.epollreactor.EPollReactor",
        }
        # Your spider definition
        ...

    def crawl(_):
        configure_logging({"LOG_FORMAT": "%(levelname)s: %(message)s"})
        runner = CrawlerRunner()
        d = runner.crawl(MySpider)
        return d  # this Deferred fires when the spider finishes

    install_reactor("twisted.internet.epollreactor.EPollReactor")
    react(crawl)

.. seealso:: :doc:`twisted:core/howto/reactor-basics`

And here are examples of using these classes with
:setting:`TWISTED_REACTOR_ENABLED` set to ``False``.

Simple usage of :class:`~scrapy.crawler.AsyncCrawlerProcess`:

.. code-block:: python

    import scrapy
    from scrapy.crawler import AsyncCrawlerProcess

    class MySpider(scrapy.Spider):
        # Your spider definition
        ...

    process = AsyncCrawlerProcess(
        settings={
            "TWISTED_REACTOR_ENABLED": False,
        }
    )

    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished

With ``TWISTED_REACTOR_ENABLED=False`` you can use several instances of
:class:`~scrapy.crawler.AsyncCrawlerProcess` in the same process:

.. code-block:: python

    import scrapy
    from scrapy.crawler import AsyncCrawlerProcess

    class MySpider(scrapy.Spider):
        # Your spider definition
        ...

    process1 = AsyncCrawlerProcess(
        settings={
            "TWISTED_REACTOR_ENABLED": False,
        }
    )
    process1.crawl(MySpider)
    process1.start()

    process2 = AsyncCrawlerProcess(
        settings={
            "TWISTED_REACTOR_ENABLED": False,
        }
    )
    process2.crawl(MySpider)
    process2.start()

Using :func:`asyncio.run` with :class:`~scrapy.crawler.AsyncCrawlerRunner`:

.. code-block:: python

    import asyncio

    import scrapy
    from scrapy.crawler import AsyncCrawlerRunner
    from scrapy.utils.log import configure_logging

    class MySpider(scrapy.Spider):
        # Your spider definition
        ...

    async def main():
        configure_logging({"LOG_FORMAT": "%(levelname)s: %(message)s"})
        runner = AsyncCrawlerRunner(settings={"TWISTED_REACTOR_ENABLED": False})
        await runner.crawl(MySpider)  # completes when the spider finishes

    asyncio.run(main())

.. _run-multiple-spiders:

Running multiple spiders in the same process
============================================

By default, Scrapy runs a single spider per process when you run ``scrapy
crawl``. However, Scrapy supports running multiple spiders per process using
the :ref:`internal API <topics-api>`.

Here is an example that runs multiple spiders simultaneously:

.. code-block:: python

    import scrapy
    from scrapy.crawler import AsyncCrawlerProcess
    from scrapy.utils.project import get_project_settings

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    settings = get_project_settings()
    process = AsyncCrawlerProcess(settings)
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start()  # the script will block here until all crawling jobs are finished

Same example using :class:`~scrapy.crawler.AsyncCrawlerRunner`:

.. code-block:: python

    import scrapy
    from scrapy.crawler import AsyncCrawlerRunner
    from scrapy.utils.defer import deferred_f_from_coro_f
    from scrapy.utils.log import configure_logging
    from scrapy.utils.reactor import install_reactor
    from twisted.internet.task import react

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    async def crawl(_):
        configure_logging({"LOG_FORMAT": "%(levelname)s: %(message)s"})
        runner = AsyncCrawlerRunner()
        runner.crawl(MySpider1)
        runner.crawl(MySpider2)
        await runner.join()  # completes when both spiders finish

    install_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")
    react(deferred_f_from_coro_f(crawl))

Same example but running the spiders sequentially by awaiting until each one
finishes before starting the next one:

.. code-block:: python

    import scrapy
    from scrapy.crawler import AsyncCrawlerRunner
    from scrapy.utils.defer import deferred_f_from_coro_f
    from scrapy.utils.log import configure_logging
    from scrapy.utils.reactor import install_reactor
    from twisted.internet.task import react

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    async def crawl(_):
        configure_logging({"LOG_FORMAT": "%(levelname)s: %(message)s"})
        runner = AsyncCrawlerRunner()
        await runner.crawl(MySpider1)
        await runner.crawl(MySpider2)

    install_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")
    react(deferred_f_from_coro_f(crawl))

.. note:: When running multiple spiders in the same process, :ref:`reactor
    settings <reactor-settings>` should not have a different value per spider.
    Also, :ref:`pre-crawler settings <pre-crawler-settings>` cannot be defined
    per spider.

.. seealso:: :ref:`run-from-script`.

.. skip: end

.. _distributed-crawls:

Distributed crawls
==================

Scrapy doesn't provide any built-in facility for running crawls in a distributed
(multi-server) manner. However, there are some ways to distribute crawls, which
vary depending on how you plan to distribute them.

If you have many spiders, the obvious way to distribute the load is to set up
many Scrapyd instances and distribute spider runs among those.

If you instead want to run a single (big) spider through many machines, what
you usually do is partition the URLs to crawl and send them to each separate
spider. Here is a concrete example:

First, you prepare the list of URLs to crawl and split it across separate
files::

    http://somedomain.com/urls-to-crawl/spider1/part1.list
    http://somedomain.com/urls-to-crawl/spider1/part2.list
    http://somedomain.com/urls-to-crawl/spider1/part3.list

Then you fire a spider run on 3 different Scrapyd servers. The spider would
receive a ``part`` spider argument with the number of the partition to
crawl::

    curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
    curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
    curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3
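
A minimal sketch of a spider consuming such a ``part`` argument could look
like this (the spider name and URL pattern come from the example above):

.. code-block:: python

    import scrapy

    class Spider1(scrapy.Spider):
        name = "spider1"

        async def start(self):
            # "part" is set as a spider attribute by the default
            # Spider.__init__ from the -d part=N argument.
            yield scrapy.Request(
                f"http://somedomain.com/urls-to-crawl/spider1/part{self.part}.list",
                callback=self.parse_url_list,
            )

        def parse_url_list(self, response):
            # Each line of the .list file is a URL to crawl.
            for url in response.text.splitlines():
                yield scrapy.Request(url)  # handled by self.parse by default

        def parse(self, response):
            ...  # extract data here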

.. _bans:

Avoiding getting banned
=======================

Some websites implement certain measures to prevent bots from crawling them,
with varying degrees of sophistication. Getting around those measures can be
difficult and tricky, and may sometimes require special infrastructure. Please
consider contacting `commercial support`_ if in doubt.

Here are some tips to keep in mind when dealing with these kinds of sites:

* rotate your user agent from a pool of well-known ones from browsers (Google
  around to get a list of them)
* disable cookies (see :setting:`COOKIES_ENABLED`) as some sites may use
  cookies to spot bot behaviour
* use download delays (2 or higher); see the :setting:`DOWNLOAD_DELAY` setting
  and the settings sketch after this list
* if possible, use `Common Crawl`_ to fetch pages, instead of hitting the sites
  directly
* use a pool of rotating IPs. For example, the free `Tor project`_ or paid
  services like `ProxyMesh`_. An open source alternative is `scrapoxy`_, a
  super proxy that you can attach your own proxies to.
* use a ban avoidance service, such as `Zyte API`_, which provides a `Scrapy
  plugin <https://github.com/scrapy-plugins/scrapy-zyte-api>`__ and additional
  features, like `AI web scraping <https://www.zyte.com/ai-web-scraping/>`__
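
For example, a minimal sketch of project settings applying two of these tips
(the values are illustrative):

.. code-block:: python

    # settings.py
    COOKIES_ENABLED = False  # some sites use cookies to spot bot behaviour
    DOWNLOAD_DELAY = 2  # wait at least 2 seconds between requests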

If you are still unable to prevent your bot from getting banned, consider
contacting `commercial support`_.

.. _Tor project: https://www.torproject.org/
.. _commercial support: https://www.scrapy.org/companies
.. _ProxyMesh: https://proxymesh.com/
.. _Common Crawl: https://commoncrawl.org/
.. _testspiders: https://github.com/scrapinghub/testspiders
.. _scrapoxy: https://scrapoxy.io/
.. _Zyte API: https://docs.zyte.com/zyte-api/get-started.html


.. _topics-request-response:

======================
Requests and Responses
======================

.. module:: scrapy.http
   :synopsis: Request and Response classes

Scrapy uses :class:`~scrapy.Request` and :class:`Response` objects for crawling
websites.

Typically, :class:`~scrapy.Request` objects are generated in the spiders and pass
across the system until they reach the Downloader, which executes the request
and returns a :class:`Response` object that travels back to the spider that
issued the request.

Both :class:`~scrapy.Request` and :class:`Response` classes have subclasses which add
functionality not required in the base classes. These are described
below in :ref:`topics-request-response-ref-request-subclasses` and
:ref:`topics-request-response-ref-response-subclasses`.

Request objects
===============

.. autoclass:: scrapy.Request

    :param url: the URL of this request

        If the URL is invalid, a :exc:`ValueError` exception is raised.
    :type url: str

    :param callback: sets :attr:`callback`, defaults to ``None``.
    :type callback: Callable[Concatenate[Response, ...], Any] | None

    :param method: the HTTP method of this request. Defaults to ``'GET'``.
    :type method: str

    :param meta: the initial values for the :attr:`.Request.meta` attribute. If
       given, the dict passed in this parameter will be shallow copied.
    :type meta: dict

    :param body: the request body. If a string is passed, then it's encoded as
      bytes using the ``encoding`` passed (which defaults to ``utf-8``). If
      ``body`` is not given, an empty bytes object is stored. Regardless of the
      type of this argument, the final value stored will be a bytes object
      (never a string or ``None``).
    :type body: bytes or str

    :param headers: the headers of this request. The dict values can be strings
       (for single valued headers) or lists (for multi-valued headers). If
       ``None`` is passed as value, the HTTP header will not be sent at all.

       .. caution:: Cookies set via the ``Cookie`` header are not considered by the
           :ref:`cookies-mw`. If you need to set cookies for a request, use the
           ``cookies`` argument. This is a known current limitation that is being
           worked on.

    :type headers: dict

    :param cookies: the request cookies. These can be sent in two forms.

        .. invisible-code-block: python

            from scrapy.http import Request

        1. Using a dict:

        .. code-block:: python

            request_with_cookies = Request(
                url="http://www.example.com",
                cookies={"currency": "USD", "country": "UY"},
            )

        2. Using a list of dicts:

        .. code-block:: python

            request_with_cookies = Request(
                url="https://www.example.com",
                cookies=[
                    {
                        "name": "currency",
                        "value": "USD",
                        "domain": "example.com",
                        "path": "/currency",
                        "secure": True,
                    },
                ],
            )

        The latter form allows for customizing the ``domain`` and ``path``
        attributes of the cookie. This is only useful if the cookies are saved
        for later requests.

        .. reqmeta:: dont_merge_cookies

        When some site returns cookies (in a response) those are stored in the
        cookies for that domain and will be sent again in future requests.
        That's the typical behaviour of any regular web browser.

        Note that setting the :reqmeta:`dont_merge_cookies` key to ``True`` in
        :attr:`request.meta <scrapy.Request.meta>` causes custom cookies to be
        ignored.

        For more info see :ref:`cookies-mw`.
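
        For example, a minimal sketch of a request that ignores previously
        stored cookies:

        .. code-block:: python

            request_without_stored_cookies = Request(
                url="https://www.example.com",
                meta={"dont_merge_cookies": True},
            )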

        .. caution:: Cookies set via the ``Cookie`` header are not considered by the
            :ref:`cookies-mw`. If you need to set cookies for a request, use the
            :class:`scrapy.Request.cookies <scrapy.Request>` parameter. This is a known
            current limitation that is being worked on.

    :type cookies: dict or list

    :param encoding: the encoding of this request (defaults to ``'utf-8'``).
       This encoding will be used to percent-encode the URL and to convert the
       body to bytes (if given as a string).
    :type encoding: str

    :param priority: sets :attr:`priority`, defaults to ``0``.
    :type priority: int

    :param dont_filter: sets :attr:`dont_filter`, defaults to ``False``.
    :type dont_filter: bool

    :param errback: sets :attr:`errback`, defaults to ``None``.
    :type errback: Callable[[Failure], Any] | None

    :param flags: Flags sent to the request; can be used for logging or similar purposes.
    :type flags: list

    :param cb_kwargs: A dict with arbitrary data that will be passed as keyword arguments to the Request's callback.
    :type cb_kwargs: dict

    .. attribute:: Request.url

        A string containing the URL of this request. Keep in mind that this
        attribute contains the escaped URL, so it can differ from the URL passed in
        the ``__init__()`` method.

        This attribute is read-only. To change the URL of a Request use
        :meth:`replace`.

    .. attribute:: Request.method

        A string representing the HTTP method in the request. This is guaranteed to
        be uppercase. Example: ``"GET"``, ``"POST"``, ``"PUT"``, etc.

    .. attribute:: Request.headers

        A dictionary-like (:class:`scrapy.http.headers.Headers`) object which contains
        the request headers.

    .. attribute:: Request.body

        The request body as bytes.

        This attribute is read-only. To change the body of a Request use
        :meth:`replace`.

    .. autoattribute:: callback

    .. autoattribute:: errback

    .. autoattribute:: priority

    .. attribute:: Request.cb_kwargs

        A dictionary that contains arbitrary metadata for this request. Its contents
        will be passed to the Request's callback as keyword arguments. It is empty
        for new Requests, which means by default callbacks only get a
        :class:`~scrapy.http.Response` object as argument.

        This dict is :doc:`shallow copied <library/copy>` when the request is
        cloned using the ``copy()`` or ``replace()`` methods, and can also be
        accessed, in your spider, from the ``response.cb_kwargs`` attribute.

        In case of a failure to process the request, this dict can be accessed as
        ``failure.request.cb_kwargs`` in the request's errback. For more information,
        see :ref:`errback-cb_kwargs`.

    .. attribute:: Request.meta
       :value: {}

        A dictionary of arbitrary metadata for the request.

        You may extend request metadata as you see fit.

        Request metadata can also be accessed through the
        :attr:`~scrapy.http.Response.meta` attribute of a response.

        To pass data from one spider callback to another, consider using
        :attr:`cb_kwargs` instead. However, request metadata may be the right
        choice in certain scenarios, such as to maintain some debugging data
        across all follow-up requests (e.g. the source URL).

        A common use of request metadata is to define request-specific
        parameters for Scrapy components (extensions, middlewares, etc.). For
        example, if you set ``dont_retry`` to ``True``,
        :class:`~scrapy.downloadermiddlewares.retry.RetryMiddleware` will never
        retry that request, even if it fails. See :ref:`topics-request-meta`.
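
        For example, to disable retries for a specific request:

        .. code-block:: python

            Request(
                "https://example.org",
                meta={"dont_retry": True},
            )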

        You may also use request metadata in your custom Scrapy components, for
        example, to keep request state information relevant to your component.
        For example,
        :class:`~scrapy.downloadermiddlewares.retry.RetryMiddleware` uses the
        ``retry_times`` metadata key to keep track of how many times a request
        has been retried so far.

        Copying all the metadata of a previous request into a new, follow-up
        request in a spider callback is a bad practice, because request
        metadata may include metadata set by Scrapy components that is not
        meant to be copied into other requests. For example, copying the
        ``retry_times`` metadata key into follow-up requests can lower the
        number of retries allowed for those follow-up requests.

        You should only copy all request metadata from one request to another
        if the new request is meant to replace the old request, as is often the
        case when returning a request from a :ref:`downloader middleware
        <topics-downloader-middleware>` method.

        Also mind that the :meth:`copy` and :meth:`replace` request methods
        :doc:`shallow-copy <library/copy>` request metadata.

    .. autoattribute:: dont_filter

    .. autoattribute:: Request.attributes

    .. method:: Request.copy()

       Return a new Request which is a copy of this Request. See also:
       :ref:`topics-request-response-ref-request-callback-arguments`.

    .. method:: Request.replace([url, method, headers, body, cookies, meta, flags, encoding, priority, dont_filter, callback, errback, cb_kwargs])

       Return a Request object with the same members, except for those members
       given new values by whichever keyword arguments are specified. The
       :attr:`~scrapy.Request.cb_kwargs` and :attr:`~scrapy.Request.meta` attributes are shallow
       copied by default (unless new values are given as arguments). See also
       :ref:`topics-request-response-ref-request-callback-arguments`.

    .. automethod:: from_curl

    .. automethod:: to_dict

Other functions related to requests
-----------------------------------

.. autofunction:: scrapy.http.request.NO_CALLBACK

.. autofunction:: scrapy.utils.request.request_from_dict

.. _topics-request-response-ref-request-callback-arguments:

Passing additional data to callback functions
---------------------------------------------

The callback of a request is a function that will be called when the response
of that request is downloaded. The callback function will be called with the
downloaded :class:`Response` object as its first argument.

Example:

.. code-block:: python

    def parse_page1(self, response):
        return scrapy.Request(
            "http://www.example.com/some_page.html", callback=self.parse_page2
        )

    def parse_page2(self, response):
        # this would log http://www.example.com/some_page.html
        self.logger.info("Visited %s", response.url)

In some cases you may be interested in passing arguments to those callback
functions so you can receive the arguments later, in the second callback.
The following example shows how to achieve this by using the
:attr:`.Request.cb_kwargs` attribute:

.. code-block:: python

    def parse(self, response):
        request = scrapy.Request(
            "http://www.example.com/index.html",
            callback=self.parse_page2,
            cb_kwargs=dict(main_url=response.url),
        )
        request.cb_kwargs["foo"] = "bar"  # add more arguments for the callback
        yield request

    def parse_page2(self, response, main_url, foo):
        yield dict(
            main_url=main_url,
            other_url=response.url,
            foo=foo,
        )

.. caution:: :attr:`.Request.cb_kwargs` was introduced in version ``1.7``.
   Prior to that, using :attr:`.Request.meta` was recommended for passing
   information around callbacks. After ``1.7``, :attr:`.Request.cb_kwargs`
   became the preferred way for handling user information, leaving :attr:`.Request.meta`
   for communication with components like middlewares and extensions.

.. _topics-request-response-ref-errbacks:

Using errbacks to catch exceptions in request processing
--------------------------------------------------------

The errback of a request is a function that will be called when an exception
is raised while processing it.

It receives a :exc:`~twisted.python.failure.Failure` as first parameter and can
be used to track connection establishment timeouts, DNS errors etc.

Here's an example spider logging all errors and catching some specific
errors if needed:

.. code-block:: python

    import scrapy

    from scrapy.spidermiddlewares.httperror import HttpError
    from twisted.internet.error import DNSLookupError
    from twisted.internet.error import TimeoutError, TCPTimedOutError

    class ErrbackSpider(scrapy.Spider):
        name = "errback_example"
        start_urls = [
            "http://www.httpbin.org/",  # HTTP 200 expected
            "http://www.httpbin.org/status/404",  # Not found error
            "http://www.httpbin.org/status/500",  # server issue
            "http://www.httpbin.org:12345/",  # non-responding host, timeout expected
            "https://example.invalid/",  # DNS error expected
        ]

        async def start(self):
            for u in self.start_urls:
                yield scrapy.Request(
                    u,
                    callback=self.parse_httpbin,
                    errback=self.errback_httpbin,
                    dont_filter=True,
                )

        def parse_httpbin(self, response):
            self.logger.info("Got successful response from {}".format(response.url))
            # do something useful here...

        def errback_httpbin(self, failure):
            # log all failures
            self.logger.error(repr(failure))

            # in case you want to do something special for some errors,
            # you may need the failure's type:

            if failure.check(HttpError):
                # these exceptions come from HttpError spider middleware
                # you can get the non-200 response
                response = failure.value.response
                self.logger.error("HttpError on %s", response.url)

            elif failure.check(DNSLookupError):
                # this is the original request
                request = failure.request
                self.logger.error("DNSLookupError on %s", request.url)

            elif failure.check(TimeoutError, TCPTimedOutError):
                request = failure.request
                self.logger.error("TimeoutError on %s", request.url)

.. _errback-cb_kwargs:

Accessing additional data in errback functions
----------------------------------------------

In case of a failure to process the request, you may be interested in
accessing arguments to the callback functions so you can process further
based on the arguments in the errback. The following example shows how to
achieve this by using ``Failure.request.cb_kwargs``:

.. code-block:: python

    def parse(self, response):
        request = scrapy.Request(
            "http://www.example.com/index.html",
            callback=self.parse_page2,
            errback=self.errback_page2,
            cb_kwargs=dict(main_url=response.url),
        )
        yield request

    def parse_page2(self, response, main_url):
        pass

    def errback_page2(self, failure):
        yield dict(
            main_url=failure.request.cb_kwargs["main_url"],
        )

.. _request-fingerprints:

Request fingerprints
--------------------

There are some aspects of scraping, such as filtering out duplicate requests
(see :setting:`DUPEFILTER_CLASS`) or caching responses (see
:setting:`HTTPCACHE_POLICY`), where you need the ability to generate a short,
unique identifier from a :class:`~scrapy.Request` object: a request
fingerprint.

You often do not need to worry about request fingerprints; the default request
fingerprinter works for most projects.

However, there is no universal way to generate a unique identifier from a
request, because different situations require comparing requests differently.
For example, sometimes you may need to compare URLs case-insensitively, include
URL fragments, exclude certain URL query parameters, include some or all
headers, etc.

To change how request fingerprints are built for your requests, use the
:setting:`REQUEST_FINGERPRINTER_CLASS` setting.

.. setting:: REQUEST_FINGERPRINTER_CLASS

REQUEST_FINGERPRINTER_CLASS
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Default: :class:`scrapy.utils.request.RequestFingerprinter`

A :ref:`request fingerprinter class <custom-request-fingerprinter>` or its
import path.

.. autoclass:: scrapy.utils.request.RequestFingerprinter

.. _custom-request-fingerprinter:

Writing your own request fingerprinter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A request fingerprinter is a :ref:`component <topics-components>` that must
implement the following method:

.. currentmodule:: None

.. method:: fingerprint(self, request: scrapy.Request)

   Return a :class:`bytes` object that uniquely identifies *request*.

   See also :ref:`request-fingerprint-restrictions`.

.. currentmodule:: scrapy.http

The :meth:`fingerprint` method of the default request fingerprinter,
:class:`scrapy.utils.request.RequestFingerprinter`, uses
:func:`scrapy.utils.request.fingerprint` with its default parameters. For some
common use cases you can use :func:`scrapy.utils.request.fingerprint` as well
in your :meth:`fingerprint` method implementation:

.. autofunction:: scrapy.utils.request.fingerprint

For example, to take the value of a request header named ``X-ID`` into
account:

.. code-block:: python

    # my_project/settings.py
    REQUEST_FINGERPRINTER_CLASS = "my_project.utils.RequestFingerprinter"

    # my_project/utils.py
    from scrapy.utils.request import fingerprint

    class RequestFingerprinter:
        def fingerprint(self, request):
            return fingerprint(request, include_headers=["X-ID"])

You can also write your own fingerprinting logic from scratch.

However, if you do not use :func:`scrapy.utils.request.fingerprint`, make sure
you use :class:`~weakref.WeakKeyDictionary` to cache request fingerprints:

-   Caching saves CPU by ensuring that fingerprints are calculated only once
    per request, and not once per Scrapy component that needs the fingerprint
    of a request.

-   Using :class:`~weakref.WeakKeyDictionary` saves memory by ensuring that
    request objects do not stay in memory forever just because you have
    references to them in your cache dictionary.

For example, to take into account only the URL of a request, without any prior
URL canonicalization or taking the request method or body into account:

.. code-block:: python

    from hashlib import sha1
    from weakref import WeakKeyDictionary

    from scrapy.utils.python import to_bytes

    class RequestFingerprinter:
        cache = WeakKeyDictionary()

        def fingerprint(self, request):
            if request not in self.cache:
                fp = sha1()
                fp.update(to_bytes(request.url))
                self.cache[request] = fp.digest()
            return self.cache[request]

If you need to be able to override the request fingerprinting for arbitrary
requests from your spider callbacks, you may implement a request fingerprinter
that reads fingerprints from :attr:`request.meta <scrapy.Request.meta>`
when available, and then falls back to
:func:`scrapy.utils.request.fingerprint`. For example:

.. code-block:: python

    from scrapy.utils.request import fingerprint

    class RequestFingerprinter:
        def fingerprint(self, request):
            if "fingerprint" in request.meta:
                return request.meta["fingerprint"]
            return fingerprint(request)

If you need to reproduce the same fingerprinting algorithm as Scrapy 2.6, use
the following request fingerprinter:

.. code-block:: python

    from hashlib import sha1
    from weakref import WeakKeyDictionary

    from scrapy.utils.python import to_bytes
    from w3lib.url import canonicalize_url

    class RequestFingerprinter:
        cache = WeakKeyDictionary()

        def fingerprint(self, request):
            if request not in self.cache:
                fp = sha1()
                fp.update(to_bytes(request.method))
                fp.update(to_bytes(canonicalize_url(request.url)))
                fp.update(request.body or b"")
                self.cache[request] = fp.digest()
            return self.cache[request]

.. _request-fingerprint-restrictions:

Request fingerprint restrictions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Scrapy components that use request fingerprints may impose additional
restrictions on the format of the fingerprints that your :ref:`request
fingerprinter <custom-request-fingerprinter>` generates.

The following built-in Scrapy components have such restrictions:

-   :class:`scrapy.extensions.httpcache.FilesystemCacheStorage` (default
    value of :setting:`HTTPCACHE_STORAGE`)

    Request fingerprints must be at least 1 byte long.

    Path and filename length limits of the file system of
    :setting:`HTTPCACHE_DIR` also apply. Inside :setting:`HTTPCACHE_DIR`,
    the following directory structure is created:

    -   :attr:`.Spider.name`

        -   first byte of a request fingerprint as hexadecimal

            -   fingerprint as hexadecimal

                -   filenames up to 16 characters long

    For example, if a request fingerprint is made of 20 bytes (default),
    :setting:`HTTPCACHE_DIR` is ``'/home/user/project/.scrapy/httpcache'``,
    and the name of your spider is ``'my_spider'`` your file system must
    support a file path like::

        /home/user/project/.scrapy/httpcache/my_spider/01/0123456789abcdef0123456789abcdef01234567/response_headers

-   :class:`scrapy.extensions.httpcache.DbmCacheStorage`

    The underlying DBM implementation must support keys as long as twice
    the number of bytes of a request fingerprint, plus 5. For example,
    if a request fingerprint is made of 20 bytes (default),
    45-character-long keys must be supported.

.. _topics-request-meta:

Request.meta special keys
=========================

The :attr:`.Request.meta` attribute can contain any arbitrary data, but there
are some special keys recognized by Scrapy and its built-in extensions.

Those are:

* :reqmeta:`allow_offsite`
* :reqmeta:`autothrottle_dont_adjust_delay`
* :reqmeta:`bindaddress`
* :reqmeta:`cookiejar`
* :reqmeta:`dont_cache`
* :reqmeta:`dont_merge_cookies`
* :reqmeta:`dont_obey_robotstxt`
* :reqmeta:`dont_redirect`
* :reqmeta:`dont_retry`
* :reqmeta:`download_fail_on_dataloss`
* :reqmeta:`download_latency`
* :reqmeta:`download_maxsize`
* :reqmeta:`download_warnsize`
* :reqmeta:`download_timeout`
* ``ftp_password`` (See :setting:`FTP_PASSWORD` for more info)
* ``ftp_user`` (See :setting:`FTP_USER` for more info)
* :reqmeta:`handle_httpstatus_all`
* :reqmeta:`handle_httpstatus_list`
* :reqmeta:`is_start_request`
* :reqmeta:`max_retry_times`
* :reqmeta:`proxy`
* :reqmeta:`redirect_reasons`
* :reqmeta:`redirect_urls`
* :reqmeta:`referrer_policy`

.. reqmeta:: bindaddress

bindaddress
-----------

The default local outgoing address for download-handler connections.

This meta value can be either:

- a host address as a string (e.g. ``"127.0.0.2"``), in which case the local
  port is chosen automatically, or

- a ``(host, port)`` tuple (e.g. ``("127.0.0.2", 50000)``) to bind to both a
  specific local interface and a specific local port.

For example:

.. code-block:: python

    Request(
        "https://example.org",
        meta={"bindaddress": "127.0.0.2"},
    )

.. code-block:: python

    Request(
        "https://example.org",
        meta={"bindaddress": ("127.0.0.2", 50000)},
    )

If not set, built-in HTTP download handlers use the value of
:setting:`DOWNLOAD_BIND_ADDRESS` as the default bind address.
Set the :reqmeta:`bindaddress` request meta key to override it for a
specific request.

This meta key is not supported by
:class:`~scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler`, although
that handler does support the :setting:`DOWNLOAD_BIND_ADDRESS` setting.

.. reqmeta:: download_timeout

download_timeout
----------------

The amount of time (in secs) that the downloader will wait before timing out.
See also: :setting:`DOWNLOAD_TIMEOUT`.
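
For example, to allow a specific request up to 30 seconds before timing out:

.. code-block:: python

    Request(
        "https://example.org",
        meta={"download_timeout": 30},
    )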

.. reqmeta:: download_latency

download_latency
----------------

The amount of time spent fetching the response, measured from when the request
was started, i.e. from when the HTTP message was sent over the network. This
meta key only becomes available once the response has been downloaded. While
most other meta keys are used to control Scrapy behavior, this one is supposed
to be read-only.

.. reqmeta:: download_fail_on_dataloss

download_fail_on_dataloss
-------------------------

Whether or not to fail on broken responses. See:
:setting:`DOWNLOAD_FAIL_ON_DATALOSS`.

.. reqmeta:: max_retry_times

max_retry_times
---------------

This meta key is used to set the maximum number of retries per request. When
set, the :reqmeta:`max_retry_times` meta key takes precedence over the
:setting:`RETRY_TIMES` setting.
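
For example, to allow up to 5 retries for a specific request:

.. code-block:: python

    Request(
        "https://example.org",
        meta={"max_retry_times": 5},
    )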

.. _topics-stop-response-download:

Stopping the download of a Response
===================================

Raising a :exc:`~scrapy.exceptions.StopDownload` exception from a handler for the
:class:`~scrapy.signals.bytes_received` or :class:`~scrapy.signals.headers_received`
signals will stop the download of a given response. See the following example:

.. code-block:: python

    import scrapy

    class StopSpider(scrapy.Spider):
        name = "stop"
        start_urls = ["https://docs.scrapy.org/en/latest/"]

        @classmethod
        def from_crawler(cls, crawler):
            spider = super().from_crawler(crawler)
            crawler.signals.connect(
                spider.on_bytes_received, signal=scrapy.signals.bytes_received
            )
            return spider

        def parse(self, response):
            # 'last_chars' shows that the full response was not downloaded
            yield {"len": len(response.text), "last_chars": response.text[-40:]}

        def on_bytes_received(self, data, request, spider):
            raise scrapy.exceptions.StopDownload(fail=False)

which produces the following output::

    2020-05-19 17:26:12 [scrapy.core.engine] INFO: Spider opened
    2020-05-19 17:26:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2020-05-19 17:26:13 [scrapy.core.downloader.handlers.http11] DEBUG: Download stopped for <GET https://docs.scrapy.org/en/latest/> from signal handler StopSpider.on_bytes_received
    2020-05-19 17:26:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://docs.scrapy.org/en/latest/> (referer: None) ['download_stopped']
    2020-05-19 17:26:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://docs.scrapy.org/en/latest/>
    {'len': 279, 'last_chars': 'dth, initial-scale=1.0">\n  \n  <title>Scr'}
    2020-05-19 17:26:13 [scrapy.core.engine] INFO: Closing spider (finished)

By default, resulting responses are handled by their corresponding errbacks. To
call their callback instead, like in this example, pass ``fail=False`` to the
:exc:`~scrapy.exceptions.StopDownload` exception.

.. _topics-request-response-ref-request-subclasses:

Request subclasses
==================

Here is the list of built-in :class:`~scrapy.Request` subclasses. You can also subclass
it to implement your own custom functionality.

FormRequest objects
-------------------

The FormRequest class extends the base :class:`~scrapy.Request` with functionality for
dealing with HTML forms. It uses `lxml.html forms`_  to pre-populate form
fields with form data from :class:`Response` objects.

.. _lxml.html forms: https://lxml.de/lxmlhtml.html#forms

.. currentmodule:: None

.. class:: scrapy.FormRequest(url, [formdata, ...])
    :canonical: scrapy.http.request.form.FormRequest

    The :class:`~scrapy.FormRequest` class adds a new keyword parameter to the ``__init__()`` method. The
    remaining arguments are the same as for the :class:`~scrapy.Request` class and are
    not documented here.

    :param formdata: is a dictionary (or iterable of (key, value) tuples)
       containing HTML form data which will be URL-encoded and assigned to the
       body of the request.
    :type formdata: dict or collections.abc.Iterable

    The :class:`~scrapy.FormRequest` objects support the following class method in
    addition to the standard :class:`~scrapy.Request` methods:

    .. classmethod:: from_response(response, [formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])

       Returns a new :class:`~scrapy.FormRequest` object with its form field values
       pre-populated with those found in the HTML ``<form>`` element contained
       in the given response. For an example see
       :ref:`topics-request-response-ref-request-userlogin`.

       The policy is to automatically simulate a click, by default, on any form
       control that looks clickable, like an ``<input type="submit">``. Even
       though this is quite convenient, and often the desired behaviour,
       sometimes it can cause problems which could be hard to debug. For
       example, when working with forms that are filled and/or submitted using
       JavaScript, the default :meth:`from_response` behaviour may not be the
       most appropriate. To disable this behaviour you can set the
       ``dont_click`` argument to ``True``. Also, if you want to change the
       control clicked (instead of disabling it) you can use the
       ``clickdata`` argument.

       .. caution:: Using this method with select elements which have leading
          or trailing whitespace in the option values will not work due to a
          `bug in lxml`_, which should be fixed in lxml 3.8 and above.

       :param response: the response containing an HTML form which will be used
          to pre-populate the form fields
       :type response: :class:`~scrapy.http.Response` object

       :param formname: if given, the form with name attribute set to this value will be used.
       :type formname: str

       :param formid: if given, the form with id attribute set to this value will be used.
       :type formid: str

       :param formxpath: if given, the first form that matches the xpath will be used.
       :type formxpath: str

       :param formcss: if given, the first form that matches the css selector will be used.
       :type formcss: str

       :param formnumber: the number of the form to use, when the response contains
          multiple forms. The first one (and also the default) is ``0``.
       :type formnumber: int

       :param formdata: fields to override in the form data. If a field was
          already present in the response ``<form>`` element, its value is
          overridden by the one passed in this parameter. If a value passed in
          this parameter is ``None``, the field will not be included in the
          request, even if it was present in the response ``<form>`` element.
       :type formdata: dict

       :param clickdata: attributes to look up the control clicked. If it's not
         given, the form data will be submitted simulating a click on the
         first clickable element. In addition to html attributes, the control
         can be identified by its zero-based index relative to other
         submittable inputs inside the form, via the ``nr`` attribute.
       :type clickdata: dict

       :param dont_click: If True, the form data will be submitted without
         clicking any element.
       :type dont_click: bool

       The other parameters of this class method are passed directly to the
       :class:`~scrapy.FormRequest` ``__init__()`` method.

.. currentmodule:: scrapy.http

Request usage examples
----------------------

Using FormRequest to send data via HTTP POST
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you want to simulate an HTML form POST in your spider and send a couple of
key-value fields, you can return a :class:`~scrapy.FormRequest` object (from your
spider) like this:

.. skip: next
.. code-block:: python

   return [
       FormRequest(
           url="http://www.example.com/post/action",
           formdata={"name": "John Doe", "age": "27"},
           callback=self.after_post,
       )
   ]

.. _topics-request-response-ref-request-userlogin:

Using FormRequest.from_response() to simulate a user login
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is usual for websites to provide pre-populated form fields through ``<input
type="hidden">`` elements, such as session-related data or authentication
tokens (for login pages). When scraping, you'll want these fields to be
automatically pre-populated and only override a couple of them, such as the
user name and password. You can use the :meth:`.FormRequest.from_response`
method for this job. Here's an example spider which uses it:

.. code-block:: python

    import scrapy

    def authentication_failed(response):
        # TODO: Check the contents of the response and return True if it failed
        # or False if it succeeded.
        pass

    class LoginSpider(scrapy.Spider):
        name = "example.com"
        start_urls = ["http://www.example.com/users/login.php"]

        def parse(self, response):
            return scrapy.FormRequest.from_response(
                response,
                formdata={"username": "john", "password": "secret"},
                callback=self.after_login,
            )

        def after_login(self, response):
            if authentication_failed(response):
                self.logger.error("Login failed")
                return

            # continue scraping with authenticated session...

JsonRequest
-----------

The JsonRequest class extends the base :class:`~scrapy.Request` class with functionality for
dealing with JSON requests.

.. class:: JsonRequest(url, [... data, dumps_kwargs])

   The :class:`JsonRequest` class adds two new keyword parameters to the ``__init__()`` method. The
   remaining arguments are the same as for the :class:`~scrapy.Request` class and are
   not documented here.

   Using the :class:`JsonRequest` will set the ``Content-Type`` header to ``application/json``
   and the ``Accept`` header to ``application/json, text/javascript, */*; q=0.01``.

   :param data: is any JSON serializable object that needs to be JSON encoded and assigned to the body.
      If the :attr:`~scrapy.Request.body` argument is provided, this parameter will be ignored.
      If the :attr:`~scrapy.Request.body` argument is not provided and the
      ``data`` argument is provided the :attr:`~scrapy.Request.method` will be
      set to ``'POST'`` automatically.
   :type data: object

   :param dumps_kwargs: Parameters that will be passed to underlying :func:`json.dumps` method which is used to serialize
       data into JSON format.
   :type dumps_kwargs: dict

   .. autoattribute:: JsonRequest.attributes

JsonRequest usage example
-------------------------

Sending a JSON POST request with a JSON payload:

.. skip: next
.. code-block:: python

   data = {
       "name1": "value1",
       "name2": "value2",
   }
   yield JsonRequest(url="http://www.example.com/post/action", data=data)

Response objects
================

.. autoclass:: Response

    :param url: the URL of this response
    :type url: str

    :param status: the HTTP status of the response. Defaults to ``200``.
    :type status: int

    :param headers: the headers of this response. The dict values can be strings
       (for single valued headers) or lists (for multi-valued headers).
    :type headers: dict

    :param body: the response body. To access the decoded text as a string, use
       ``response.text`` from an encoding-aware
       :ref:`Response subclass <topics-request-response-ref-response-subclasses>`,
       such as :class:`TextResponse`.
    :type body: bytes

    :param flags: is a list containing the initial values for the
       :attr:`Response.flags` attribute. If given, the list will be shallow
       copied.
    :type flags: list

    :param request: the initial value of the :attr:`Response.request` attribute.
        This represents the :class:`~scrapy.Request` that generated this response.
    :type request: scrapy.Request

    :param certificate: an object representing the server's SSL certificate.
    :type certificate: twisted.internet.ssl.Certificate

    :param ip_address: The IP address of the server from which the Response originated.
    :type ip_address: :class:`ipaddress.IPv4Address` or :class:`ipaddress.IPv6Address`

    :param protocol: The protocol that was used to download the response.
        For instance: "HTTP/1.0", "HTTP/1.1", "h2"
    :type protocol: :class:`str`

    .. attribute:: Response.url

        A string containing the URL of the response.

        This attribute is read-only. To change the URL of a Response use
        :meth:`replace`.

    .. attribute:: Response.status

        An integer representing the HTTP status of the response. Example: ``200``,
        ``404``.

    .. attribute:: Response.headers

        A dictionary-like (:class:`scrapy.http.headers.Headers`) object which contains
        the response headers. Values can be accessed using
        :meth:`~scrapy.http.headers.Headers.get` to return the first header value with
        the specified name or :meth:`~scrapy.http.headers.Headers.getlist` to return
        all header values with the specified name. For example, this call will give you
        all cookies in the headers::

            response.headers.getlist('Set-Cookie')

    .. attribute:: Response.body

        The response body as bytes.

        If you want the body as a string, use :attr:`TextResponse.text` (only
        available in :class:`TextResponse` and subclasses).

        This attribute is read-only. To change the body of a Response use
        :meth:`replace`.

    .. attribute:: Response.request

        The :class:`~scrapy.Request` object that generated this response. This attribute is
        assigned in the Scrapy engine, after the response and the request have passed
        through all :ref:`Downloader Middlewares <topics-downloader-middleware>`.
        In particular, this means that:

        - HTTP redirections will create a new request from the request before
          redirection. The new request keeps most of the metadata and original
          request attributes, and it is the new request, not the original one,
          that gets assigned to the redirected response.

        - Response.request.url doesn't always equal Response.url

        - This attribute is only available in the spider code, and in the
          :ref:`Spider Middlewares <topics-spider-middleware>`, but not in
          Downloader Middlewares (although you have the Request available there
          by other means) nor in handlers of the :signal:`response_downloaded`
          signal.

    .. attribute:: Response.meta

        A shortcut to the :attr:`~scrapy.Request.meta` attribute of the
        :attr:`Response.request` object (i.e. ``self.request.meta``).

        Unlike the :attr:`Response.request` attribute, the :attr:`Response.meta`
        attribute is propagated along redirects and retries, so you will get
        the original :attr:`.Request.meta` sent from your spider.

        .. seealso:: :attr:`.Request.meta` attribute

    .. attribute:: Response.cb_kwargs

        A shortcut to the :attr:`~scrapy.Request.cb_kwargs` attribute of the
        :attr:`Response.request` object (i.e. ``self.request.cb_kwargs``).

        Unlike the :attr:`Response.request` attribute, the
        :attr:`Response.cb_kwargs` attribute is propagated along redirects and
        retries, so you will get the original :attr:`.Request.cb_kwargs` sent from your spider.

        .. seealso:: :attr:`.Request.cb_kwargs` attribute

    .. attribute:: Response.flags

        A list that contains flags for this response. Flags are labels used for
        tagging Responses. For example: ``'cached'``, ``'redirected'``, etc.
        They're also shown on the string representation of the Response
        (``__str__()`` method), which is used by the engine for logging.

    .. attribute:: Response.certificate

        A :class:`twisted.internet.ssl.Certificate` object representing
        the server's SSL certificate.

        Only populated for ``https`` responses, ``None`` otherwise.

    .. attribute:: Response.ip_address

        The IP address of the server from which the Response originated.

        This attribute is currently only populated by the HTTP 1.1 download
        handler, i.e. for ``http(s)`` responses. For other handlers,
        :attr:`ip_address` is always ``None``.

    .. attribute:: Response.protocol

        The protocol that was used to download the response.
        For instance: "HTTP/1.0", "HTTP/1.1"

        This attribute is currently only populated by the HTTP download
        handlers, i.e. for ``http(s)`` responses. For other handlers,
        :attr:`protocol` is always ``None``.

    .. autoattribute:: Response.attributes

    .. method:: Response.copy()

       Returns a new Response which is a copy of this Response.

    .. method:: Response.replace([url, status, headers, body, request, flags, cls])

       Returns a Response object with the same members, except for those members
       given new values by whichever keyword arguments are specified. The
       attribute :attr:`Response.meta` is copied by default.

    .. method:: Response.urljoin(url)

        Constructs an absolute url by combining the Response's :attr:`url` with
        a possible relative url.

        This is a wrapper over :func:`~urllib.parse.urljoin`; it's merely an alias for
        making this call::

            urllib.parse.urljoin(response.url, url)
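
        For example, if :attr:`Response.url` is ``https://example.com/a/``,
        then ``response.urljoin("b.html")`` returns
        ``https://example.com/a/b.html``.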

    .. automethod:: Response.follow

    .. automethod:: Response.follow_all

.. _topics-request-response-ref-response-subclasses:

Response subclasses
===================

Here is the list of available built-in Response subclasses. You can also
subclass the Response class to implement your own functionality.

TextResponse objects
--------------------

.. class:: TextResponse(url, [encoding[, ...]])

    :class:`TextResponse` objects add encoding capabilities to the base
    :class:`Response` class, which is meant to be used only for binary data,
    such as images, sounds or any media file.

    :class:`TextResponse` objects support a new ``__init__()`` method argument, in
    addition to the base :class:`Response` objects. The remaining functionality
    is the same as for the :class:`Response` class and is not documented here.

    :param encoding: is a string which contains the encoding to use for this
       response. If you create a :class:`TextResponse` object with a string as
       body, it will be converted to bytes encoded using this encoding. If
       *encoding* is ``None`` (default), the encoding will be looked up in the
       response headers and body instead.
    :type encoding: str

    :class:`TextResponse` objects support the following attributes in addition
    to the standard :class:`Response` ones:

    .. attribute:: TextResponse.text

       Response body, as a string.

       The same as ``response.body.decode(response.encoding)``, but the
       result is cached after the first call, so you can access
       ``response.text`` multiple times without extra overhead.

       .. note::

            ``str(response.body)`` is not a correct way to convert the response
            body into a string:

            .. code-block:: pycon

                >>> str(b"body")
                "b'body'"

    .. attribute:: TextResponse.encoding

       A string with the encoding of this response. The encoding is resolved by
       trying the following mechanisms, in order:

       1. the encoding passed in the ``__init__()`` method ``encoding`` argument

       2. the encoding declared in the Content-Type HTTP header. If this
          encoding is not valid (i.e. unknown), it is ignored and the next
          resolution mechanism is tried.

       3. the encoding declared in the response body. The TextResponse class
          doesn't provide any special functionality for this. However, the
          :class:`HtmlResponse` and :class:`XmlResponse` classes do.

       4. the encoding inferred by looking at the response body. This is the
          most fragile method, but also the last one tried.

    .. attribute:: TextResponse.selector

        A :class:`~scrapy.Selector` instance using the response as
        target. The selector is lazily instantiated on first access.

    .. autoattribute:: TextResponse.attributes

    :class:`TextResponse` objects support the following methods in addition to
    the standard :class:`Response` ones:

    .. method:: TextResponse.jmespath(query)

        A shortcut to ``TextResponse.selector.jmespath(query)``::

            response.jmespath('object.[*]')

    .. method:: TextResponse.xpath(query)

        A shortcut to ``TextResponse.selector.xpath(query)``::

            response.xpath('//p')

    .. method:: TextResponse.css(query)

        A shortcut to ``TextResponse.selector.css(query)``::

            response.css('p')

    .. automethod:: TextResponse.follow

    .. automethod:: TextResponse.follow_all

    .. automethod:: TextResponse.json()

        Returns a Python object deserialized from the JSON document.
        The result is cached after the first call.
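
        For example::

            data = response.json()  # e.g. {"name": "value"} for body b'{"name": "value"}'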

    .. method:: TextResponse.urljoin(url)

        Constructs an absolute url by combining the Response's base url with
        a possible relative url. The base url is extracted from the
        ``<base>`` tag, or is just :attr:`Response.url` if there is no such
        tag.

HtmlResponse objects
--------------------

.. class:: HtmlResponse(url[, ...])

    The :class:`HtmlResponse` class is a subclass of :class:`TextResponse`
    which adds encoding auto-discovering support by looking into the HTML `meta
    http-equiv`_ attribute.  See :attr:`TextResponse.encoding`.

.. _meta http-equiv: https://www.w3schools.com/TAGS/att_meta_http_equiv.asp

XmlResponse objects
-------------------

.. class:: XmlResponse(url[, ...])

    The :class:`XmlResponse` class is a subclass of :class:`TextResponse` which
    adds encoding auto-discovering support by looking into the XML declaration
    line.  See :attr:`TextResponse.encoding`.

JsonResponse objects
--------------------

.. class:: JsonResponse(url[, ...])

    The :class:`JsonResponse` class is a subclass of :class:`TextResponse`
    that is used when the response has a `JSON MIME type
    <https://mimesniff.spec.whatwg.org/#json-mime-type>`_ in its ``Content-Type``
    header.


.. _topics-scheduler:

=========
Scheduler
=========

.. module:: scrapy.core.scheduler

The scheduler component receives requests from the :ref:`engine <component-engine>`
and stores them into persistent and/or non-persistent data structures. It also
pulls those requests back out and feeds them to the engine when the engine asks
for the next request to be downloaded.

Overriding the default scheduler
================================

You can use your own custom scheduler class by supplying its full
Python path in the :setting:`SCHEDULER` setting.
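
For example, in your project's ``settings.py`` (the class path below is a
hypothetical placeholder):

.. code-block:: python

    SCHEDULER = "myproject.schedulers.CustomScheduler"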

Minimal scheduler interface
===========================

.. autoclass:: BaseScheduler
   :members:
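
For illustration, here is a minimal in-memory FIFO scheduler sketch, assuming
only the abstract methods need to be implemented (no duplicate filtering and
no persistence, so it is not production-ready):

.. code-block:: python

    from collections import deque

    from scrapy.core.scheduler import BaseScheduler


    class SimpleFifoScheduler(BaseScheduler):
        """Illustrative only: keeps all pending requests in memory."""

        def __init__(self):
            self._queue = deque()

        def enqueue_request(self, request) -> bool:
            # Returning True signals that the request was stored.
            self._queue.append(request)
            return True

        def next_request(self):
            # Returning None tells the engine there is nothing to download yet.
            return self._queue.popleft() if self._queue else None

        def has_pending_requests(self) -> bool:
            return bool(self._queue)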

Default scheduler
=================

.. autoclass:: Scheduler()
   :members:
   :special-members: __init__, __len__

Priority queues
===============

.. autoclass:: scrapy.pqueues.DownloaderAwarePriorityQueue
.. autoclass:: scrapy.pqueues.ScrapyPriorityQueue


.. _topics-scrapyd:

=======
Scrapyd
=======

Scrapyd has been moved into a separate project.

Its documentation is now hosted at:

    https://scrapyd.readthedocs.io/en/latest/


.. _topics-selectors:

=========
Selectors
=========

When you're scraping web pages, the most common task you need to perform is
to extract data from the HTML source. There are several libraries available to
achieve this, such as:

-   `BeautifulSoup`_ is a very popular web scraping library among Python
    programmers which constructs a Python object based on the structure of the
    HTML code and also deals with bad markup reasonably well, but it has one
    drawback: it's slow.

-   `lxml`_ is an XML parsing library (which also parses HTML) with a pythonic
    API based on :mod:`~xml.etree.ElementTree`. (lxml is not part of the Python
    standard library.)

Scrapy comes with its own mechanism for extracting data: selectors, so named
because they "select" certain parts of the HTML document, specified either by
`XPath`_ or `CSS`_ expressions.

`XPath`_ is a language for selecting nodes in XML documents, which can also be
used with HTML. `CSS`_ is a language for applying styles to HTML documents. It
defines selectors to associate those styles with specific HTML elements.

.. note::
    Scrapy Selectors is a thin wrapper around the `parsel`_ library; the
    purpose of this wrapper is to provide better integration with Scrapy
    Response objects.

    `parsel`_ is a stand-alone web scraping library which can be used without
    Scrapy. It uses the `lxml`_ library under the hood, and implements an easy
    API on top of the lxml API. This means Scrapy selectors are very similar
    in speed and parsing accuracy to lxml.

.. _BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
.. _lxml: https://lxml.de/
.. _XPath: https://www.w3.org/TR/xpath/all/
.. _CSS: https://www.w3.org/TR/selectors
.. _parsel: https://parsel.readthedocs.io/en/latest/

Using selectors
===============

Constructing selectors
----------------------

.. highlight:: python

.. skip: start

Response objects expose a :class:`~scrapy.Selector` instance
on ``.selector`` attribute:

.. code-block:: pycon

    >>> response.selector.xpath("//span/text()").get()
    'good'

Querying responses using XPath and CSS is so common that responses include two
more shortcuts: ``response.xpath()`` and ``response.css()``:

.. code-block:: pycon

    >>> response.xpath("//span/text()").get()
    'good'
    >>> response.css("span::text").get()
    'good'

.. skip: end

Scrapy selectors are instances of the :class:`~scrapy.Selector` class,
constructed by passing either a :class:`~scrapy.http.TextResponse` object or
markup as a string (in the ``text`` argument).

Usually there is no need to construct Scrapy selectors manually: the
``response`` object is available in spider callbacks, so in most cases
it is more convenient to use the ``response.css()`` and ``response.xpath()``
shortcuts. By using ``response.selector`` or one of these shortcuts
you can also ensure the response body is parsed only once.

But if required, it is possible to use ``Selector`` directly.
Constructing from text:

.. code-block:: pycon

    >>> from scrapy.selector import Selector
    >>> body = "<html><body><span>good</span></body></html>"
    >>> Selector(text=body).xpath("//span/text()").get()
    'good'

Constructing from a response (:class:`~scrapy.http.HtmlResponse` is one of the
:class:`~scrapy.http.TextResponse` subclasses):

.. code-block:: pycon

    >>> from scrapy.selector import Selector
    >>> from scrapy.http import HtmlResponse
    >>> response = HtmlResponse(url="http://example.com", body=body, encoding="utf-8")
    >>> Selector(response=response).xpath("//span/text()").get()
    'good'

``Selector`` automatically chooses the best parsing rules
(XML vs HTML) based on input type.
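
If needed, you can also force a specific parser with the ``type`` argument;
a minimal sketch:

.. code-block:: pycon

    >>> xml = "<root><item>1</item></root>"
    >>> Selector(text=xml, type="xml").xpath("//item/text()").get()
    '1'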

Using selectors
---------------

.. invisible-code-block: python

    html_response = response = load_response(
        "https://docs.scrapy.org/en/latest/_static/selectors-sample1.html",
        "../_static/selectors-sample1.html",
    )

To explain how to use the selectors we'll use the ``Scrapy shell`` (which
provides interactive testing) and an example page located on the Scrapy
documentation server:

    https://docs.scrapy.org/en/latest/_static/selectors-sample1.html

.. _topics-selectors-htmlcode:

For the sake of completeness, here's its full HTML code:

.. literalinclude:: ../_static/selectors-sample1.html
   :language: html

.. highlight:: sh

First, let's open the shell::

    scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html

Then, after the shell loads, you'll have the response available as the
``response`` shell variable, and its attached selector in the
``response.selector`` attribute.

Since we're dealing with HTML, the selector will automatically use an HTML parser.

.. highlight:: python

So, by looking at the :ref:`HTML code <topics-selectors-htmlcode>` of that
page, let's construct an XPath for selecting the text inside the title tag:

.. code-block:: pycon

    >>> response.xpath("//title/text()")
    [<Selector query='//title/text()' data='Example website'>]

To actually extract the textual data, you must call the selector's ``.get()``
or ``.getall()`` method, as follows:

.. code-block:: pycon

    >>> response.xpath("//title/text()").getall()
    ['Example website']
    >>> response.xpath("//title/text()").get()
    'Example website'

``.get()`` always returns a single result; if there are several matches, the
content of the first match is returned; if there are no matches, ``None``
is returned. ``.getall()`` returns a list with all results.

Notice that CSS selectors can select text or attribute nodes using CSS3
pseudo-elements:

.. code-block:: pycon

    >>> response.css("title::text").get()
    'Example website'

As you can see, ``.xpath()`` and ``.css()`` methods return a
:class:`~scrapy.selector.SelectorList` instance, which is a list of new
selectors. This API can be used for quickly selecting nested data:

.. code-block:: pycon

    >>> response.css("img").xpath("@src").getall()
    ['image1_thumb.jpg',
    'image2_thumb.jpg',
    'image3_thumb.jpg',
    'image4_thumb.jpg',
    'image5_thumb.jpg']

If you want to extract only the first matched element, you can call
``.get()`` on the selector (or its alias ``.extract_first()``, commonly used
in previous Scrapy versions):

.. code-block:: pycon

    >>> response.xpath('//div[@id="images"]/a/text()').get()
    'Name: My image 1 '

It returns ``None`` if no element was found:

.. code-block:: pycon

    >>> response.xpath('//div[@id="not-exists"]/text()').get() is None
    True

A default return value can be provided as an argument, to be used instead
of ``None``:

.. code-block:: pycon

    >>> response.xpath('//div[@id="not-exists"]/text()').get(default="not-found")
    'not-found'

Instead of an XPath such as ``'@src'``, it is also possible to query for
attributes using the ``.attrib`` property of a :class:`~scrapy.Selector`:

.. code-block:: pycon

    >>> [img.attrib["src"] for img in response.css("img")]
    ['image1_thumb.jpg',
    'image2_thumb.jpg',
    'image3_thumb.jpg',
    'image4_thumb.jpg',
    'image5_thumb.jpg']

As a shortcut, ``.attrib`` is also available on SelectorList directly;
it returns the attributes of the first matching element:

.. code-block:: pycon

    >>> response.css("img").attrib["src"]
    'image1_thumb.jpg'

This is most useful when only a single result is expected, e.g. when selecting
by id, or selecting unique elements on a web page:

.. code-block:: pycon

    >>> response.css("base").attrib["href"]
    'http://example.com/'

Now we're going to get the base URL and some image links:

.. code-block:: pycon

    >>> response.xpath("//base/@href").get()
    'http://example.com/'

    >>> response.css("base::attr(href)").get()
    'http://example.com/'

    >>> response.css("base").attrib["href"]
    'http://example.com/'

    >>> response.xpath('//a[contains(@href, "image")]/@href').getall()
    ['image1.html',
    'image2.html',
    'image3.html',
    'image4.html',
    'image5.html']

    >>> response.css("a[href*=image]::attr(href)").getall()
    ['image1.html',
    'image2.html',
    'image3.html',
    'image4.html',
    'image5.html']

    >>> response.xpath('//a[contains(@href, "image")]/img/@src').getall()
    ['image1_thumb.jpg',
    'image2_thumb.jpg',
    'image3_thumb.jpg',
    'image4_thumb.jpg',
    'image5_thumb.jpg']

    >>> response.css("a[href*=image] img::attr(src)").getall()
    ['image1_thumb.jpg',
    'image2_thumb.jpg',
    'image3_thumb.jpg',
    'image4_thumb.jpg',
    'image5_thumb.jpg']

.. _topics-selectors-css-extensions:

Extensions to CSS Selectors
---------------------------

Per W3C standards, `CSS selectors`_ do not support selecting text nodes
or attribute values.
But selecting these is so essential in a web scraping context
that Scrapy (parsel) implements a couple of **non-standard pseudo-elements**:

* to select text nodes, use ``::text``
* to select attribute values, use ``::attr(name)`` where *name* is the
  name of the attribute that you want the value of

.. warning::
    These pseudo-elements are Scrapy-/Parsel-specific.
    They will most probably not work with other libraries like
    `lxml`_ or `PyQuery`_.

.. _PyQuery: https://pypi.org/project/pyquery/

Examples:

* ``title::text`` selects child text nodes of a descendant ``<title>`` element:

.. code-block:: pycon

    >>> response.css("title::text").get()
    'Example website'

* ``*::text`` selects all descendant text nodes of the current selector context:

.. skip: next
.. code-block:: pycon

    >>> response.css("#images *::text").getall()
    ['\n   ',
    'Name: My image 1 ',
    '\n   ',
    'Name: My image 2 ',
    '\n   ',
    'Name: My image 3 ',
    '\n   ',
    'Name: My image 4 ',
    '\n   ',
    'Name: My image 5 ',
    '\n  ']

* ``foo::text`` returns no results if the ``foo`` element exists, but contains
  no text (i.e. its text is empty):

.. code-block:: pycon

    >>> response.css("img::text").getall()
    []

This means ``.css('foo::text').get()`` could return ``None`` even if an
element exists. Use ``default=''`` if you always want a string:

.. code-block:: pycon

    >>> response.css("img::text").get()
    >>> response.css("img::text").get(default="")
    ''

* ``a::attr(href)`` selects the *href* attribute value of descendant links:

.. code-block:: pycon

    >>> response.css("a::attr(href)").getall()
    ['image1.html',
    'image2.html',
    'image3.html',
    'image4.html',
    'image5.html']

.. note::
    See also: :ref:`selecting-attributes`.

.. note::
    You cannot chain these pseudo-elements. But in practice it would not
    make much sense: text nodes do not have attributes, and attribute values
    are string values already and do not have child nodes.

.. _CSS Selectors: https://www.w3.org/TR/selectors-3/#selectors

.. _topics-selectors-nesting-selectors:

Nesting selectors
-----------------

The selection methods (``.xpath()`` or ``.css()``) return a list of selectors
of the same type, so you can call the selection methods for those selectors
too. Here's an example:

.. code-block:: pycon

    >>> links = response.xpath('//a[contains(@href, "image")]')
    >>> links.getall()
    ['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg" alt="image1"></a>',
    '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg" alt="image2"></a>',
    '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg" alt="image3"></a>',
    '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg" alt="image4"></a>',
    '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg" alt="image5"></a>']

    >>> for index, link in enumerate(links):
    ...     href_xpath = link.xpath("@href").get()
    ...     img_xpath = link.xpath("img/@src").get()
    ...     print(f"Link number {index} points to url {href_xpath!r} and image {img_xpath!r}")
    ...
    Link number 0 points to url 'image1.html' and image 'image1_thumb.jpg'
    Link number 1 points to url 'image2.html' and image 'image2_thumb.jpg'
    Link number 2 points to url 'image3.html' and image 'image3_thumb.jpg'
    Link number 3 points to url 'image4.html' and image 'image4_thumb.jpg'
    Link number 4 points to url 'image5.html' and image 'image5_thumb.jpg'

.. _selecting-attributes:

Selecting element attributes
----------------------------

There are several ways to get the value of an attribute. First, one can use
XPath syntax:

.. code-block:: pycon

    >>> response.xpath("//a/@href").getall()
    ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

XPath syntax has a few advantages: it is a standard XPath feature, and
``@attributes`` can be used in other parts of an XPath expression, e.g.
to filter by attribute value.

Scrapy also provides an extension to CSS selectors (``::attr(...)``)
that allows you to get attribute values:

.. code-block:: pycon

    >>> response.css("a::attr(href)").getall()
    ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In addition to that, there is the ``.attrib`` property of Selector.
You can use it if you prefer to look up attributes in Python
code, without using XPaths or CSS extensions:

.. code-block:: pycon

    >>> [a.attrib["href"] for a in response.css("a")]
    ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

This property is also available on SelectorList; it returns a dictionary
with the attributes of the first matching element. It is convenient to use
when a selector is expected to give a single result (e.g. when selecting by
element ID, or when selecting a unique element on a page):

.. code-block:: pycon

    >>> response.css("base").attrib
    {'href': 'http://example.com/'}
    >>> response.css("base").attrib["href"]
    'http://example.com/'

The ``.attrib`` property of an empty SelectorList is empty:

.. code-block:: pycon

    >>> response.css("foo").attrib
    {}

Using selectors with regular expressions
----------------------------------------

:class:`~scrapy.Selector` also has a ``.re()`` method for extracting
data using regular expressions. However, unlike using ``.xpath()`` or
``.css()`` methods, ``.re()`` returns a list of strings. So you
can't construct nested ``.re()`` calls.

Here's an example used to extract image names from the :ref:`HTML code
<topics-selectors-htmlcode>` above:

.. code-block:: pycon

    >>> response.xpath('//a[contains(@href, "image")]/text()').re(r"Name:\s*(.*)")
    ['My image 1 ',
    'My image 2 ',
    'My image 3 ',
    'My image 4 ',
    'My image 5 ']

There's an additional helper for ``.re()``, named ``.re_first()``, which is
the counterpart of ``.get()`` (and its alias ``.extract_first()``).
Use it to extract just the first matching string:

.. code-block:: pycon

    >>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r"Name:\s*(.*)")
    'My image 1 '

.. _old-extraction-api:

extract() and extract_first()
-----------------------------

If you're a long-time Scrapy user, you're probably familiar
with the ``.extract()`` and ``.extract_first()`` selector methods. Many blog
posts and tutorials use them as well. These methods are still supported
by Scrapy, and there are **no plans** to deprecate them.

However, the Scrapy usage docs are now written using the ``.get()`` and
``.getall()`` methods. We feel that these new methods result in more concise
and readable code.

The following examples show how these methods map to each other.

1.  ``SelectorList.get()`` is the same as ``SelectorList.extract_first()``:

.. code-block:: pycon

    >>> response.css("a::attr(href)").get()
    'image1.html'
    >>> response.css("a::attr(href)").extract_first()
    'image1.html'

2.  ``SelectorList.getall()`` is the same as ``SelectorList.extract()``:

.. code-block:: pycon

    >>> response.css("a::attr(href)").getall()
    ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
    >>> response.css("a::attr(href)").extract()
    ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

3.  ``Selector.get()`` is the same as ``Selector.extract()``:

.. code-block:: pycon

    >>> response.css("a::attr(href)")[0].get()
    'image1.html'
    >>> response.css("a::attr(href)")[0].extract()
    'image1.html'

4.  For consistency, there is also ``Selector.getall()``, which returns a list:

.. code-block:: pycon

    >>> response.css("a::attr(href)")[0].getall()
    ['image1.html']

So, the main difference is that output of ``.get()`` and ``.getall()`` methods
is more predictable: ``.get()`` always returns a single result, ``.getall()``
always returns a list of all extracted results. With ``.extract()`` method
it was not always obvious if a result is a list or not; to get a single
result either ``.extract()`` or ``.extract_first()`` should be called.

.. _topics-selectors-xpaths:

Working with XPaths
===================

Here are some tips which may help you use XPath with Scrapy selectors
effectively. If you are not yet very familiar with XPath,
you may want to take a look at this `XPath tutorial`_ first.

.. note::
    Some of the tips are based on `this post from Zyte's blog`_.

.. _XPath tutorial: http://www.zvon.org/comp/r/tut-XPath_1.html
.. _this post from Zyte's blog: https://www.zyte.com/blog/xpath-tips-from-the-web-scraping-trenches/

.. _topics-selectors-relative-xpaths:

Working with relative XPaths
----------------------------

Keep in mind that if you are nesting selectors and use an XPath that starts
with ``/``, that XPath will be absolute to the document and not relative to the
``Selector`` you're calling it from.

For example, suppose you want to extract all ``<p>`` elements inside ``<div>``
elements. First, you would get all ``<div>`` elements:

.. code-block:: pycon

    >>> divs = response.xpath("//div")

At first, you may be tempted to use the following approach, which is wrong, as
it actually extracts all ``<p>`` elements from the document, not only those
inside ``<div>`` elements:

.. code-block:: pycon

    >>> for p in divs.xpath("//p"):  # this is wrong - gets all <p> from the whole document
    ...     print(p.get())
    ...

This is the proper way to do it (note the dot prefixing the ``.//p`` XPath):

.. code-block:: pycon

    >>> for p in divs.xpath(".//p"):  # extracts all <p> inside
    ...     print(p.get())
    ...

Another common case would be to extract all direct ``<p>`` children:

.. code-block:: pycon

    >>> for p in divs.xpath("p"):
    ...     print(p.get())
    ...

For more details about relative XPaths see the `Location Paths`_ section in the
XPath specification.

.. _Location Paths: https://www.w3.org/TR/xpath-10/#location-paths

When querying by class, consider using CSS
------------------------------------------

Because an element can contain multiple CSS classes, the XPath way to select elements
by class is the rather verbose::

    *[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]

If you use ``@class='someclass'`` you may end up missing elements that have
other classes, and if you just use ``contains(@class, 'someclass')`` to make up
for that, you may end up with more elements than you want, if they have a
different class name that shares the string ``someclass``.

As it turns out, Scrapy selectors allow you to chain selectors, so most of the time
you can just select by class using CSS and then switch to XPath when needed:

.. code-block:: pycon

    >>> from scrapy import Selector
    >>> sel = Selector(
    ...     text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>'
    ... )
    >>> sel.css(".shout").xpath("./time/@datetime").getall()
    ['2014-07-23 19:00']

This is cleaner than using the verbose XPath trick shown above. Just remember
to use the ``.`` in the XPath expressions that will follow.

Beware of the difference between //node[1] and (//node)[1]
----------------------------------------------------------

``//node[1]`` selects all the nodes occurring first under their respective parents.

``(//node)[1]`` selects all the nodes in the document, and then gets only the first of them.

Example:

.. code-block:: pycon

    >>> from scrapy import Selector
    >>> sel = Selector(
    ...     text="""
    ...     <ul class="list">
    ...         <li>1</li>
    ...         <li>2</li>
    ...         <li>3</li>
    ...     </ul>
    ...     <ul class="list">
    ...         <li>4</li>
    ...         <li>5</li>
    ...         <li>6</li>
    ...     </ul>"""
    ... )
    >>> xp = lambda x: sel.xpath(x).getall()

This gets all first ``<li>`` elements under their respective parents:

.. code-block:: pycon

    >>> xp("//li[1]")
    ['<li>1</li>', '<li>4</li>']

And this gets the first ``<li>`` element in the whole document:

.. code-block:: pycon

    >>> xp("(//li)[1]")
    ['<li>1</li>']

This gets all first ``<li>`` elements under an ``<ul>`` parent:

.. code-block:: pycon

    >>> xp("//ul/li[1]")
    ['<li>1</li>', '<li>4</li>']

And this gets the first ``<li>`` element under an ``<ul>`` parent in the whole document:

.. code-block:: pycon

    >>> xp("(//ul/li)[1]")
    ['<li>1</li>']

Using text nodes in a condition
-------------------------------

When you need to use the text content as argument to an `XPath string function`_,
avoid using ``.//text()`` and use just ``.`` instead.

This is because the expression ``.//text()`` yields a collection of text elements -- a *node-set*.
And when a node-set is converted to a string, which happens when it is passed as argument to
a string function like ``contains()`` or ``starts-with()``, it results in the text for the first element only.

Example:

.. code-block:: pycon

    >>> from scrapy import Selector
    >>> sel = Selector(
    ...     text='<a href="#">Click here to go to the <strong>Next Page</strong></a>'
    ... )

Converting a *node-set* to string:

.. code-block:: pycon

    >>> sel.xpath("//a//text()").getall()  # take a peek at the node-set
    ['Click here to go to the ', 'Next Page']
    >>> sel.xpath("string(//a[1]//text())").getall()  # convert it to string
    ['Click here to go to the ']

A *node* converted to a string, however, puts together the text of itself plus of all its descendants:

.. code-block:: pycon

    >>> sel.xpath("//a[1]").getall()  # select the first node
    ['<a href="#">Click here to go to the <strong>Next Page</strong></a>']
    >>> sel.xpath("string(//a[1])").getall()  # convert it to string
    ['Click here to go to the Next Page']

So, using the ``.//text()`` node-set won't select anything in this case:

.. code-block:: pycon

    >>> sel.xpath("//a[contains(.//text(), 'Next Page')]").getall()
    []

But using the ``.`` to mean the node, works:

.. code-block:: pycon

    >>> sel.xpath("//a[contains(., 'Next Page')]").getall()
    ['<a href="#">Click here to go to the <strong>Next Page</strong></a>']

.. _XPath string function: https://www.w3.org/TR/xpath-10/#section-String-Functions

.. _topics-selectors-xpath-variables:

Variables in XPath expressions
------------------------------

XPath allows you to reference variables in your XPath expressions, using
the ``$somevariable`` syntax. This is somewhat similar to parameterized
queries or prepared statements in the SQL world where you replace
some arguments in your queries with placeholders like ``?``,
which are then substituted with values passed with the query.

Here's an example of matching an element based on its "id" attribute value,
without hard-coding it (as was done in earlier examples):

.. code-block:: pycon

    >>> # `$val` used in the expression, a `val` argument needs to be passed
    >>> response.xpath("//div[@id=$val]/a/text()", val="images").get()
    'Name: My image 1 '

Here's another example, to find the "id" attribute of a ``<div>`` tag containing
five ``<a>`` children (here we pass the value ``5`` as an integer):

.. code-block:: pycon

    >>> response.xpath("//div[count(a)=$cnt]/@id", cnt=5).get()
    'images'

All variable references must have a binding value when calling ``.xpath()``
(otherwise you'll get a ``ValueError: XPath error:`` exception).
This is done by passing as many named arguments as necessary.

`parsel`_, the library powering Scrapy selectors, has more details and examples
on `XPath variables`_.

.. _XPath variables: https://parsel.readthedocs.io/en/latest/usage.html#variables-in-xpath-expressions

.. _removing-namespaces:

Removing namespaces
-------------------

.. skip: start

When dealing with scraping projects, it is often quite convenient to get rid of
namespaces altogether and just work with element names, to write
simpler, more convenient XPaths. You can use the
:meth:`.Selector.remove_namespaces` method for that.

Let's show an example that illustrates this with the Python Insider blog atom feed.

.. highlight:: sh

First, we open the shell with the url we want to scrape::

    $ scrapy shell https://feeds.feedburner.com/PythonInsider

This is how the file starts::

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet ...
    <feed xmlns="http://www.w3.org/2005/Atom"
          xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/"
          xmlns:blogger="http://schemas.google.com/blogger/2008"
          xmlns:georss="http://www.georss.org/georss"
          xmlns:gd="http://schemas.google.com/g/2005"
          xmlns:thr="http://purl.org/syndication/thread/1.0"
          xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
      ...

You can see several namespace declarations including a default
``"http://www.w3.org/2005/Atom"`` and another one using the ``gd:`` prefix for
``"http://schemas.google.com/g/2005"``.

.. highlight:: python

Once in the shell we can try selecting all ``<link>`` objects and see that it
doesn't work (because the Atom XML namespace is obfuscating those nodes):

.. code-block:: pycon

    >>> response.xpath("//link")
    []

But once we call the :meth:`.Selector.remove_namespaces` method, all
nodes can be accessed directly by their names:

.. code-block:: pycon

    >>> response.selector.remove_namespaces()
    >>> response.xpath("//link")
    [<Selector query='//link' data='<link rel="alternate" type="text/html" h'>,
        <Selector query='//link' data='<link rel="next" type="application/atom+'>,
        ...

If you wonder why the namespace removal procedure isn't always called by
default, instead of having to be called manually, this is for two reasons,
which, in order of relevance, are:

1. Removing namespaces requires iterating over and modifying all nodes in the
   document, which is a reasonably expensive operation to perform by default
   for all documents crawled by Scrapy

2. There could be some cases where using namespaces is actually required, in
   case some element names clash between namespaces. These cases are very rare
   though.

.. skip: end

Using EXSLT extensions
----------------------

Being built atop `lxml`_, Scrapy selectors support some `EXSLT`_ extensions
and come with these pre-registered namespaces to use in XPath expressions:

======  =====================================    =======================
prefix  namespace                                usage
======  =====================================    =======================
re      \http://exslt.org/regular-expressions    `regular expressions`_
set     \http://exslt.org/sets                   `set manipulation`_
======  =====================================    =======================

Regular expressions
~~~~~~~~~~~~~~~~~~~

The ``test()`` function, for example, can prove quite useful when XPath's
``starts-with()`` or ``contains()`` are not sufficient.

Example selecting links in list items with a "class" attribute ending with a digit:

.. code-block:: pycon

    >>> from scrapy import Selector
    >>> doc = """
    ... <div>
    ...     <ul>
    ...         <li class="item-0"><a href="link1.html">first item</a></li>
    ...         <li class="item-1"><a href="link2.html">second item</a></li>
    ...         <li class="item-inactive"><a href="link3.html">third item</a></li>
    ...         <li class="item-1"><a href="link4.html">fourth item</a></li>
    ...         <li class="item-0"><a href="link5.html">fifth item</a></li>
    ...     </ul>
    ... </div>
    ... """
    >>> sel = Selector(text=doc, type="html")
    >>> sel.xpath("//li//@href").getall()
    ['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
    >>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').getall()
    ['link1.html', 'link2.html', 'link4.html', 'link5.html']

.. warning:: The C library ``libxslt`` doesn't natively support EXSLT regular
    expressions, so `lxml`_'s implementation uses hooks to Python's ``re`` module.
    Thus, using regexp functions in your XPath expressions may add a small
    performance penalty.

Set operations
~~~~~~~~~~~~~~

These can be handy, for example, for excluding parts of a document tree before
extracting text elements.

Example extracting microdata (sample content taken from https://schema.org/Product)
with groups of itemscopes and corresponding itemprops:

.. skip: next

.. code-block:: pycon

    >>> doc = """
    ... <div itemscope itemtype="http://schema.org/Product">
    ...   <span itemprop="name">Kenmore White 17" Microwave</span>
    ...   <img src="kenmore-microwave-17in.jpg" alt='Kenmore 17" Microwave' />
    ...   <div itemprop="aggregateRating"
    ...     itemscope itemtype="http://schema.org/AggregateRating">
    ...    Rated <span itemprop="ratingValue">3.5</span>/5
    ...    based on <span itemprop="reviewCount">11</span> customer reviews
    ...   </div>
    ...   <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    ...     <span itemprop="price">$55.00</span>
    ...     <link itemprop="availability" href="http://schema.org/InStock" />In stock
    ...   </div>
    ...   Product description:
    ...   <span itemprop="description">0.7 cubic feet countertop microwave.
    ...   Has six preset cooking categories and convenience features like
    ...   Add-A-Minute and Child Lock.</span>
    ...   Customer reviews:
    ...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
    ...     <span itemprop="name">Not a happy camper</span> -
    ...     by <span itemprop="author">Ellie</span>,
    ...     <meta itemprop="datePublished" content="2011-04-01">April 1, 2011
    ...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
    ...       <meta itemprop="worstRating" content = "1">
    ...       <span itemprop="ratingValue">1</span>/
    ...       <span itemprop="bestRating">5</span>stars
    ...     </div>
    ...     <span itemprop="description">The lamp burned out and now I have to replace
    ...     it. </span>
    ...   </div>
    ...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
    ...     <span itemprop="name">Value purchase</span> -
    ...     by <span itemprop="author">Lucas</span>,
    ...     <meta itemprop="datePublished" content="2011-03-25">March 25, 2011
    ...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
    ...       <meta itemprop="worstRating" content = "1"/>
    ...       <span itemprop="ratingValue">4</span>/
    ...       <span itemprop="bestRating">5</span>stars
    ...     </div>
    ...     <span itemprop="description">Great microwave for the price. It is small and
    ...     fits in my apartment.</span>
    ...   </div>
    ...   ...
    ... </div>
    ... """
    >>> sel = Selector(text=doc, type="html")
    >>> for scope in sel.xpath("//div[@itemscope]"):
    ...     print("current scope:", scope.xpath("@itemtype").getall())
    ...     props = scope.xpath(
    ...         """
    ...                 set:difference(./descendant::*/@itemprop,
    ...                                .//*[@itemscope]/*/@itemprop)"""
    ...     )
    ...     print(f"    properties: {props.getall()}")
    ...     print("")
    ...

    current scope: ['http://schema.org/Product']
        properties: ['name', 'aggregateRating', 'offers', 'description', 'review', 'review']

    current scope: ['http://schema.org/AggregateRating']
        properties: ['ratingValue', 'reviewCount']

    current scope: ['http://schema.org/Offer']
        properties: ['price', 'availability']

    current scope: ['http://schema.org/Review']
        properties: ['name', 'author', 'datePublished', 'reviewRating', 'description']

    current scope: ['http://schema.org/Rating']
        properties: ['worstRating', 'ratingValue', 'bestRating']

    current scope: ['http://schema.org/Review']
        properties: ['name', 'author', 'datePublished', 'reviewRating', 'description']

    current scope: ['http://schema.org/Rating']
        properties: ['worstRating', 'ratingValue', 'bestRating']

Here we first iterate over ``itemscope`` elements, and for each one we
look for all ``itemprop`` elements, excluding those that are themselves
inside another ``itemscope``.

.. _EXSLT: https://exslt.github.io/
.. _regular expressions: https://exslt.github.io/regexp/index.html
.. _set manipulation: https://exslt.github.io/set/index.html

Other XPath extensions
----------------------

Scrapy selectors also provide a sorely missed XPath extension function
``has-class`` that returns ``True`` for nodes that have all of the specified
HTML classes.

For the following HTML:

.. code-block:: pycon

    >>> from scrapy.http import HtmlResponse
    >>> response = HtmlResponse(
    ...     url="http://example.com",
    ...     body="""
    ... <html>
    ...     <body>
    ...         <p class="foo bar-baz">First</p>
    ...         <p class="foo">Second</p>
    ...         <p class="bar">Third</p>
    ...         <p>Fourth</p>
    ...     </body>
    ... </html>
    ... """,
    ...     encoding="utf-8",
    ... )

You can use it like this:

.. code-block:: pycon

    >>> response.xpath('//p[has-class("foo")]')
    [<Selector query='//p[has-class("foo")]' data='<p class="foo bar-baz">First</p>'>,
    <Selector query='//p[has-class("foo")]' data='<p class="foo">Second</p>'>]
    >>> response.xpath('//p[has-class("foo", "bar-baz")]')
    [<Selector query='//p[has-class("foo", "bar-baz")]' data='<p class="foo bar-baz">First</p>'>]
    >>> response.xpath('//p[has-class("foo", "bar")]')
    []

So XPath ``//p[has-class("foo", "bar-baz")]`` is roughly equivalent to CSS
``p.foo.bar-baz``. Note that it is slower in most cases, because it's a
pure-Python function that's invoked for every node in question, whereas the
CSS lookup is translated into XPath and thus runs more efficiently.
Performance-wise, its use is best limited to situations that are not easily
described with CSS selectors.

Parsel also simplifies adding your own XPath extensions with
:func:`~parsel.xpathfuncs.set_xpathfunc`.
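
For illustration, here is a minimal sketch registering a hypothetical
``starts-with-digit`` function (note that lxml passes node-sets to extension
functions as Python lists):

.. code-block:: python

    from parsel import Selector
    from parsel.xpathfuncs import set_xpathfunc


    def starts_with_digit(context, nodes):
        # ``nodes`` is a list of strings/elements selected by the argument
        # expression, e.g. ``text()`` below.
        return any(str(node)[:1].isdigit() for node in nodes)


    set_xpathfunc("starts-with-digit", starts_with_digit)

    sel = Selector(text="<p>1st</p><p>second</p>")
    print(sel.xpath("//p[starts-with-digit(text())]").getall())
    # ['<p>1st</p>']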

.. _topics-selectors-ref:

Built-in Selectors reference
============================

.. module:: scrapy.selector
   :synopsis: Selector class

Selector objects
----------------

.. autoclass:: scrapy.Selector

  .. automethod:: xpath

      .. note::

          For convenience, this method can be called as ``response.xpath()``

  .. automethod:: css

      .. note::

          For convenience, this method can be called as ``response.css()``

  .. automethod:: jmespath

      .. note::

          For convenience, this method can be called as ``response.jmespath()``

  .. automethod:: get

     See also: :ref:`old-extraction-api`

  .. autoattribute:: attrib

     See also: :ref:`selecting-attributes`.

  .. automethod:: re

  .. automethod:: re_first

  .. automethod:: register_namespace

  .. automethod:: remove_namespaces

  .. automethod:: __bool__

  .. automethod:: getall

     This method is added to Selector for consistency; it is more useful
     with SelectorList. See also: :ref:`old-extraction-api`

SelectorList objects
--------------------

.. autoclass:: SelectorList

   .. automethod:: xpath

   .. automethod:: css

   .. automethod:: jmespath

   .. automethod:: getall

      See also: :ref:`old-extraction-api`

   .. automethod:: get

      See also: :ref:`old-extraction-api`

   .. automethod:: re

   .. automethod:: re_first

   .. autoattribute:: attrib

      See also: :ref:`selecting-attributes`.

.. _selector-examples:

Examples
========

.. _selector-examples-html:

Selector examples on HTML response
----------------------------------

Here are some :class:`~scrapy.Selector` examples to illustrate several concepts.
In all cases, we assume there is already a :class:`~scrapy.Selector` instantiated with
a :class:`~scrapy.http.HtmlResponse` object like this:

.. code-block:: python

      sel = Selector(html_response)

1. Select all ``<h1>`` elements from an HTML response body, returning a list of
   :class:`~scrapy.Selector` objects (i.e. a :class:`SelectorList` object):

   .. code-block:: python

      sel.xpath("//h1")

2. Extract the text of all ``<h1>`` elements from an HTML response body,
   returning a list of strings:

   .. code-block:: python

      sel.xpath("//h1").getall()  # this includes the h1 tag
      sel.xpath("//h1/text()").getall()  # this excludes the h1 tag

3. Iterate over all ``<p>`` tags and print their class attribute:

   .. code-block:: python

      for node in sel.xpath("//p"):
          print(node.attrib["class"])

.. _selector-examples-xml:

Selector examples on XML response
---------------------------------

.. skip: start

Here are some examples to illustrate concepts for :class:`~scrapy.Selector` objects
instantiated with an :class:`~scrapy.http.XmlResponse` object:

.. code-block:: python

      sel = Selector(xml_response)

1. Select all ``<product>`` elements from an XML response body, returning a list
   of :class:`~scrapy.Selector` objects (i.e. a :class:`SelectorList` object):

   .. code-block:: python

      sel.xpath("//product")

2. Extract all prices from a `Google Base XML feed`_ which requires registering
   a namespace:

   .. code-block:: python

      sel.register_namespace("g", "http://base.google.com/ns/1.0")
      sel.xpath("//g:price").getall()

.. skip: end

.. _Google Base XML feed: https://support.google.com/merchants/answer/14987622


.. _topics-settings:

========
Settings
========

The Scrapy settings allow you to customize the behaviour of all Scrapy
components, including the core, extensions, pipelines and spiders themselves.

The infrastructure of the settings provides a global namespace of key-value mappings
that the code can use to pull configuration values from. The settings can be
populated through different mechanisms, which are described below.

The settings are also the mechanism for selecting the currently active Scrapy
project (in case you have many).

For a list of available built-in settings see: :ref:`topics-settings-ref`.

.. _topics-settings-module-envvar:

Designating the settings
========================

When you use Scrapy, you have to tell it which settings you're using. You can
do this by using an environment variable, ``SCRAPY_SETTINGS_MODULE``.

The value of ``SCRAPY_SETTINGS_MODULE`` should be in Python path syntax, e.g.
``myproject.settings``. Note that the settings module should be on the
Python :ref:`import search path <tut-searchpath>`.
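
For example, assuming your project module is named ``myproject``, in a POSIX
shell::

    export SCRAPY_SETTINGS_MODULE=myproject.settings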

.. _populating-settings:

Populating the settings
=======================

Settings can be populated using different mechanisms, each of which has a
different precedence:

 1. :ref:`Command-line settings <cli-settings>` (highest precedence)
 2. :ref:`Spider settings <spider-settings>`
 3. :ref:`Project settings <project-settings>`
 4. :ref:`Add-on settings <addon-settings>`
 5. :ref:`Command-specific default settings <cmd-default-settings>`
 6. :ref:`Global default settings <default-settings>` (lowest precedence)

.. _cli-settings:

1. Command-line settings
------------------------

Settings set in the command line have the highest precedence, overriding any
other settings.

You can explicitly override one or more settings using the ``-s`` (or
``--set``) command-line option.

.. highlight:: sh

Example::

    scrapy crawl myspider -s LOG_LEVEL=INFO -s LOG_FILE=scrapy.log

.. _spider-settings:

2. Spider settings
------------------

:ref:`Spiders <topics-spiders>` can define their own settings that will take
precedence and override the project ones.

.. note:: :ref:`Pre-crawler settings <pre-crawler-settings>` cannot be defined
    per spider, and :ref:`reactor settings <reactor-settings>` should not have
    a different value per spider when :ref:`running multiple spiders in the
    same process <run-multiple-spiders>`.

One way to do so is by setting their :attr:`~scrapy.Spider.custom_settings`
attribute:

.. code-block:: python

    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"

        custom_settings = {
            "SOME_SETTING": "some value",
        }

It's often better to implement :meth:`~scrapy.Spider.update_settings` instead,
and settings set there should use the ``"spider"`` priority explicitly:

.. code-block:: python

    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"

        @classmethod
        def update_settings(cls, settings):
            super().update_settings(settings)
            settings.set("SOME_SETTING", "some value", priority="spider")

.. versionadded:: 2.11

It's also possible to modify the settings in the
:meth:`~scrapy.Spider.from_crawler` method, e.g. based on :ref:`spider
arguments <spiderargs>` or other logic:

.. code-block:: python

    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            if "some_argument" in kwargs:
                spider.settings.set(
                    "SOME_SETTING", kwargs["some_argument"], priority="spider"
                )
            return spider

.. _project-settings:

3. Project settings
-------------------

Scrapy projects include a settings module, usually a file called
``settings.py``, where you should populate most settings that apply to all your
spiders.

.. seealso:: :ref:`topics-settings-module-envvar`
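
For example, a short sketch of what such a module might contain (the pipeline
path below is a hypothetical placeholder):

.. code-block:: python

    BOT_NAME = "myproject"
    CONCURRENT_REQUESTS_PER_DOMAIN = 4
    ITEM_PIPELINES = {
        "myproject.pipelines.ValidationPipeline": 300,
    }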

.. _addon-settings:

4. Add-on settings
------------------

:ref:`Add-ons <topics-addons>` can modify settings. They should do this with
``"addon"`` priority where possible.

.. _cmd-default-settings:

5. Command-specific default settings
------------------------------------

Each :ref:`Scrapy command <topics-commands>` can have its own default settings,
which override the :ref:`global default settings <default-settings>`.

Those command-specific default settings are specified in the
``default_settings`` attribute of each command class.

.. _default-settings:

6. Default global settings
--------------------------

The ``scrapy.settings.default_settings`` module defines global default values
for some :ref:`built-in settings <topics-settings-ref>`.

.. note:: :command:`startproject` generates a ``settings.py`` file that sets
    some settings to different values.

    The reference documentation of settings indicates the default value if one
    exists. If :command:`startproject` sets a value, that value is documented
    as default, and the value from ``scrapy.settings.default_settings`` is
    documented as “fallback”.

Compatibility with pickle
=========================

Setting values must be :ref:`picklable <pickle-picklable>`.

Import paths and classes
========================

When a setting references a callable object to be imported by Scrapy, such as a
class or a function, there are two different ways you can specify that object:

-   As a string containing the import path of that object

-   As the object itself

For example:

.. skip: next
.. code-block:: python

   from mybot.pipelines.validate import ValidateMyItem

   ITEM_PIPELINES = {
       # passing the class object...
       ValidateMyItem: 300,
       # ...is equivalent to passing its import path
       "mybot.pipelines.validate.ValidateMyItem": 300,
   }

.. note:: Passing non-callable objects is not supported.

How to access settings
======================

.. highlight:: python

In a spider, settings are available through ``self.settings``:

.. code-block:: python

    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = ["http://example.com"]

        def parse(self, response):
            print(f"Existing settings: {self.settings.attributes.keys()}")

.. note::
    The ``settings`` attribute is set in the base Spider class after the spider
    is initialized.  If you want to use settings before the initialization
    (e.g., in your spider's ``__init__()`` method), you'll need to override the
    :meth:`~scrapy.Spider.from_crawler` method.

:ref:`Components <topics-components>` can also :ref:`access settings
<component-settings>`.

The ``settings`` object can be used like a :class:`dict` (e.g.
``settings["LOG_ENABLED"]``). However, to support non-string setting values,
which may be passed from the command line as strings, it is recommended to use
one of the methods provided by the :class:`~scrapy.settings.Settings` API.
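
For example, a minimal sketch using the typed getters (``getbool()``,
``getint()`` and similar), which convert string values coming from the
command line:

.. code-block:: python

    import scrapy


    class TypedSettingsSpider(scrapy.Spider):
        name = "typed_settings"

        def parse(self, response):
            # "0"/"1"/"True"/"False" strings are converted to booleans
            log_enabled = self.settings.getbool("LOG_ENABLED")
            # The second argument is the default used when the setting is unset
            concurrency = self.settings.getint("CONCURRENT_REQUESTS", 16)
            self.logger.info("logging=%s, concurrency=%d", log_enabled, concurrency)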

.. _component-priority-dictionaries:

Component priority dictionaries
===============================

A **component priority dictionary** is a :class:`dict` where keys are
:ref:`components <topics-components>` and values are component priorities. For
example:

.. skip: next
.. code-block:: python

    {
        "path.to.ComponentA": None,
        ComponentB: 100,
    }

A component can be specified either as a class object or through an import
path.

.. warning:: Component priority dictionaries are regular :class:`dict` objects.
    Be careful not to define the same component more than once, e.g. with
    different import path strings or defining both an import path and a
    :class:`type` object.

A priority can be an :class:`int` or :data:`None`.

A component with priority 1 goes *before* a component with priority 2. What
going before entails, however, depends on the corresponding setting. For
example, in the :setting:`DOWNLOADER_MIDDLEWARES` setting, components have
their
:meth:`~scrapy.downloadermiddlewares.DownloaderMiddleware.process_request`
method executed before that of later components, but have their
:meth:`~scrapy.downloadermiddlewares.DownloaderMiddleware.process_response`
method executed after that of later components.

A component with priority :data:`None` is disabled.

Some component priority dictionaries get merged with some built-in value. For
example, :setting:`DOWNLOADER_MIDDLEWARES` is merged with
:setting:`DOWNLOADER_MIDDLEWARES_BASE`. This is where :data:`None` comes in
handy, allowing you to disable a component from the base setting in the regular
setting:

.. code-block:: python

    DOWNLOADER_MIDDLEWARES = {
        "scrapy.downloadermiddlewares.offsite.OffsiteMiddleware": None,
    }

Special settings
================

The following settings work slightly differently than all other settings.

.. _pre-crawler-settings:

Pre-crawler settings
--------------------

**Pre-crawler settings** are settings used before the
:class:`~scrapy.crawler.Crawler` object is created.

These settings cannot be :ref:`set from a spider <spider-settings>`.

These settings are:

-   :setting:`TWISTED_REACTOR_ENABLED`
-   :setting:`SPIDER_LOADER_CLASS` and settings used by the corresponding
    spider loader class, e.g. :setting:`SPIDER_MODULES` and
    :setting:`SPIDER_LOADER_WARN_ONLY` for the default spider loader class.

.. _reactor-settings:

Reactor settings
----------------

**Reactor settings** are settings tied to the :doc:`Twisted reactor
<twisted:core/howto/reactor-basics>`.

These settings can be defined from a spider. However, because only 1 reactor
can be used per process, these settings cannot use a different value per spider
when :ref:`running multiple spiders in the same process
<run-multiple-spiders>`.

In general, if different spiders define different values, the first defined
value is used. However, if two spiders request a different reactor, an
exception is raised.

These settings are:

-   :setting:`ASYNCIO_EVENT_LOOP` (not possible to set per-spider when using
    :class:`~scrapy.crawler.AsyncCrawlerProcess`, see below)

-   :setting:`TWISTED_DNS_RESOLVER` and settings used by the corresponding
    component, e.g. :setting:`DNSCACHE_ENABLED`, :setting:`DNSCACHE_SIZE`
    and :setting:`DNS_TIMEOUT` for the default one.

-   :setting:`REACTOR_THREADPOOL_MAXSIZE`

-   :setting:`TWISTED_REACTOR` (ignored when using
    :class:`~scrapy.crawler.AsyncCrawlerProcess`, see below)

:setting:`ASYNCIO_EVENT_LOOP` and :setting:`TWISTED_REACTOR` are used upon
installing the reactor. The rest of the settings are applied when starting
the reactor.

There is an additional restriction for :setting:`TWISTED_REACTOR` and
:setting:`ASYNCIO_EVENT_LOOP` when using
:class:`~scrapy.crawler.AsyncCrawlerProcess`: when this class is instantiated,
it installs :class:`~twisted.internet.asyncioreactor.AsyncioSelectorReactor`,
ignoring the value of :setting:`TWISTED_REACTOR` and using the value of
:setting:`ASYNCIO_EVENT_LOOP` that was passed to
:meth:`AsyncCrawlerProcess.__init__()
<scrapy.crawler.AsyncCrawlerProcess.__init__>`. If a different value for
:setting:`TWISTED_REACTOR` or :setting:`ASYNCIO_EVENT_LOOP` is provided later,
e.g. in :ref:`per-spider settings <spider-settings>`, an exception will be
raised.

All of these settings, except for :setting:`ASYNCIO_EVENT_LOOP`, are only used
when the Twisted reactor is used, i.e. when :setting:`TWISTED_REACTOR_ENABLED`
is ``True``.

.. _topics-settings-ref:

Built-in settings reference
===========================

Here's a list of all available Scrapy settings, in alphabetical order, along
with their default values and the scope where they apply.

The scope, where available, shows where the setting is being used, if it's tied
to any particular component. In that case the module of that component will be
shown, typically an extension, middleware or pipeline. It also means that the
component must be enabled in order for the setting to have any effect.

.. setting:: ADDONS

ADDONS
------

Default: ``{}``

A dict containing paths to the add-ons enabled in your project and their
priorities. For more information, see :ref:`topics-addons`.
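
For example (the add-on path below is a hypothetical placeholder):

.. code-block:: python

    ADDONS = {
        "myproject.addons.MyAddon": 100,
    }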

.. setting:: AWS_ACCESS_KEY_ID

AWS_ACCESS_KEY_ID
-----------------

Default: ``None``

The AWS access key used by code that requires access to `Amazon Web services`_,
such as the :ref:`S3 feed storage backend <topics-feed-storage-s3>`.

.. setting:: AWS_SECRET_ACCESS_KEY

AWS_SECRET_ACCESS_KEY
---------------------

Default: ``None``

The AWS secret key used by code that requires access to `Amazon Web services`_,
such as the :ref:`S3 feed storage backend <topics-feed-storage-s3>`.

.. setting:: AWS_SESSION_TOKEN

AWS_SESSION_TOKEN
-----------------

Default: ``None``

The AWS security token used by code that requires access to `Amazon Web services`_,
such as the :ref:`S3 feed storage backend <topics-feed-storage-s3>`, when using
`temporary security credentials`_.

.. _temporary security credentials: https://docs.aws.amazon.com/IAM/latest/UserGuide/security-creds.html

.. setting:: AWS_ENDPOINT_URL

AWS_ENDPOINT_URL
----------------

Default: ``None``

Endpoint URL used for S3-like storage, for example Minio or s3.scality.

.. setting:: AWS_USE_SSL

AWS_USE_SSL
-----------

Default: ``None``

Use this option if you want to disable SSL connections for communication with
S3 or S3-like storage. By default SSL will be used.

.. setting:: AWS_VERIFY

AWS_VERIFY
----------

Default: ``None``

Verify SSL connection between Scrapy and S3 or S3-like storage. By default
SSL verification will occur.

.. setting:: AWS_REGION_NAME

AWS_REGION_NAME
---------------

Default: ``None``

The name of the region associated with the AWS client.

.. setting:: ASYNCIO_EVENT_LOOP

ASYNCIO_EVENT_LOOP
------------------

Default: ``None``

Import path of a given ``asyncio`` event loop class.

If the asyncio reactor is enabled (see :setting:`TWISTED_REACTOR`) this setting can be used to specify the
asyncio event loop to be used with it. Set the setting to the import path of the
desired asyncio event loop class. If the setting is set to ``None`` the default asyncio
event loop will be used.

If you are installing the asyncio reactor manually using the :func:`~scrapy.utils.reactor.install_reactor`
function, you can use the ``event_loop_path`` parameter to indicate the import path of the event loop
class to be used.

Note that the event loop class must inherit from :class:`asyncio.AbstractEventLoop`.
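
For example, to use the event loop provided by the ``uvloop`` package
(assuming it is installed):

.. code-block:: python

    ASYNCIO_EVENT_LOOP = "uvloop.Loop"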

.. caution:: Please be aware that, when using a non-default event loop
    (either defined via :setting:`ASYNCIO_EVENT_LOOP` or installed with
    :func:`~scrapy.utils.reactor.install_reactor`), Scrapy will call
    :func:`asyncio.set_event_loop`, which will set the specified event loop
    as the current loop for the current OS thread.

.. setting:: BOT_NAME

BOT_NAME
--------

Default: ``<project name>`` (:ref:`fallback <default-settings>`: ``'scrapybot'``)

The name of the bot implemented by this Scrapy project (also known as the
project name). This name will also be used for logging.

It's automatically populated with your project name when you create your
project with the :command:`startproject` command.

.. setting:: CONCURRENT_ITEMS

CONCURRENT_ITEMS
----------------

Default: ``100``

Maximum number of concurrent items (per response) to process in parallel in
:ref:`item pipelines <topics-item-pipeline>`.

.. setting:: CONCURRENT_REQUESTS

CONCURRENT_REQUESTS
-------------------

Default: ``16``

The maximum number of concurrent (i.e. simultaneous) requests that will be
performed by the Scrapy downloader.

.. setting:: CONCURRENT_REQUESTS_PER_DOMAIN

CONCURRENT_REQUESTS_PER_DOMAIN
------------------------------

Default: ``1`` (:ref:`fallback <default-settings>`: ``8``)

The maximum number of concurrent (i.e. simultaneous) requests that will be
performed to any single domain.

See also: :ref:`topics-autothrottle` and its
:setting:`AUTOTHROTTLE_TARGET_CONCURRENCY` option.

.. setting:: DEFAULT_DROPITEM_LOG_LEVEL

DEFAULT_DROPITEM_LOG_LEVEL
--------------------------

Default: ``"WARNING"``

Default :ref:`log level <levels>` of messages about dropped items.

When an item is dropped by raising :exc:`scrapy.exceptions.DropItem` from the
:func:`process_item` method of an :ref:`item pipeline <topics-item-pipeline>`,
a message is logged, and by default its log level is the one configured in this
setting.

You may specify this log level as an integer (e.g. ``20``), as a log level
constant (e.g. ``logging.INFO``) or as a string with the name of a log level
constant (e.g. ``"INFO"``).

When writing an item pipeline, you can force a different log level by setting
:attr:`scrapy.exceptions.DropItem.log_level` in your
:exc:`scrapy.exceptions.DropItem` exception. For example:

.. code-block:: python

   from scrapy.exceptions import DropItem

   class MyPipeline:
       def process_item(self, item, spider):
           if not item.get("price"):
               raise DropItem("Missing price data", log_level="INFO")
           return item

.. setting:: DEFAULT_ITEM_CLASS

DEFAULT_ITEM_CLASS
------------------

Default: ``'scrapy.Item'``

The default class that will be used for instantiating items in the :ref:`the
Scrapy shell <topics-shell>`.

.. setting:: DEFAULT_REQUEST_HEADERS

DEFAULT_REQUEST_HEADERS
-----------------------

Default:

.. code-block:: python

    {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en",
    }

The default headers used for Scrapy HTTP Requests. They're populated in the
:class:`~scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware`.

.. caution:: Cookies set via the ``Cookie`` header are not considered by the
    :ref:`cookies-mw`. If you need to set cookies for a request, use the
    :class:`Request.cookies <scrapy.Request>` parameter. This is a known
    current limitation that is being worked on.

.. setting:: DEPTH_LIMIT

DEPTH_LIMIT
-----------

Default: ``0``

Scope: ``scrapy.spidermiddlewares.depth.DepthMiddleware``

The maximum depth that will be allowed to crawl for any site. If zero, no limit
will be imposed.

.. setting:: DEPTH_PRIORITY

DEPTH_PRIORITY
--------------

Default: ``0``

Scope: ``scrapy.spidermiddlewares.depth.DepthMiddleware``

An integer that is used to adjust the :attr:`~scrapy.Request.priority` of
a :class:`~scrapy.Request` based on its depth.

The priority of a request is adjusted as follows:

.. skip: next
.. code-block:: python

    request.priority = request.priority - (depth * DEPTH_PRIORITY)

As depth increases, positive values of ``DEPTH_PRIORITY`` decrease request
priority (BFO), while negative values increase request priority (DFO). See
also :ref:`faq-bfo-dfo`.

.. note::

    This setting adjusts priority **in the opposite way** compared to
    other priority settings :setting:`REDIRECT_PRIORITY_ADJUST`
    and :setting:`RETRY_PRIORITY_ADJUST`.

.. setting:: DEPTH_STATS_VERBOSE

DEPTH_STATS_VERBOSE
-------------------

Default: ``False``

Scope: ``scrapy.spidermiddlewares.depth.DepthMiddleware``

Whether to collect verbose depth stats. If this is enabled, the number of
requests for each depth is collected in the stats.

.. setting:: DNSCACHE_ENABLED

DNSCACHE_ENABLED
----------------

Default: ``True``

Whether to enable DNS in-memory cache.

.. note::
    This setting is only used by
    :class:`~scrapy.resolver.CachingThreadedResolver` and
    :class:`~scrapy.resolver.CachingHostnameResolver`. It has no effect when
    :setting:`TWISTED_REACTOR_ENABLED` is ``False``, and may have no effect
    either when :setting:`DNS_RESOLVER` is set to a different resolver.

.. setting:: DNSCACHE_SIZE

DNSCACHE_SIZE
-------------

Default: ``10000``

DNS in-memory cache size, see :setting:`DNSCACHE_ENABLED`.

.. setting:: TWISTED_DNS_RESOLVER

TWISTED_DNS_RESOLVER
--------------------

Default: ``'scrapy.resolver.CachingThreadedResolver'``

The class to be used by Twisted to resolve DNS names. The default
``scrapy.resolver.CachingThreadedResolver`` supports specifying a timeout for
DNS requests via the :setting:`DNS_TIMEOUT` setting, but works only with IPv4
addresses. Scrapy provides an alternative resolver,
``scrapy.resolver.CachingHostnameResolver``, which supports IPv4/IPv6 addresses but does not
take the :setting:`DNS_TIMEOUT` setting into account.

.. note::
    This setting has no effect when :setting:`TWISTED_REACTOR_ENABLED` is ``False``.

.. setting:: DNS_TIMEOUT

DNS_TIMEOUT
-----------

Default: ``60``

Timeout for processing of DNS queries in seconds. Float is supported.

.. note::
    This setting is only used by
    :class:`~scrapy.resolver.CachingThreadedResolver`. It has no effect when
    :setting:`TWISTED_REACTOR_ENABLED` is ``False``, and may have no effect
    either when :setting:`DNS_RESOLVER` is set to a different resolver.

.. setting:: DOWNLOADER

DOWNLOADER
----------

Default: ``'scrapy.core.downloader.Downloader'``

The downloader to use for crawling.

.. setting:: DOWNLOADER_CLIENT_TLS_CIPHERS

DOWNLOADER_CLIENT_TLS_CIPHERS
-----------------------------

Default: ``'DEFAULT'``

Use this setting to customize the TLS/SSL ciphers used by the HTTPS download
handler.

The setting should contain a string in the `OpenSSL cipher list format`_,
these ciphers will be used as client ciphers. Changing this setting may be
necessary to access certain HTTPS websites: for example, you may need to use
``'DEFAULT:!DH'`` for a website with weak DH parameters or enable a
specific cipher that is not included in ``DEFAULT`` if a website requires it.

.. _OpenSSL cipher list format: https://docs.openssl.org/master/man1/openssl-ciphers/#cipher-list-format

.. note::

    Handling of this setting needs to be implemented inside the :ref:`download
    handler <topics-download-handlers>`, so it's not guaranteed to be supported
    by all 3rd-party handlers. It's currently unsupported by
    :class:`~scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler`.

.. setting:: DOWNLOADER_CLIENT_TLS_METHOD

DOWNLOADER_CLIENT_TLS_METHOD
----------------------------

Default: ``'TLS'``

Use this setting to customize the TLS/SSL method used by the HTTPS download
handler.

This setting must be one of these string values:

- ``'TLS'``: maps to OpenSSL's ``TLS_method()`` (a.k.a ``SSLv23_method()``),
  which allows protocol negotiation, starting from the highest supported
  by the platform; **default, recommended**
- ``'TLSv1.0'``: this value forces HTTPS connections to use TLS version 1.0 ;
  set this if you want the behavior of Scrapy<1.1
- ``'TLSv1.1'``: forces TLS version 1.1
- ``'TLSv1.2'``: forces TLS version 1.2

.. note::

    Handling of this setting needs to be implemented inside the :ref:`download
    handler <topics-download-handlers>`, so it's not guaranteed to be supported
    by all 3rd-party handlers. It's currently unsupported by
    :class:`~scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler`.

.. setting:: DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING

DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING
-------------------------------------

Default: ``False``

Setting this to ``True`` will enable DEBUG level messages about TLS connection
parameters after establishing HTTPS connections. The kind of information logged
depends on the implementation of the download handler and the versions of
the TLS-related libraries.

.. note::

    Handling of this setting needs to be implemented inside the :ref:`download
    handler <topics-download-handlers>`, so it's not guaranteed to be supported
    by all 3rd-party handlers.

.. setting:: DOWNLOADER_MIDDLEWARES

DOWNLOADER_MIDDLEWARES
----------------------

Default: ``{}``

A dict containing the downloader middlewares enabled in your project, and their
orders. For more info see :ref:`topics-downloader-middleware-setting`.

.. _topics-settings:

========
Settings
========

The Scrapy settings allow you to customize the behaviour of all Scrapy
components, including the core, extensions, pipelines, and spiders themselves.

The settings infrastructure provides a global namespace of key-value mappings
from which code can pull configuration values. The settings can be populated
through different mechanisms, which are described below.

The settings are also the mechanism for selecting the currently active Scrapy
project (in case you have many).

For a list of available built-in settings see: :ref:`topics-settings-ref`.

.. _topics-settings-module-envvar:

Designating the settings
========================

When you use Scrapy, you have to tell it which settings you're using. You can
do this by using an environment variable, ``SCRAPY_SETTINGS_MODULE``.

The value of ``SCRAPY_SETTINGS_MODULE`` should be in Python path syntax, e.g.
``myproject.settings``. Note that the settings module should be on the
Python :ref:`import search path <tut-searchpath>`.
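
When running Scrapy from a script, you can also set this variable from Python
before Scrapy is initialized; a minimal sketch, assuming a
``myproject.settings`` module is importable:

.. code-block:: python

    import os

    # Must be set before Scrapy reads the project settings.
    os.environ.setdefault("SCRAPY_SETTINGS_MODULE", "myproject.settings")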

.. _populating-settings:

Populating the settings
=======================

Settings can be populated using different mechanisms, each of which has a
different precedence:

 1. :ref:`Command-line settings <cli-settings>` (highest precedence)
 2. :ref:`Spider settings <spider-settings>`
 3. :ref:`Project settings <project-settings>`
 4. :ref:`Add-on settings <addon-settings>`
 5. :ref:`Command-specific default settings <cmd-default-settings>`
 6. :ref:`Global default settings <default-settings>` (lowest precedence)
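
Internally, each of these mechanisms corresponds to a priority in
:class:`scrapy.settings.Settings`, and higher-priority values win; a minimal
sketch of that behaviour:

.. code-block:: python

    from scrapy.settings import Settings

    settings = Settings()
    settings.set("LOG_LEVEL", "INFO", priority="project")
    settings.set("LOG_LEVEL", "WARNING", priority="cmdline")
    print(settings["LOG_LEVEL"])  # WARNING: the command-line value wins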

.. _cli-settings:

1. Command-line settings
------------------------

Settings set in the command line have the highest precedence, overriding any
other settings.

You can explicitly override one or more settings using the ``-s`` (or
``--set``) command-line option.

.. highlight:: sh

Example::

    scrapy crawl myspider -s LOG_LEVEL=INFO -s LOG_FILE=scrapy.log

.. _spider-settings:

2. Spider settings
------------------

:ref:`Spiders <topics-spiders>` can define their own settings that will take
precedence and override the project ones.

.. note:: :ref:`Pre-crawler settings <pre-crawler-settings>` cannot be defined
    per spider, and :ref:`reactor settings <reactor-settings>` should not have
    a different value per spider when :ref:`running multiple spiders in the
    same process <run-multiple-spiders>`.

One way to do so is by setting their :attr:`~scrapy.Spider.custom_settings`
attribute:

.. code-block:: python

    import scrapy


    class MySpider(scrapy.Spider):
        name = "myspider"

        custom_settings = {
            "SOME_SETTING": "some value",
        }

It's often better to implement :meth:`~scrapy.Spider.update_settings` instead,
and settings set there should use the ``"spider"`` priority explicitly:

.. code-block:: python

    import scrapy


    class MySpider(scrapy.Spider):
        name = "myspider"

        @classmethod
        def update_settings(cls, settings):
            super().update_settings(settings)
            settings.set("SOME_SETTING", "some value", priority="spider")

.. versionadded:: 2.11

It's also possible to modify the settings in the
:meth:`~scrapy.Spider.from_crawler` method, e.g. based on :ref:`spider
arguments <spiderargs>` or other logic:

.. code-block:: python

    import scrapy


    class MySpider(scrapy.Spider):
        name = "myspider"

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            if "some_argument" in kwargs:
                spider.settings.set(
                    "SOME_SETTING", kwargs["some_argument"], priority="spider"
                )
            return spider

.. _project-settings:

3. Project settings
-------------------

Scrapy projects include a settings module, usually a file called
``settings.py``, where you should populate most settings that apply to all your
spiders.

.. seealso:: :ref:`topics-settings-module-envvar`

.. _addon-settings:

4. Add-on settings
------------------

:ref:`Add-ons <topics-addons>` can modify settings. They should do this with
``"addon"`` priority where possible.

.. _cmd-default-settings:

5. Command-specific default settings
------------------------------------

Each :ref:`Scrapy command <topics-commands>` can have its own default settings,
which override the :ref:`global default settings <default-settings>`.

Those command-specific default settings are specified in the
``default_settings`` attribute of each command class.

.. _default-settings:

6. Default global settings
--------------------------

The ``scrapy.settings.default_settings`` module defines global default values
for some :ref:`built-in settings <topics-settings-ref>`.

.. note:: :command:`startproject` generates a ``settings.py`` file that sets
    some settings to different values.

    The reference documentation of settings indicates the default value if one
    exists. If :command:`startproject` sets a value, that value is documented
    as default, and the value from ``scrapy.settings.default_settings`` is
    documented as “fallback”.


Compatibility with pickle
=========================

Setting values must be :ref:`picklable <pickle-picklable>`.

Import paths and classes
========================

When a setting references a callable object to be imported by Scrapy, such as a
class or a function, there are two different ways you can specify that object:

-   As a string containing the import path of that object

-   As the object itself

For example:

.. skip: next
.. code-block:: python

   from mybot.pipelines.validate import ValidateMyItem

   ITEM_PIPELINES = {
       # passing the class object...
       ValidateMyItem: 300,
       # ...is equivalent to passing its import path
       "mybot.pipelines.validate.ValidateMyItem": 300,
   }

.. note:: Passing non-callable objects is not supported.


How to access settings
======================

.. highlight:: python

In a spider, settings are available through ``self.settings``:

.. code-block:: python

    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = ["http://example.com"]

        def parse(self, response):
            print(f"Existing settings: {self.settings.attributes.keys()}")

.. note::
    The ``settings`` attribute is set in the base Spider class after the spider
    is initialized.  If you want to use settings before the initialization
    (e.g., in your spider's ``__init__()`` method), you'll need to override the
    :meth:`~scrapy.Spider.from_crawler` method.

:ref:`Components <topics-components>` can also :ref:`access settings
<component-settings>`.

The ``settings`` object can be used like a :class:`dict` (e.g.
``settings["LOG_ENABLED"]``). However, settings passed from the command line
arrive as strings, so for non-string setting values it is recommended to use
one of the typed methods provided by the :class:`~scrapy.settings.Settings`
API.
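
For example, values set with ``-s`` arrive as strings, and the typed getters
take care of the conversion:

.. code-block:: python

    # After e.g. `scrapy crawl myspider -s LOG_ENABLED=False`,
    # settings["LOG_ENABLED"] is the string "False", which is truthy,
    # while the typed getter parses it into an actual boolean:
    log_enabled = settings.getbool("LOG_ENABLED")  # False
    concurrency = settings.getint("CONCURRENT_REQUESTS")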


.. _component-priority-dictionaries:

Component priority dictionaries
===============================

A **component priority dictionary** is a :class:`dict` where keys are
:ref:`components <topics-components>` and values are component priorities. For
example:

.. skip: next
.. code-block:: python

    {
        "path.to.ComponentA": None,
        ComponentB: 100,
    }

A component can be specified either as a class object or through an import
path.

.. warning:: Component priority dictionaries are regular :class:`dict` objects.
    Be careful not to define the same component more than once, e.g. with
    different import path strings or defining both an import path and a
    :class:`type` object.

A priority can be an :class:`int` or :data:`None`.

A component with priority 1 goes *before* a component with priority 2. What
going before entails, however, depends on the corresponding setting. For
example, in the :setting:`DOWNLOADER_MIDDLEWARES` setting, components have
their
:meth:`~scrapy.downloadermiddlewares.DownloaderMiddleware.process_request`
method executed before that of later components, but have their
:meth:`~scrapy.downloadermiddlewares.DownloaderMiddleware.process_response`
method executed after that of later components.

A component with priority :data:`None` is disabled.

Some component priority dictionaries get merged with some built-in value. For
example, :setting:`DOWNLOADER_MIDDLEWARES` is merged with
:setting:`DOWNLOADER_MIDDLEWARES_BASE`. This is where :data:`None` comes in
handy, allowing you to disable a component from the base setting in the regular
setting:

.. code-block:: python

    DOWNLOADER_MIDDLEWARES = {
        "scrapy.downloadermiddlewares.offsite.OffsiteMiddleware": None,
    }


Special settings
================

The following settings work slightly differently than all other settings.

.. _pre-crawler-settings:

Pre-crawler settings
--------------------

**Pre-crawler settings** are settings used before the
:class:`~scrapy.crawler.Crawler` object is created.

These settings cannot be :ref:`set from a spider <spider-settings>`.

These settings are:

-   :setting:`TWISTED_REACTOR_ENABLED`
-   :setting:`SPIDER_LOADER_CLASS` and settings used by the corresponding
    spider loader class, e.g. :setting:`SPIDER_MODULES` and
    :setting:`SPIDER_LOADER_WARN_ONLY` for the default spider loader class.

.. _reactor-settings:

Reactor settings
----------------

**Reactor settings** are settings tied to the :doc:`Twisted reactor
<twisted:core/howto/reactor-basics>`.

These settings can be defined from a spider. However, because only one
reactor can be used per process, these settings cannot use a different value
per spider when :ref:`running multiple spiders in the same process
<run-multiple-spiders>`.

In general, if different spiders define different values, the first defined
value is used. However, if two spiders request different reactors, an
exception is raised.

These settings are:

-   :setting:`ASYNCIO_EVENT_LOOP` (not possible to set per-spider when using
    :class:`~scrapy.crawler.AsyncCrawlerProcess`, see below)

-   :setting:`TWISTED_DNS_RESOLVER` and settings used by the corresponding
    component, e.g. :setting:`DNSCACHE_ENABLED`, :setting:`DNSCACHE_SIZE`
    and :setting:`DNS_TIMEOUT` for the default one.

-   :setting:`REACTOR_THREADPOOL_MAXSIZE`

-   :setting:`TWISTED_REACTOR` (ignored when using
    :class:`~scrapy.crawler.AsyncCrawlerProcess`, see below)

:setting:`ASYNCIO_EVENT_LOOP` and :setting:`TWISTED_REACTOR` are used upon
installing the reactor. The rest of the settings are applied when starting
the reactor.

There is an additional restriction for :setting:`TWISTED_REACTOR` and
:setting:`ASYNCIO_EVENT_LOOP` when using
:class:`~scrapy.crawler.AsyncCrawlerProcess`: when this class is instantiated,
it installs :class:`~twisted.internet.asyncioreactor.AsyncioSelectorReactor`,
ignoring the value of :setting:`TWISTED_REACTOR` and using the value of
:setting:`ASYNCIO_EVENT_LOOP` that was passed to
:meth:`AsyncCrawlerProcess.__init__()
<scrapy.crawler.AsyncCrawlerProcess.__init__>`. If a different value for
:setting:`TWISTED_REACTOR` or :setting:`ASYNCIO_EVENT_LOOP` is provided later,
e.g. in :ref:`per-spider settings <spider-settings>`, an exception will be
raised.
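
A minimal sketch of that restriction (the event loop value is illustrative
and requires the ``uvloop`` package):

.. code-block:: python

    from scrapy.crawler import AsyncCrawlerProcess

    # The reactor and event loop are installed here, based on these settings;
    # a conflicting TWISTED_REACTOR or ASYNCIO_EVENT_LOOP value provided
    # later, e.g. in per-spider settings, raises an exception.
    process = AsyncCrawlerProcess(settings={"ASYNCIO_EVENT_LOOP": "uvloop.Loop"})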

All of these settings, except for :setting:`ASYNCIO_EVENT_LOOP`, are only used
when the Twisted reactor is used, i.e. when :setting:`TWISTED_REACTOR_ENABLED`
is ``True``.

.. _topics-settings-ref:

Built-in settings reference
===========================

Here's a list of all available Scrapy settings, in alphabetical order, along
with their default values and the scope where they apply.

The scope, where available, shows where the setting is being used, if it's tied
to any particular component. In that case the module of that component will be
shown, typically an extension, middleware or pipeline. It also means that the
component must be enabled in order for the setting to have any effect.

.. setting:: ADDONS

ADDONS
------

Default: ``{}``

A dict containing paths to the add-ons enabled in your project and their
priorities. For more information, see :ref:`topics-addons`.
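
For example (the add-on path is illustrative):

.. code-block:: python

    ADDONS = {
        "myproject.addons.MyAddon": 100,
    }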

.. setting:: AWS_ACCESS_KEY_ID

AWS_ACCESS_KEY_ID
-----------------

Default: ``None``

The AWS access key used by code that requires access to `Amazon Web services`_,
such as the :ref:`S3 feed storage backend <topics-feed-storage-s3>`.

.. setting:: AWS_SECRET_ACCESS_KEY

AWS_SECRET_ACCESS_KEY
---------------------

Default: ``None``

The AWS secret key used by code that requires access to `Amazon Web services`_,
such as the :ref:`S3 feed storage backend <topics-feed-storage-s3>`.

.. setting:: AWS_SESSION_TOKEN

AWS_SESSION_TOKEN
-----------------

Default: ``None``

The AWS security token used by code that requires access to `Amazon Web services`_,
such as the :ref:`S3 feed storage backend <topics-feed-storage-s3>`, when using
`temporary security credentials`_.

.. _temporary security credentials: https://docs.aws.amazon.com/IAM/latest/UserGuide/security-creds.html

.. setting:: AWS_ENDPOINT_URL

AWS_ENDPOINT_URL
----------------

Default: ``None``

Endpoint URL used for S3-like storage, for example Minio or s3.scality.
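
For example, for a locally running MinIO instance (address and port are
illustrative):

.. code-block:: python

    AWS_ENDPOINT_URL = "http://localhost:9000"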

.. setting:: AWS_USE_SSL

AWS_USE_SSL
-----------

Default: ``None``

Set this option to ``False`` to disable SSL in communication with S3 or
S3-like storage. By default, SSL is used.

.. setting:: AWS_VERIFY

AWS_VERIFY
----------

Default: ``None``

Whether to verify the SSL connection between Scrapy and S3 or S3-like
storage. By default, SSL verification occurs.

.. setting:: AWS_REGION_NAME

AWS_REGION_NAME
---------------

Default: ``None``

The name of the region associated with the AWS client.

.. setting:: ASYNCIO_EVENT_LOOP

ASYNCIO_EVENT_LOOP
------------------

Default: ``None``

Import path of the ``asyncio`` event loop class to use.

If the asyncio reactor is enabled (see :setting:`TWISTED_REACTOR`), this
setting can be used to specify the asyncio event loop to be used with it. Set
it to the import path of the desired event loop class. If set to ``None``,
the default asyncio event loop is used.
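
For example, to use the event loop provided by the ``uvloop`` package (which
must be installed separately):

.. code-block:: python

    ASYNCIO_EVENT_LOOP = "uvloop.Loop"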

If you are installing the asyncio reactor manually using the :func:`~scrapy.utils.reactor.install_reactor`
function, you can use the ``event_loop_path`` parameter to indicate the import path of the event loop
class to be used.

Note that the event loop class must inherit from :class:`asyncio.AbstractEventLoop`.

.. caution:: Please be aware that, when using a non-default event loop
    (either defined via :setting:`ASYNCIO_EVENT_LOOP` or installed with
    :func:`~scrapy.utils.reactor.install_reactor`), Scrapy will call
    :func:`asyncio.set_event_loop`, which will set the specified event loop
    as the current loop for the current OS thread.

.. setting:: BOT_NAME

BOT_NAME
--------

Default: ``<project name>`` (:ref:`fallback <default-settings>`: ``'scrapybot'``)

The name of the bot implemented by this Scrapy project (also known as the
project name). This name is also used for logging.

It's automatically populated with your project name when you create your
project with the :command:`startproject` command.

.. setting:: CONCURRENT_ITEMS

CONCURRENT_ITEMS
----------------

Default: ``100``

Maximum number of concurrent items (per response) to process in parallel in
:ref:`item pipelines <topics-item-pipeline>`.

.. setting:: CONCURRENT_REQUESTS

CONCURRENT_REQUESTS
-------------------

Default: ``16``

The maximum number of concurrent (i.e. simultaneous) requests that will be
performed by the Scrapy downloader.

.. setting:: CONCURRENT_REQUESTS_PER_DOMAIN

CONCURRENT_REQUESTS_PER_DOMAIN
------------------------------

Default: ``1`` (:ref:`fallback <default-settings>`: ``8``)

The maximum number of concurrent (i.e. simultaneous) requests that will be
performed to any single domain.

See also: :ref:`topics-autothrottle` and its
:setting:`AUTOTHROTTLE_TARGET_CONCURRENCY` option.
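
For example, with the following sketch at most 32 requests are in flight
overall, but no more than 8 of them target the same domain at once:

.. code-block:: python

    CONCURRENT_REQUESTS = 32
    CONCURRENT_REQUESTS_PER_DOMAIN = 8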


.. setting:: DEFAULT_DROPITEM_LOG_LEVEL

DEFAULT_DROPITEM_LOG_LEVEL
--------------------------

Default: ``"WARNING"``

Default :ref:`log level <levels>` of messages about dropped items.

When an item is dropped by raising :exc:`scrapy.exceptions.DropItem` from the
:func:`process_item` method of an :ref:`item pipeline <topics-item-pipeline>`,
a message is logged, and by default its log level is the one configured in this
setting.

You may specify this log level as an integer (e.g. ``20``), as a log level
constant (e.g. ``logging.INFO``) or as a string with the name of a log level
constant (e.g. ``"INFO"``).

When writing an item pipeline, you can force a different log level by setting
:attr:`scrapy.exceptions.DropItem.log_level` in your
:exc:`scrapy.exceptions.DropItem` exception. For example:

.. code-block:: python

   from scrapy.exceptions import DropItem


   class MyPipeline:
       def process_item(self, item):
           if not item.get("price"):
               raise DropItem("Missing price data", log_level="INFO")
           return item

.. setting:: DEFAULT_ITEM_CLASS

DEFAULT_ITEM_CLASS
------------------

Default: ``'scrapy.Item'``

The default class that will be used for instantiating items in :ref:`the
Scrapy shell <topics-shell>`.

.. setting:: DEFAULT_REQUEST_HEADERS

DEFAULT_REQUEST_HEADERS
-----------------------

Default:

.. code-block:: python

    {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en",
    }

The default headers used for Scrapy HTTP Requests. They're populated in the
:class:`~scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware`.

.. caution:: Cookies set via the ``Cookie`` header are not considered by the
    :ref:`cookies-mw`. If you need to set cookies for a request, use the
    :class:`Request.cookies <scrapy.Request>` parameter. This is a known
    current limitation that is being worked on.
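
For example, instead of a ``Cookie`` header, pass cookies through the
dedicated parameter, so that the :ref:`cookies-mw` can track them:

.. code-block:: python

    import scrapy

    request = scrapy.Request(
        "https://www.example.com",
        cookies={"currency": "USD"},
    )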

.. setting:: DEPTH_LIMIT

DEPTH_LIMIT
-----------

Default: ``0``

Scope: ``scrapy.spidermiddlewares.depth.DepthMiddleware``

The maximum crawl depth allowed for any site. If zero, no limit is imposed.

.. setting:: DEPTH_PRIORITY

DEPTH_PRIORITY
--------------

Default: ``0``

Scope: ``scrapy.spidermiddlewares.depth.DepthMiddleware``

An integer that is used to adjust the :attr:`~scrapy.Request.priority` of
a :class:`~scrapy.Request` based on its depth.

The priority of a request is adjusted as follows:

.. skip: next
.. code-block:: python

    request.priority = request.priority - (depth * DEPTH_PRIORITY)

As depth increases, positive values of ``DEPTH_PRIORITY`` decrease request
priority (BFO), while negative values increase request priority (DFO). See
also :ref:`faq-bfo-dfo`.
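
For example, this is the setting combination that :ref:`faq-bfo-dfo` suggests
for crawling in breadth-first order:

.. code-block:: python

    DEPTH_PRIORITY = 1
    SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
    SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"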

.. note::

    This setting adjusts priority **in the opposite way** compared to
    other priority settings :setting:`REDIRECT_PRIORITY_ADJUST`
    and :setting:`RETRY_PRIORITY_ADJUST`.

.. setting:: DEPTH_STATS_VERBOSE

DEPTH_STATS_VERBOSE
-------------------

Default: ``False``

Scope: ``scrapy.spidermiddlewares.depth.DepthMiddleware``

Whether to collect verbose depth stats. If this is enabled, the number of
requests for each depth is collected in the stats.

.. setting:: DNSCACHE_ENABLED

DNSCACHE_ENABLED
----------------

Default: ``True``

Whether to enable DNS in-memory cache.

.. note::
    This setting is only used by
    :class:`~scrapy.resolver.CachingThreadedResolver` and
    :class:`~scrapy.resolver.CachingHostnameResolver`. It has no effect when
    :setting:`TWISTED_REACTOR_ENABLED` is ``False``, and may have no effect
    either when :setting:`TWISTED_DNS_RESOLVER` is set to a different resolver.

.. setting:: DNSCACHE_SIZE

DNSCACHE_SIZE
-------------

Default: ``10000``

DNS in-memory cache size, see :setting:`DNSCACHE_ENABLED`.

.. setting:: TWISTED_DNS_RESOLVER

TWISTED_DNS_RESOLVER
--------------------

Default: ``'scrapy.resolver.CachingThreadedResolver'``

The class to be used by Twisted to resolve DNS names. The default
``scrapy.resolver.CachingThreadedResolver`` supports specifying a timeout for
DNS requests via the :setting:`DNS_TIMEOUT` setting, but works only with IPv4
addresses. Scrapy provides an alternative resolver,
``scrapy.resolver.CachingHostnameResolver``, which supports IPv4/IPv6 addresses but does not
take the :setting:`DNS_TIMEOUT` setting into account.

.. note::
    This setting has no effect when :setting:`TWISTED_REACTOR_ENABLED` is ``False``.

.. setting:: DNS_TIMEOUT

DNS_TIMEOUT
-----------

Default: ``60``

Timeout for processing DNS queries, in seconds. Floats are supported.

.. note::
    This setting is only used by
    :class:`~scrapy.resolver.CachingThreadedResolver`. It has no effect when
    :setting:`TWISTED_REACTOR_ENABLED` is ``False``, and may have no effect
    either when :setting:`DNS_RESOLVER` is set to a different resolver.

.. setting:: DOWNLOADER

DOWNLOADER
----------

Default: ``'scrapy.core.downloader.Downloader'``

The downloader to use for crawling.

.. setting:: DOWNLOADER_CLIENT_TLS_CIPHERS

DOWNLOADER_CLIENT_TLS_CIPHERS
-----------------------------

Default: ``'DEFAULT'``

Use this setting to customize the TLS/SSL ciphers used by the HTTPS download
handler.

The setting should contain a string in the `OpenSSL cipher list format`_;
these ciphers will be used as client ciphers. Changing this setting may be
necessary to access certain HTTPS websites: for example, you may need to use
``'DEFAULT:!DH'`` for a website with weak DH parameters or enable a
specific cipher that is not included in ``DEFAULT`` if a website requires it.

.. _OpenSSL cipher list format: https://docs.openssl.org/master/man1/openssl-ciphers/#cipher-list-format
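
For example, to work around a website with weak DH parameters:

.. code-block:: python

    DOWNLOADER_CLIENT_TLS_CIPHERS = "DEFAULT:!DH"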

.. note::

    Handling of this setting needs to be implemented inside the :ref:`download
    handler <topics-download-handlers>`, so it's not guaranteed to be supported
    by all 3rd-party handlers. It's currently unsupported by
    :class:`~scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler`.

.. setting:: DOWNLOADER_CLIENT_TLS_METHOD

DOWNLOADER_CLIENT_TLS_METHOD
----------------------------

Default: ``'TLS'``

Use this setting to customize the TLS/SSL method used by the HTTPS download
handler.

This setting must be one of these string values:

- ``'TLS'``: maps to OpenSSL's ``TLS_method()`` (a.k.a ``SSLv23_method()``),
  which allows protocol negotiation, starting from the highest supported
  by the platform; **default, recommended**
- ``'TLSv1.0'``: forces HTTPS connections to use TLS version 1.0; set this if
  you want the behavior of Scrapy<1.1
- ``'TLSv1.1'``: forces TLS version 1.1
- ``'TLSv1.2'``: forces TLS version 1.2

.. note::

    Handling of this setting needs to be implemented inside the :ref:`download
    handler <topics-download-handlers>`, so it's not guaranteed to be supported
    by all 3rd-party handlers. It's currently unsupported by
    :class:`~scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler`.

.. setting:: DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING

DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING
-------------------------------------

Default: ``False``

Setting this to ``True`` will enable DEBUG level messages about TLS connection
parameters after establishing HTTPS connections. The kind of information logged
depends on the implementation of the download handler and the versions of
the TLS-related libraries.

.. note::

    Handling of this setting needs to be implemented inside the :ref:`download
    handler <topics-download-handlers>`, so it's not guaranteed to be supported
    by all 3rd-party handlers.

.. setting:: DOWNLOADER_MIDDLEWARES

DOWNLOADER_MIDDLEWARES
----------------------

Default: ``{}``

A dict containing the downloader middlewares enabled in your project, and their
orders. For more info see :ref:`topics-downloader-middleware-setting`.
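
For example (the middleware path is hypothetical):

.. code-block:: python

    DOWNLOADER_MIDDLEWARES = {
        "myproject.middlewares.CustomDownloaderMiddleware": 543,
    }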

.. setting:: DOWNLOADER_MIDDLEWARES_BASE

DOWNLOADER_MIDDLEWARES_BASE
---------------------------

Default:

.. code-block:: python

    {
        "scrapy.downloadermiddlewares.offsite.OffsiteMiddleware": 50,
        "scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware": 100,
        "scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware": 300,
        "scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware": 350,
        "scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware": 400,
        "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 500,
        "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
        "scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware": 560,
        "scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware": 580,
        "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 590,
        "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": 600,
        "scrapy.downloadermiddlewares.cookies.CookiesMiddleware": 700,
        "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
        "scrapy.downloadermiddlewares.stats.DownloaderStats": 850,
        "scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware": 900,
    }

A dict containing the downloader middlewares enabled by default in Scrapy. Low
orders are closer to the engine, high orders are closer to the downloader. You
should never modify this setting in your project, modify
:setting:`DOWNLOADER_MIDDLEWARES` instead.  For more info see
:ref:`topics-downloader-middleware-setting`.

.. setting:: DOWNLOADER_STATS

DOWNLOADER_STATS
----------------

Default: ``True``

Whether to enable downloader stats collection.

.. setting:: DOWNLOAD_DELAY

DOWNLOAD_DELAY
--------------

Default: ``1`` (:ref:`fallback <default-settings>`: ``0``)

Minimum seconds to wait between 2 consecutive requests to the same domain.

Use :setting:`DOWNLOAD_DELAY` to throttle your crawling speed, to avoid hitting
servers too hard.

Decimal numbers are supported. For example, to send a maximum of 4 requests
every 10 seconds::

    DOWNLOAD_DELAY = 2.5

This setting is also affected by the :setting:`RANDOMIZE_DOWNLOAD_DELAY`
setting, which is enabled by default.

Note that :setting:`DOWNLOAD_DELAY` can lower the effective per-domain
concurrency below :setting:`CONCURRENT_REQUESTS_PER_DOMAIN`. If the response
time of a domain is lower than :setting:`DOWNLOAD_DELAY`, the effective
concurrency for that domain is 1. When testing throttling configurations, it
usually makes sense to lower :setting:`CONCURRENT_REQUESTS_PER_DOMAIN` first,
and only increase :setting:`DOWNLOAD_DELAY` once
:setting:`CONCURRENT_REQUESTS_PER_DOMAIN` is 1 but a higher throttling is
desired.

.. _spider-download_delay-attribute:

.. note::

    This delay can be set per spider using the :attr:`download_delay` spider
    attribute.

It is also possible to change this setting per domain, although it requires
non-trivial code. See the implementation of the :ref:`AutoThrottle
<topics-autothrottle>` extension for an example.
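
For example, a spider that waits at least 1.5 seconds between consecutive
requests to the same domain (the spider name and URL are illustrative):

.. code-block:: python

    import scrapy


    class ThrottledSpider(scrapy.Spider):
        name = "throttled"
        start_urls = ["https://quotes.toscrape.com"]
        download_delay = 1.5  # overrides DOWNLOAD_DELAY for this spider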

.. setting:: DOWNLOAD_BIND_ADDRESS

DOWNLOAD_BIND_ADDRESS
---------------------

Default: ``None``

The default local outgoing address for download-handler connections.

This setting can be either:

- a host address as a string (e.g. ``"127.0.0.2"``), in which case the local
  port is chosen automatically, or

- a ``(host, port)`` tuple (e.g. ``("127.0.0.2", 50000)``) to bind to both a
  specific local interface and a specific local port.

For example:

.. code-block:: python

    # Bind to this local address
    DOWNLOAD_BIND_ADDRESS = "127.0.0.2"

.. code-block:: python

    # Bind to this local address and local port
    DOWNLOAD_BIND_ADDRESS = ("127.0.0.2", 5000)

If set, built-in HTTP download handlers use this value by default.
Set the :reqmeta:`bindaddress` request meta key to override it for a specific
request.

.. note::

    Handling of this setting needs to be implemented inside the :ref:`download
    handler <topics-download-handlers>`, so it's not guaranteed to be supported
    by all 3rd-party handlers. Specifying the port is unsupported by
    :class:`~scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler`.

.. setting:: DOWNLOAD_HANDLERS

DOWNLOAD_HANDLERS
-----------------

Default: ``{}``

A dict containing the :ref:`download handlers <topics-download-handlers>`
enabled in your project.

See :setting:`DOWNLOAD_HANDLERS_BASE` for example format.

.. setting:: DOWNLOAD_HANDLERS_BASE

DOWNLOAD_HANDLERS_BASE
----------------------

Default:

.. code-block:: python

    {
        "data": "scrapy.core.downloader.handlers.datauri.DataURIDownloadHandler",
        "file": "scrapy.core.downloader.handlers.file.FileDownloadHandler",
        "http": "scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler",
        "https": "scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler",
        "s3": "scrapy.core.downloader.handlers.s3.S3DownloadHandler",
        "ftp": "scrapy.core.downloader.handlers.ftp.FTPDownloadHandler",
    }

(when :setting:`TWISTED_REACTOR_ENABLED` is ``True``)

.. code-block:: python

    {
        "data": "scrapy.core.downloader.handlers.datauri.DataURIDownloadHandler",
        "file": "scrapy.core.downloader.handlers.file.FileDownloadHandler",
        "http": "scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler",
        "https": "scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler",
        "s3": "scrapy.core.downloader.handlers.s3.S3DownloadHandler",
        "ftp": None,
    }

(when :setting:`TWISTED_REACTOR_ENABLED` is ``False``)

A dict containing the :ref:`download handlers <topics-download-handlers>`
enabled by default in Scrapy. You should never modify this setting in your
project, modify :setting:`DOWNLOAD_HANDLERS` instead.

You can disable any of these download handlers by assigning ``None`` to their
URI scheme in :setting:`DOWNLOAD_HANDLERS`. E.g., to disable the built-in FTP
handler (without replacement), place this in your ``settings.py``:

.. code-block:: python

    DOWNLOAD_HANDLERS = {
        "ftp": None,
    }


.. setting:: DOWNLOAD_SLOTS

DOWNLOAD_SLOTS
--------------

Default: ``{}``

Allows defining concurrency/delay parameters on a per-slot (per-domain) basis:

    .. code-block:: python

        DOWNLOAD_SLOTS = {
            "quotes.toscrape.com": {"concurrency": 1, "delay": 2, "randomize_delay": False},
            "books.toscrape.com": {"delay": 3, "randomize_delay": False},
        }

.. note::

    For any downloader slot not listed here, the default values come from the
    corresponding settings:

    -   :setting:`DOWNLOAD_DELAY`: ``delay``
    -   :setting:`CONCURRENT_REQUESTS_PER_DOMAIN`: ``concurrency``
    -   :setting:`RANDOMIZE_DOWNLOAD_DELAY`: ``randomize_delay``


.. setting:: DOWNLOAD_TIMEOUT

DOWNLOAD_TIMEOUT
----------------

Default: ``180``

The amount of time (in seconds) that the downloader will wait before timing
out.

.. note::

    This timeout can be set per request using the :reqmeta:`download_timeout`
    :attr:`.Request.meta` key.

.. note::

    Handling of this setting needs to be implemented inside the :ref:`download
    handler <topics-download-handlers>`, so it's not guaranteed to be supported
    by all 3rd-party handlers.
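
For example, a spider that gives one request a shorter timeout via its
:reqmeta:`download_timeout` meta key (the spider name and URL are
illustrative):

.. code-block:: python

    import scrapy


    class TimeoutSpider(scrapy.Spider):
        name = "timeout_example"

        async def start(self):
            # Allow 30 seconds instead of the global DOWNLOAD_TIMEOUT
            yield scrapy.Request(
                "https://example.com", meta={"download_timeout": 30}
            )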

.. setting:: DOWNLOAD_MAXSIZE
.. reqmeta:: download_maxsize

DOWNLOAD_MAXSIZE
----------------

Default: ``1073741824`` (1 GiB)

The maximum response body size (in bytes) allowed. Bigger responses are
aborted and ignored.

This applies both before and after compression. If decompressing a response
body would exceed this limit, decompression is aborted and the response is
ignored.

Use ``0`` to disable this limit.

.. note::

    This limit can be set per-request using the :reqmeta:`download_maxsize`
    :attr:`.Request.meta` key.

.. note::

    Checking responses before decompressing them needs to be implemented inside
    the :ref:`download handler <topics-download-handlers>`, so it's not
    guaranteed to be supported by all 3rd-party handlers.

.. setting:: DOWNLOAD_WARNSIZE
.. reqmeta:: download_warnsize

DOWNLOAD_WARNSIZE
-----------------

Default: ``33554432`` (32 MiB)

If the size of a response exceeds this value, before or after compression, a
warning will be logged about it.

Use ``0`` to disable this limit.

.. note::

    This limit can be set per-request using the :reqmeta:`download_warnsize`
    :attr:`.Request.meta` key.

.. note::

    Checking responses before decompressing them needs to be implemented inside
    the :ref:`download handler <topics-download-handlers>`, so it's not
    guaranteed to be supported by all 3rd-party handlers.

.. setting:: DOWNLOAD_FAIL_ON_DATALOSS

DOWNLOAD_FAIL_ON_DATALOSS
-------------------------

Default: ``True``

Whether or not to fail on broken responses, that is, when the declared
``Content-Length`` does not match content sent by the server or a chunked
response was not properly finished. If ``True``, these responses raise a
:exc:`~scrapy.exceptions.ResponseDataLossError` exception. If ``False``, these
responses are passed through and the flag ``dataloss`` is added to the
response, i.e.: ``'dataloss' in response.flags`` is ``True``.

Optionally, this can be set on a per-request basis by setting the
:reqmeta:`download_fail_on_dataloss` ``Request.meta`` key to ``False``.
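
For example (the URL is illustrative):

.. code-block:: python

    import scrapy

    request = scrapy.Request(
        "https://example.com", meta={"download_fail_on_dataloss": False}
    )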

.. note::

  A broken response, or data loss error, may happen under several
  circumstances, from server misconfiguration to network errors to data
  corruption. It is up to the user to decide if it makes sense to process
  broken responses considering they may contain partial or incomplete content.
  If :setting:`RETRY_ENABLED` is ``True`` and this setting is set to ``True``,
  the :exc:`~scrapy.exceptions.ResponseDataLossError` failure will be retried
  as usual.

.. note::

    Handling of this setting needs to be implemented inside the :ref:`download
    handler <topics-download-handlers>`, so it's not guaranteed to be supported
    by all 3rd-party handlers.

.. warning::

    This setting is ignored by the
    :class:`~scrapy.core.downloader.handlers.http2.H2DownloadHandler`
    :ref:`download handler <topics-download-handlers>`. In case of a data loss
    error, the corresponding HTTP/2 connection may be corrupted, affecting other
    requests that use the same connection; hence, a ``ResponseFailed([InvalidBodyLengthError])``
    failure is always raised for every request that was using that connection.

.. setting:: DOWNLOAD_VERIFY_CERTIFICATES

DOWNLOAD_VERIFY_CERTIFICATES
----------------------------

Default: ``False``

Whether the HTTPS download handlers should verify the server TLS certificate
when making a request and abort the request if the verification fails.

.. note::

    Handling of this setting needs to be implemented inside the :ref:`download
    handler <topics-download-handlers>`, so it's not guaranteed to be supported
    by all 3rd-party handlers. The exact behavior of a handler (e.g. whether
    certificate problems are logged when this setting is set to ``False``)
    depends on its implementation.

.. setting:: DUPEFILTER_CLASS

DUPEFILTER_CLASS
----------------

Default: ``'scrapy.dupefilters.RFPDupeFilter'``

The class used to detect and filter duplicate requests.

The default, :class:`~scrapy.dupefilters.RFPDupeFilter`, filters based on the
:setting:`REQUEST_FINGERPRINTER_CLASS` setting.

To change how duplicates are checked, you can point :setting:`DUPEFILTER_CLASS`
to a custom subclass of :class:`~scrapy.dupefilters.RFPDupeFilter` that
overrides its ``__init__`` method to use a :ref:`different request
fingerprinting class <custom-request-fingerprinter>`. For example:

.. code-block:: python

    from scrapy.dupefilters import RFPDupeFilter
    from scrapy.utils.request import fingerprint


    class CustomRequestFingerprinter:
        def fingerprint(self, request):
            return fingerprint(request, include_headers=["X-ID"])


    class CustomDupeFilter(RFPDupeFilter):

        def __init__(self, path=None, debug=False, *, fingerprinter=None):
            super().__init__(
                path=path, debug=debug, fingerprinter=CustomRequestFingerprinter()
            )

To disable duplicate request filtering, set :setting:`DUPEFILTER_CLASS` to
``'scrapy.dupefilters.BaseDupeFilter'``. Note that not filtering out duplicate
requests may cause crawling loops. It is usually better to set the
``dont_filter`` parameter to ``True`` when creating the specific
:class:`~scrapy.Request` objects that should not be filtered out.
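
For example, to re-request a URL that may already have been seen (the URL is
illustrative):

.. code-block:: python

    import scrapy

    request = scrapy.Request("https://example.com", dont_filter=True)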

A class assigned to :setting:`DUPEFILTER_CLASS` must implement the following
interface::

    class MyDupeFilter:

        @classmethod
        def from_crawler(cls, crawler):
            """Returns an instance of this duplicate request filtering class
            based on the current Crawler instance."""
            return cls()

        def request_seen(self, request):
            """Returns ``True`` if *request* is a duplicate of another request
            seen in a previous call to :meth:`request_seen`, or ``False``
            otherwise."""
            return False

        def open(self):
            """Called before the spider opens. It may return a deferred."""
            pass

        def close(self, reason):
            """Called before the spider closes. It may return a deferred."""
            pass

        def log(self, request, spider):
            """Logs that a request has been filtered out.

            It is called right after a call to :meth:`request_seen` that
            returns ``True``.

            If :meth:`request_seen` always returns ``False``, such as in the
            case of :class:`~scrapy.dupefilters.BaseDupeFilter`, this method
            may be omitted.
            """
            pass

.. autoclass:: scrapy.dupefilters.BaseDupeFilter

.. autoclass:: scrapy.dupefilters.RFPDupeFilter


.. setting:: DUPEFILTER_DEBUG

DUPEFILTER_DEBUG
----------------

Default: ``False``

By default, ``RFPDupeFilter`` only logs the first duplicate request.
Setting :setting:`DUPEFILTER_DEBUG` to ``True`` will make it log all duplicate requests.

.. setting:: EDITOR

EDITOR
------

Default: ``vi`` (on Unix systems) or the IDLE editor (on Windows)

The editor to use for editing spiders with the :command:`edit` command.
Additionally, if the ``EDITOR`` environment variable is set, the :command:`edit`
command will prefer it over the default setting.

.. setting:: EXTENSIONS

EXTENSIONS
----------

Default: ``{}``

:ref:`Component priority dictionary <component-priority-dictionaries>` of
enabled extensions. See :ref:`topics-extensions`.

.. setting:: EXTENSIONS_BASE

EXTENSIONS_BASE
---------------

Default:

.. code-block:: python

    {
        "scrapy.extensions.corestats.CoreStats": 0,
        "scrapy.extensions.telnet.TelnetConsole": 0,
        "scrapy.extensions.memusage.MemoryUsage": 0,
        "scrapy.extensions.memdebug.MemoryDebugger": 0,
        "scrapy.extensions.closespider.CloseSpider": 0,
        "scrapy.extensions.feedexport.FeedExporter": 0,
        "scrapy.extensions.logstats.LogStats": 0,
        "scrapy.extensions.spiderstate.SpiderState": 0,
        "scrapy.extensions.throttle.AutoThrottle": 0,
    }

A dict containing the extensions available by default in Scrapy, and their
orders. This setting contains all stable built-in extensions. Keep in mind that
some of them need to be enabled through a setting.

For more information, see the :ref:`extensions user guide <topics-extensions>`
and the :ref:`list of available extensions <topics-extensions-ref>`.

.. setting:: FEED_TEMPDIR

FEED_TEMPDIR
------------

The directory to use for storing temporary crawler files before uploading them
with the :ref:`FTP feed storage <topics-feed-storage-ftp>` or :ref:`Amazon S3
<topics-feed-storage-s3>` feed storages.

.. setting:: FEED_STORAGE_GCS_ACL

FEED_STORAGE_GCS_ACL
--------------------

The Access Control List (ACL) used when storing items to :ref:`Google Cloud Storage <topics-feed-storage-gcs>`.
For more information on how to set this value, please refer to the column *JSON API* in `Google Cloud documentation <https://docs.cloud.google.com/storage/docs/access-control/lists>`_.

.. setting:: FORCE_CRAWLER_PROCESS

FORCE_CRAWLER_PROCESS
---------------------

Default: ``False``

If ``False``, :ref:`Scrapy commands that need a CrawlerProcess
<topics-commands-crawlerprocess>` will decide between using
:class:`scrapy.crawler.AsyncCrawlerProcess` and
:class:`scrapy.crawler.CrawlerProcess` based on the value of the
:setting:`TWISTED_REACTOR` setting, but ignoring its value in :ref:`per-spider
settings <spider-settings>`.

If ``True``, these commands will always use
:class:`~scrapy.crawler.CrawlerProcess`.

Set this to ``True`` if you want to set :setting:`TWISTED_REACTOR` to a
non-default value in :ref:`per-spider settings <spider-settings>`.

.. setting:: FTP_PASSIVE_MODE

FTP_PASSIVE_MODE
----------------

Default: ``True``

Whether or not to use passive mode when initiating FTP transfers.

.. note::

    Handling of this setting needs to be implemented inside the :ref:`download
    handler <topics-download-handlers>`, so it's not guaranteed to be supported
    by all 3rd-party handlers.

.. reqmeta:: ftp_password
.. setting:: FTP_PASSWORD

FTP_PASSWORD
------------

Default: ``"guest"``

The password to use for FTP connections when there is no ``"ftp_password"``
in ``Request`` meta.

.. note::
    Paraphrasing `RFC 1635`_, although it is common to use either the password
    "guest" or one's e-mail address for anonymous FTP,
    some FTP servers explicitly ask for the user's e-mail address
    and will not allow login with the "guest" password.

.. _RFC 1635: https://datatracker.ietf.org/doc/html/rfc1635

.. note::

    Handling of this setting needs to be implemented inside the :ref:`download
    handler <topics-download-handlers>`, so it's not guaranteed to be supported
    by all 3rd-party handlers.

.. reqmeta:: ftp_user
.. setting:: FTP_USER

FTP_USER
--------

Default: ``"anonymous"``

The username to use for FTP connections when there is no ``"ftp_user"``
in ``Request`` meta.
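
For example, to use specific credentials for a given FTP request via the
:reqmeta:`ftp_user` and :reqmeta:`ftp_password` meta keys (the URL and
credentials are illustrative):

.. code-block:: python

    import scrapy

    request = scrapy.Request(
        "ftp://ftp.example.com/pub/file.txt",
        meta={"ftp_user": "someuser", "ftp_password": "somepass"},
    )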

.. note::

    Handling of this setting needs to be implemented inside the :ref:`download
    handler <topics-download-handlers>`, so it's not guaranteed to be supported
    by all 3rd-party handlers.

.. setting:: GCS_PROJECT_ID

GCS_PROJECT_ID
--------------

Default: ``None``

The Project ID that will be used when storing data on `Google Cloud Storage`_.

.. setting:: ITEM_PIPELINES

ITEM_PIPELINES
--------------

Default: ``{}``

A dict containing the item pipelines to use, and their orders. Order values are
arbitrary, but it is customary to define them in the 0-1000 range. Lower orders
process before higher orders.

Example:

.. code-block:: python

   ITEM_PIPELINES = {
       "mybot.pipelines.validate.ValidateMyItem": 300,
       "mybot.pipelines.validate.StoreMyItem": 800,
   }

.. setting:: ITEM_PIPELINES_BASE

ITEM_PIPELINES_BASE
-------------------

Default: ``{}``

A dict containing the pipelines enabled by default in Scrapy. You should never
modify this setting in your project, modify :setting:`ITEM_PIPELINES` instead.


.. setting:: JOBDIR

JOBDIR
------

Default: ``None``

A string indicating the directory for storing the state of a crawl when
:ref:`pausing and resuming crawls <topics-jobs>`.


.. setting:: LOG_ENABLED

LOG_ENABLED
-----------

Default: ``True``

Whether to enable logging.

.. setting:: LOG_ENCODING

LOG_ENCODING
------------

Default: ``'utf-8'``

The encoding to use for logging.

.. setting:: LOG_FILE

LOG_FILE
--------

Default: ``None``

File name to use for logging output. If ``None``, standard error will be used.

.. setting:: LOG_FILE_APPEND

LOG_FILE_APPEND
---------------

Default: ``True``

If ``False``, the log file specified with :setting:`LOG_FILE` will be
overwritten (discarding the output from previous runs, if any).
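
For example, to log to a file and start it afresh on every run (the file name
is illustrative):

.. code-block:: python

    LOG_FILE = "scrapy.log"
    LOG_FILE_APPEND = False  # discard logs from previous runs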

.. setting:: LOG_FORMAT

LOG_FORMAT
----------

Default: ``'%(asctime)s [%(name)s] %(levelname)s: %(message)s'``

String for formatting log messages. Refer to the
:ref:`Python logging documentation <logrecord-attributes>` for the whole
list of available placeholders.

.. setting:: LOG_DATEFORMAT

LOG_DATEFORMAT
--------------

Default: ``'%Y-%m-%d %H:%M:%S'``

String for formatting date/time, expansion of the ``%(asctime)s`` placeholder
in :setting:`LOG_FORMAT`. Refer to the
:ref:`Python datetime documentation <strftime-strptime-behavior>` for the
whole list of available directives.

.. setting:: LOG_FORMATTER

LOG_FORMATTER
-------------

Default: :class:`scrapy.logformatter.LogFormatter`

The class to use for :ref:`formatting log messages <custom-log-formats>` for different actions.

.. setting:: LOG_LEVEL

LOG_LEVEL
---------

Default: ``'DEBUG'``

Minimum level to log. Available levels are: CRITICAL, ERROR, WARNING,
INFO, DEBUG. For more info see :ref:`topics-logging`.

.. setting:: LOG_STDOUT

LOG_STDOUT
----------

Default: ``False``

If ``True``, all standard output (and error) of your process will be redirected
to the log. For example if you ``print('hello')`` it will appear in the Scrapy
log.

.. setting:: LOG_SHORT_NAMES

LOG_SHORT_NAMES
---------------

Default: ``False``

If ``True``, the logs will just contain the root path. If ``False``, the
component responsible for the log output is displayed.

.. setting:: LOG_VERSIONS

LOG_VERSIONS
------------

Default: ``["lxml", "libxml2", "cssselect", "parsel", "w3lib", "Twisted", "Python", "pyOpenSSL", "cryptography", "Platform"]``

Logs the installed versions of the specified items.

An item can be any installed Python package.

The following special items are also supported:

-   ``libxml2``

-   ``Platform`` (:func:`platform.platform`)

-   ``Python``

.. setting:: LOGSTATS_INTERVAL

LOGSTATS_INTERVAL
-----------------

Default: ``60.0``

The interval (in seconds) between each logging printout of the stats
by :class:`~scrapy.extensions.logstats.LogStats`.

.. setting:: MEMDEBUG_ENABLED

MEMDEBUG_ENABLED
----------------

Default: ``False``

Whether to enable memory debugging.

.. setting:: MEMDEBUG_NOTIFY

MEMDEBUG_NOTIFY
---------------

Default: ``[]``

When memory debugging is enabled, a memory report will be sent to the
specified addresses if this setting is not empty; otherwise the report will be
written to the log.

Example::

    MEMDEBUG_NOTIFY = ['user@example.com']

.. setting:: MEMUSAGE_ENABLED

MEMUSAGE_ENABLED
----------------

Default: ``True``

Scope: ``scrapy.extensions.memusage.MemoryUsage``

Whether to enable the memory usage extension. This extension keeps track of
the peak memory used by the process (writing it to stats). It can also
optionally shut down the Scrapy process when it exceeds a memory limit
(see :setting:`MEMUSAGE_LIMIT_MB`).

See :ref:`topics-extensions-ref-memusage`.

.. setting:: MEMUSAGE_LIMIT_MB

MEMUSAGE_LIMIT_MB
-----------------

Default: ``0``

Scope: ``scrapy.extensions.memusage.MemoryUsage``

The maximum amount of memory to allow (in megabytes) before shutting down
Scrapy (if :setting:`MEMUSAGE_ENABLED` is ``True``). If zero, no check will be
performed.

See :ref:`topics-extensions-ref-memusage`.

.. setting:: MEMUSAGE_CHECK_INTERVAL_SECONDS

MEMUSAGE_CHECK_INTERVAL_SECONDS
-------------------------------

Default: ``60.0``

Scope: ``scrapy.extensions.memusage.MemoryUsage``

The :ref:`Memory usage extension <topics-extensions-ref-memusage>`
checks the current memory usage, versus the limits set by
:setting:`MEMUSAGE_LIMIT_MB` and :setting:`MEMUSAGE_WARNING_MB`,
at fixed time intervals.

This sets the length of these intervals, in seconds.

See :ref:`topics-extensions-ref-memusage`.

.. setting:: MEMUSAGE_WARNING_MB

MEMUSAGE_WARNING_MB
-------------------

Default: ``0``

Scope: ``scrapy.extensions.memusage.MemoryUsage``

The maximum amount of memory to allow (in megabytes) before sending a
:signal:`memusage_warning_reached` signal (if :setting:`MEMUSAGE_ENABLED` is
``True``). If zero, no signal will be sent.

See :ref:`topics-extensions-ref-memusage`.
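
For example, to get a :signal:`memusage_warning_reached` signal at 1.5 GiB,
shut down the crawl at 2 GiB, and check memory usage every 30 seconds (the
values are illustrative):

.. code-block:: python

    MEMUSAGE_WARNING_MB = 1536
    MEMUSAGE_LIMIT_MB = 2048
    MEMUSAGE_CHECK_INTERVAL_SECONDS = 30.0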

.. setting:: NEWSPIDER_MODULE

NEWSPIDER_MODULE
----------------

Default: ``"<project name>.spiders"`` (:ref:`fallback <default-settings>`: ``""``)

The module in which to create new spiders using the :command:`genspider`
command.

Example::

    NEWSPIDER_MODULE = 'mybot.spiders_dev'

.. setting:: RANDOMIZE_DOWNLOAD_DELAY

RANDOMIZE_DOWNLOAD_DELAY
------------------------

Default: ``True``

If enabled, Scrapy will wait a random amount of time (between 0.5 * :setting:`DOWNLOAD_DELAY` and 1.5 * :setting:`DOWNLOAD_DELAY`) while fetching requests from the same
website.

This randomization decreases the chance of the crawler being detected (and
subsequently blocked) by sites which analyze requests looking for statistically
significant similarities in the time between their requests.

The randomization policy is the same as used by the `wget`_ ``--random-wait``
option.

If :setting:`DOWNLOAD_DELAY` is zero, this option has no effect.

.. _wget: https://www.gnu.org/software/wget/manual/wget.html

.. setting:: REACTOR_THREADPOOL_MAXSIZE

REACTOR_THREADPOOL_MAXSIZE
--------------------------

Default: ``10``

The maximum size of the Twisted reactor thread pool. This is a common
multi-purpose thread pool used by various Scrapy components, such as the
threaded DNS resolver, BlockingFeedStorage, and S3FilesStore. Increase this
value if you're experiencing problems with insufficient blocking IO.

.. setting:: REDIRECT_PRIORITY_ADJUST

REDIRECT_PRIORITY_ADJUST
------------------------

Default: ``+2``

Scope: ``scrapy.downloadermiddlewares.redirect.RedirectMiddleware``

Adjust redirect request priority relative to original request:

- **a positive priority adjust (default) means higher priority.**
- a negative priority adjust means lower priority.

.. setting:: ROBOTSTXT_OBEY

ROBOTSTXT_OBEY
--------------

Default: ``True`` (:ref:`fallback <default-settings>`: ``False``)

If enabled, Scrapy will respect robots.txt policies. For more information see
:ref:`topics-dlmw-robots`.

.. note::

    While the fallback value is ``False`` for historical reasons, this option
    is enabled in the ``settings.py`` file generated by the ``scrapy
    startproject`` command.

.. setting:: ROBOTSTXT_PARSER

ROBOTSTXT_PARSER
----------------

Default: ``'scrapy.robotstxt.ProtegoRobotParser'``

The parser backend to use for parsing ``robots.txt`` files. For more information see
:ref:`topics-dlmw-robots`.

.. setting:: ROBOTSTXT_USER_AGENT

ROBOTSTXT_USER_AGENT
--------------------

Default: ``None``

The user agent string to use for matching in the robots.txt file. If ``None``,
the User-Agent header you are sending with the request or the
:setting:`USER_AGENT` setting (in that order) will be used for determining
the user agent to use in the robots.txt file.

.. setting:: SCHEDULER

SCHEDULER
---------

Default: :class:`~scrapy.core.scheduler.Scheduler`

The scheduler class to be used for crawling. See :ref:`topics-scheduler` for
details.

.. setting:: SCHEDULER_DEBUG

SCHEDULER_DEBUG
---------------

Default: ``False``

Setting this to ``True`` will log debug information about the request
scheduler. Currently, this logs (only once) when requests cannot be serialized
to disk. The ``scheduler/unserializable`` stats counter tracks the number of
times this happens.

Example entry in logs::

    1956-01-31 00:00:00+0800 [scrapy.core.scheduler] ERROR: Unable to serialize request:
    <GET http://example.com> - reason: cannot serialize <Request at 0x9a7c7ec>
    (type Request)> - no more unserializable requests will be logged
    (see 'scheduler/unserializable' stats counter)


.. setting:: SCHEDULER_DISK_QUEUE

SCHEDULER_DISK_QUEUE
--------------------

Default: ``'scrapy.squeues.PickleLifoDiskQueue'``

Type of disk queue that will be used by the scheduler. Other available types
are ``scrapy.squeues.PickleFifoDiskQueue``,
``scrapy.squeues.MarshalFifoDiskQueue``,
``scrapy.squeues.MarshalLifoDiskQueue``.


.. setting:: SCHEDULER_MEMORY_QUEUE

SCHEDULER_MEMORY_QUEUE
----------------------

Default: ``'scrapy.squeues.LifoMemoryQueue'``

Type of in-memory queue used by the scheduler. Another available type is
``scrapy.squeues.FifoMemoryQueue``.


.. setting:: SCHEDULER_PRIORITY_QUEUE

SCHEDULER_PRIORITY_QUEUE
------------------------

Default: :class:`~scrapy.pqueues.DownloaderAwarePriorityQueue`

Type of priority queue used by the scheduler.

Another available type is :class:`~scrapy.pqueues.ScrapyPriorityQueue`.

:class:`~scrapy.pqueues.DownloaderAwarePriorityQueue` works better than
:class:`~scrapy.pqueues.ScrapyPriorityQueue` when you crawl many different
domains in parallel.


.. setting:: SCHEDULER_START_DISK_QUEUE

SCHEDULER_START_DISK_QUEUE
--------------------------

Default: ``'scrapy.squeues.PickleFifoDiskQueue'``

Type of disk queue (see :setting:`JOBDIR`) that the :ref:`scheduler
<topics-scheduler>` uses for :ref:`start requests <start-requests>`.

For available choices, see :setting:`SCHEDULER_DISK_QUEUE`.

.. queue-common-starts

Use ``None`` or ``""`` to disable these separate queues entirely, and instead
have start requests share the same queues as other requests.

.. note::

    Disabling separate start request queues makes :ref:`start request order
    <start-request-order>` unintuitive: start requests will be sent in order
    only until :setting:`CONCURRENT_REQUESTS` is reached, then remaining start
    requests will be sent in reverse order.

.. queue-common-ends


.. setting:: SCHEDULER_START_MEMORY_QUEUE

SCHEDULER_START_MEMORY_QUEUE
----------------------------

Default: ``'scrapy.squeues.FifoMemoryQueue'``

Type of in-memory queue that the :ref:`scheduler <topics-scheduler>` uses for
:ref:`start requests <start-requests>`.

For available choices, see :setting:`SCHEDULER_MEMORY_QUEUE`.

.. include:: settings.rst
    :start-after: queue-common-starts
    :end-before: queue-common-ends


.. setting:: SCRAPER_SLOT_MAX_ACTIVE_SIZE

SCRAPER_SLOT_MAX_ACTIVE_SIZE
----------------------------

Default: ``5_000_000``

Soft limit (in bytes) for response data being processed.

While the sum of the sizes of all responses being processed is above this value,
Scrapy does not process new requests.

.. setting:: SPIDER_CONTRACTS

SPIDER_CONTRACTS
----------------

Default: ``{}``

A dict containing the spider contracts enabled in your project, used for
testing spiders. For more info see :ref:`topics-contracts`.
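
For example, to enable a custom contract (the contract path is hypothetical):

.. code-block:: python

    SPIDER_CONTRACTS = {
        "myproject.contracts.ResponseCheckContract": 10,
    }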

.. setting:: SPIDER_CONTRACTS_BASE

SPIDER_CONTRACTS_BASE
---------------------

Default:

.. code-block:: python

    {
        "scrapy.contracts.default.UrlContract": 1,
        "scrapy.contracts.default.ReturnsContract": 2,
        "scrapy.contracts.default.ScrapesContract": 3,
    }

A dict containing the Scrapy contracts enabled by default in Scrapy. You should
never modify this setting in your project, modify :setting:`SPIDER_CONTRACTS`
instead. For more info see :ref:`topics-contracts`.

You can disable any of these contracts by assigning ``None`` to their class
path in :setting:`SPIDER_CONTRACTS`. E.g., to disable the built-in
``ScrapesContract``, place this in your ``settings.py``:

.. code-block:: python

    SPIDER_CONTRACTS = {
        "scrapy.contracts.default.ScrapesContract": None,
    }

.. setting:: SPIDER_LOADER_CLASS

SPIDER_LOADER_CLASS
-------------------

Default: ``'scrapy.spiderloader.SpiderLoader'``

The class that will be used for loading spiders, which must implement the
:ref:`topics-api-spiderloader`.

.. setting:: SPIDER_LOADER_WARN_ONLY

SPIDER_LOADER_WARN_ONLY
-----------------------

Default: ``False``

By default, when Scrapy tries to import spider classes from :setting:`SPIDER_MODULES`,
it will fail loudly if there is any ``ImportError`` or ``SyntaxError`` exception.
But you can choose to silence this exception and turn it into a simple
warning by setting ``SPIDER_LOADER_WARN_ONLY = True``.

.. setting:: SPIDER_MIDDLEWARES

SPIDER_MIDDLEWARES
------------------

Default: ``{}``

A dict containing the spider middlewares enabled in your project, and their
orders. For more info see :ref:`topics-spider-middleware-setting`.

.. setting:: SPIDER_MIDDLEWARES_BASE

SPIDER_MIDDLEWARES_BASE
-----------------------

Default:

.. code-block:: python

    {
        "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": 50,
        "scrapy.spidermiddlewares.referer.RefererMiddleware": 700,
        "scrapy.spidermiddlewares.urllength.UrlLengthMiddleware": 800,
        "scrapy.spidermiddlewares.depth.DepthMiddleware": 900,
    }

A dict containing the spider middlewares enabled by default in Scrapy, and
their orders. Low orders are closer to the engine, high orders are closer to
the spider. For more info see :ref:`topics-spider-middleware-setting`.

.. setting:: SPIDER_MODULES

SPIDER_MODULES
--------------

Default: ``["<project name>.spiders"]`` (:ref:`fallback <default-settings>`: ``[]``)

A list of modules where Scrapy will look for spiders.

Example:

.. code-block:: python

    SPIDER_MODULES = ["mybot.spiders_prod", "mybot.spiders_dev"]

.. setting:: STATS_CLASS

STATS_CLASS
-----------

Default: ``'scrapy.statscollectors.MemoryStatsCollector'``

The class to use for collecting stats, which must implement the
:ref:`topics-api-stats`.

.. setting:: STATS_DUMP

STATS_DUMP
----------

Default: ``True``

Dump the :ref:`Scrapy stats <topics-stats>` (to the Scrapy log) once the spider
finishes.

For more info see: :ref:`topics-stats`.

.. setting:: TELNETCONSOLE_ENABLED

TELNETCONSOLE_ENABLED
---------------------

Default: ``True`` (``False`` when :setting:`TWISTED_REACTOR_ENABLED` is ``False``)

A boolean which specifies if the :ref:`telnet console <topics-telnetconsole>`
will be enabled (provided its extension is also enabled).

.. setting:: TEMPLATES_DIR

TEMPLATES_DIR
-------------

Default: ``templates`` dir inside scrapy module

The directory where to look for templates when creating new projects with
:command:`startproject` command and new spiders with :command:`genspider`
command.

The project name must not conflict with the name of custom files or directories
in the ``project`` subdirectory.

.. setting:: TWISTED_REACTOR_ENABLED

TWISTED_REACTOR_ENABLED
-----------------------

Default: ``True``

Whether to install and use the Twisted reactor.

If this is set to ``True``, Scrapy will use the Twisted reactor and will
install one according to the :setting:`TWISTED_REACTOR` setting value when
appropriate (e.g. when running via :ref:`the command-line tool
<topics-commands>`). This is the traditional mode of using Scrapy.

If this is set to ``False``, Scrapy will use the asyncio event loop directly
and will not attempt to install or use a reactor. Features that require a
reactor won't be available, but Twisted APIs that don't require a reactor,
including :class:`~twisted.internet.defer.Deferred` and
:class:`~twisted.python.failure.Failure`, will still be available. On the other
hand, limitations related to Twisted reactors (such as not being able to start
a reactor in the same process where a reactor was previously started and
stopped) will not apply. This mode is currently experimental and may not be
suitable for production use. It may also not be supported by 3rd-party code.
See :ref:`asyncio-without-reactor` for more information about this mode.

.. note:: This setting can't be set :ref:`per-spider <spider-settings>`.

.. versionadded:: 2.15.0

.. setting:: TWISTED_REACTOR

TWISTED_REACTOR
---------------

Default: ``"twisted.internet.asyncioreactor.AsyncioSelectorReactor"``

Import path of a given :mod:`~twisted.internet.reactor`.

Scrapy will install this reactor if no other reactor is installed yet, such as
when the ``scrapy`` CLI program is invoked or when using the
:class:`~scrapy.crawler.AsyncCrawlerProcess` class or the
:class:`~scrapy.crawler.CrawlerProcess` class.

If you are using the :class:`~scrapy.crawler.AsyncCrawlerRunner` class or the
:class:`~scrapy.crawler.CrawlerRunner` class, you also
need to install the correct reactor manually. You can do that using
:func:`~scrapy.utils.reactor.install_reactor`:

.. autofunction:: scrapy.utils.reactor.install_reactor

If a reactor is already installed,
:func:`~scrapy.utils.reactor.install_reactor` has no effect.

:class:`~scrapy.crawler.AsyncCrawlerRunner` and other similar classes raise an
exception if the installed reactor does not match the
:setting:`TWISTED_REACTOR` setting; therefore, having top-level
:mod:`~twisted.internet.reactor` imports in project files and imported
third-party libraries will make Scrapy raise an exception when it checks which
reactor is installed.

In order to use the reactor installed by Scrapy:

.. skip: next
.. code-block:: python

    import scrapy
    from twisted.internet import reactor


    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def __init__(self, *args, **kwargs):
            self.timeout = int(kwargs.pop("timeout", "60"))
            super().__init__(*args, **kwargs)

        async def start(self):
            reactor.callLater(self.timeout, self.stop)

            urls = ["https://quotes.toscrape.com/page/1"]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {"text": quote.css("span.text::text").get()}

        def stop(self):
            self.crawler.engine.close_spider(self, "timeout")


which raises an exception, becomes:

.. code-block:: python

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def __init__(self, *args, **kwargs):
            self.timeout = int(kwargs.pop("timeout", "60"))
            super().__init__(*args, **kwargs)

        async def start(self):
            from twisted.internet import reactor

            reactor.callLater(self.timeout, self.stop)

            urls = ["https://quotes.toscrape.com/page/1"]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {"text": quote.css("span.text::text").get()}

        def stop(self):
            self.crawler.engine.close_spider(self, "timeout")


If this setting is set to ``None``, Scrapy will use the existing reactor if
one is already installed, or install the default reactor defined by Twisted
for the current platform.

.. versionchanged:: 2.13
   The default value was changed from ``None`` to
   ``"twisted.internet.asyncioreactor.AsyncioSelectorReactor"``.

For additional information, see :doc:`core/howto/choosing-reactor`.


.. setting:: URLLENGTH_LIMIT

URLLENGTH_LIMIT
---------------

Default: ``2083``

Scope: ``spidermiddlewares.urllength``

The maximum URL length to allow for crawled URLs.

This setting can act as a stopping condition in case of URLs of ever-increasing
length, which may be caused for example by a programming error either in the
target server or in your code. See also :setting:`REDIRECT_MAX_TIMES` and
:setting:`DEPTH_LIMIT`.

Use ``0`` to allow URLs of any length.

The default value is copied from the `Microsoft Internet Explorer maximum URL
length`_, even though this setting exists for different reasons.

.. _Microsoft Internet Explorer maximum URL length: https://web.archive.org/web/20250206050143/https://support.microsoft.com/en-us/topic/maximum-url-length-is-2-083-characters-in-internet-explorer-174e7c8a-6666-f4e0-6fd6-908b53c12246

.. setting:: USER_AGENT

USER_AGENT
----------

Default: ``"Scrapy/VERSION (+https://scrapy.org)"``

The default User-Agent to use when crawling, unless overridden. This user agent is
also used by :class:`~scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware`
if :setting:`ROBOTSTXT_USER_AGENT` setting is ``None`` and
there is no overriding User-Agent header specified for the request.

.. setting:: WARN_ON_GENERATOR_RETURN_VALUE

WARN_ON_GENERATOR_RETURN_VALUE
------------------------------

Default: ``True``

When enabled, Scrapy will warn if generator-based callback methods (like
``parse``) contain return statements with non-``None`` values. This helps detect
potential mistakes in spider development.

Disable this setting to avoid syntax errors that may occur when generator
function source code is modified dynamically at runtime, to skip AST parsing
of callback functions, or to improve performance in auto-reloading development
environments.

Settings documented elsewhere:
------------------------------

The following settings are documented elsewhere; please check each specific
case to see how to enable and use them.

.. settingslist::

.. _Amazon web services: https://aws.amazon.com/
.. _Google Cloud Storage: https://cloud.google.com/storage/



.. _topics-shell:

============
Scrapy shell
============

The Scrapy shell is an interactive shell where you can try and debug your
scraping code very quickly, without having to run the spider. It's meant to be
used for testing data extraction code, but you can actually use it for testing
any kind of code as it is also a regular Python shell.

The shell is used for testing XPath or CSS expressions and seeing how they
work and what data they extract from the web pages you're trying to scrape. It
allows you to interactively test your expressions while you're writing your
spider, without having to run the spider to test every change.

Once you're familiar with the Scrapy shell, you'll see that it's an
invaluable tool for developing and debugging your spiders.

Configuring the shell
=====================

If you have `IPython`_ installed, the Scrapy shell will use it (instead of the
standard Python console). The `IPython`_ console is much more powerful and
provides smart auto-completion and colorized output, among other things.

We highly recommend you install `IPython`_, especially if you're working on
Unix systems (where `IPython`_ excels). See the `IPython installation guide`_
for more info.

Scrapy also has support for `bpython`_, and will try to use it where `IPython`_
is unavailable.

Through Scrapy's settings you can configure it to use any one of
``ipython``, ``bpython`` or the standard ``python`` shell, regardless of which
are installed. This is done by setting the ``SCRAPY_PYTHON_SHELL`` environment
variable; or by defining it in your :ref:`scrapy.cfg <topics-config-settings>`::

    [settings]
    shell = bpython
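
Alternatively, on Unix-like systems, you could set the environment variable in
the shell session from which you launch Scrapy::

    export SCRAPY_PYTHON_SHELL=ipython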

.. _IPython: https://ipython.org/
.. _IPython installation guide: https://ipython.org/install/
.. _bpython: https://bpython-interpreter.org/

Launch the shell
================

To launch the Scrapy shell you can use the :command:`shell` command like
this::

    scrapy shell <url>

Where ``<url>`` is the URL you want to scrape.

:command:`shell` also works for local files. This can be handy if you want
to play around with a local copy of a web page. :command:`shell` understands
the following syntaxes for local files::

    # UNIX-style
    scrapy shell ./path/to/file.html
    scrapy shell ../other/path/to/file.html
    scrapy shell /absolute/path/to/file.html

    # File URI
    scrapy shell file:///absolute/path/to/file.html

.. note:: When using relative file paths, be explicit and prepend them
    with ``./`` (or ``../`` when relevant).
    ``scrapy shell index.html`` will not work as one might expect (and
    this is by design, not a bug).

    Because :command:`shell` favors HTTP URLs over File URIs,
    and ``index.html`` being syntactically similar to ``example.com``,
    :command:`shell` will treat ``index.html`` as a domain name and trigger
    a DNS lookup error::

        $ scrapy shell index.html
        [ ... scrapy shell starts ... ]
        [ ... traceback ... ]
        twisted.internet.error.DNSLookupError: DNS lookup failed:
        address 'index.html' not found: [Errno -5] No address associated with hostname.

    :command:`shell` will not test beforehand if a file called ``index.html``
    exists in the current directory. Again, be explicit.

Using the shell
===============

The Scrapy shell is just a regular Python console (or `IPython`_ console if you
have it available) which provides some additional shortcut functions for
convenience.

Available shortcuts
-------------------

-   ``shelp()`` - print a help with the list of available objects and
    shortcuts

-   ``fetch(url[, redirect=True])`` - fetch a new response from the given URL
    and update all related objects accordingly. You can optionally ask for HTTP
    3xx redirections to not be followed by passing ``redirect=False``

-   ``fetch(request)`` - fetch a new response from the given request and update
    all related objects accordingly.

-   ``view(response)`` - open the given response in your local web browser, for
    inspection. This will add a `\<base\> tag`_ to the response body in order
    for external links (such as images and style sheets) to display properly.
    Note, however, that this will create a temporary file on your computer,
    which won't be removed automatically.

.. _<base> tag: https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/base

Available Scrapy objects
------------------------

The Scrapy shell automatically creates some convenient objects from the
downloaded page, like the :class:`~scrapy.http.Response` object and the
:class:`~scrapy.Selector` objects (for both HTML and XML
content).

Those objects are:

-   ``crawler`` - the current :class:`~scrapy.crawler.Crawler` object.

-   ``spider`` - the Spider which is known to handle the URL, or a
    :class:`~scrapy.Spider` object if there is no spider found for the
    current URL.

-   ``request`` - a :class:`~scrapy.Request` object of the last fetched
    page. You can modify this request using
    :meth:`~scrapy.Request.replace` or fetch a new request (without
    leaving the shell) using the ``fetch`` shortcut.

-   ``response`` - a :class:`~scrapy.http.Response` object containing the last
    fetched page

-   ``settings`` - the current :ref:`Scrapy settings <topics-settings>`

Example of shell session
========================

.. skip: start

Here's an example of a typical shell session where we start by scraping the
https://scrapy.org page, and then proceed to scrape the https://old.reddit.com/
page. Finally, we modify the (Reddit) request method to POST and re-fetch it,
getting an error. We end the session by typing Ctrl-D (on Unix systems) or
Ctrl-Z (on Windows).

Keep in mind that the data extracted here may not be the same when you try it,
as those pages are not static and could have changed by the time you test this.
The only purpose of this example is to get you familiarized with how the Scrapy
shell works.

First, we launch the shell::

    scrapy shell 'https://scrapy.org' --nolog

.. note::

   Remember to always enclose URLs in quotes when running the Scrapy shell from
   the command line, otherwise URLs containing arguments (i.e. with an ``&``
   character) will not work.

   On Windows, use double quotes instead::

       scrapy shell "https://scrapy.org" --nolog

Then, the shell fetches the URL (using the Scrapy downloader) and prints the
list of available objects and useful shortcuts (you'll notice that these lines
all start with the ``[s]`` prefix)::

    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x7f07395dd690>
    [s]   item       {}
    [s]   request    <GET https://scrapy.org>
    [s]   response   <200 https://scrapy.org/>
    [s]   settings   <scrapy.settings.Settings object at 0x7f07395dd710>
    [s]   spider     <DefaultSpider 'default' at 0x7f0735891690>
    [s] Useful shortcuts:
    [s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
    [s]   fetch(req)                  Fetch a scrapy.Request and update local objects
    [s]   shelp()           Shell help (print this help)
    [s]   view(response)    View response in a browser

    >>>

After that, we can start playing with the objects:

.. code-block:: pycon

    >>> response.xpath("//title/text()").get()
    'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'

    >>> fetch("https://old.reddit.com/")

    >>> response.xpath("//title/text()").get()
    'reddit: the front page of the internet'

    >>> request = request.replace(method="POST")

    >>> fetch(request)

    >>> response.status
    404

    >>> from pprint import pprint

    >>> pprint(response.headers)
    {'Accept-Ranges': ['bytes'],
     'Cache-Control': ['max-age=0, must-revalidate'],
     'Content-Type': ['text/html; charset=UTF-8'],
     'Date': ['Thu, 08 Dec 2016 16:21:19 GMT'],
     'Server': ['snooserv'],
     'Set-Cookie': ['loid=KqNLou0V9SKMX4qb4n; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
                    'loidcreated=2016-12-08T16%3A21%3A19.445Z; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
                    'loid=vi0ZVe4NkxNWdlH7r7; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
                    'loidcreated=2016-12-08T16%3A21%3A19.459Z; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure'],
     'Vary': ['accept-encoding'],
     'Via': ['1.1 varnish'],
     'X-Cache': ['MISS'],
     'X-Cache-Hits': ['0'],
     'X-Content-Type-Options': ['nosniff'],
     'X-Frame-Options': ['SAMEORIGIN'],
     'X-Moose': ['majestic'],
     'X-Served-By': ['cache-cdg8730-CDG'],
     'X-Timer': ['S1481214079.394283,VS0,VE159'],
     'X-Ua-Compatible': ['IE=edge'],
     'X-Xss-Protection': ['1; mode=block']}

.. skip: end

.. _topics-shell-inspect-response:

Invoking the shell from spiders to inspect responses
====================================================

Sometimes you want to inspect the responses that are being processed at a
certain point of your spider, if only to check that the response you expect is
getting there.

This can be achieved by using the ``scrapy.shell.inspect_response`` function.

Here's an example of how you would call it from your spider:

.. code-block:: python

    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = [
            "http://example.com",
            "http://example.org",
            "http://example.net",
        ]

        def parse(self, response):
            # We want to inspect one specific response.
            if ".org" in response.url:
                from scrapy.shell import inspect_response

                inspect_response(response, self)

            # Rest of parsing code.

.. skip: start

When you run the spider, you will get something similar to this::

    2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
    2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
    [s] Available Scrapy objects:
    [s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
    ...

    >>> response.url
    'http://example.org'

Then, you can check if the extraction code is working:

.. code-block:: pycon

    >>> response.xpath('//h1[@class="fn"]')
    []

Nope, it doesn't. So you can open the response in your web browser and see if
it's the response you were expecting:

.. code-block:: pycon

    >>> view(response)
    True

Finally you hit Ctrl-D (or Ctrl-Z in Windows) to exit the shell and resume the
crawling::

    >>> ^D
    2014-01-23 17:50:03-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.net> (referer: None)
    ...

.. skip: end

Note that you can't use the ``fetch`` shortcut here since the Scrapy engine is
blocked by the shell. However, after you leave the shell, the spider will
continue crawling where it stopped, as shown above.


.. _topics-signals:

=======
Signals
=======

Scrapy uses signals extensively to notify when certain events occur. You can
catch some of those signals in your Scrapy project (using an :ref:`extension
<topics-extensions>`, for example) to perform additional tasks or extend Scrapy
to add functionality not provided out of the box.

Even though signals provide several arguments, the handlers that catch them
don't need to accept all of them: the signal dispatching mechanism will only
deliver the arguments that the handler accepts.

You can connect to signals (or send your own) through the
:ref:`topics-api-signals`.

Here is a simple example showing how you can catch signals and perform some action:

.. code-block:: python

    from scrapy import signals
    from scrapy import Spider

    class DmozSpider(Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
        ]

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
            return spider

        def spider_closed(self, spider):
            spider.logger.info("Spider closed: %s", spider.name)

        def parse(self, response):
            pass

.. _signal-deferred:

Asynchronous signal handlers
============================

Some signals support returning :class:`~twisted.internet.defer.Deferred`
or :term:`awaitable objects <awaitable>` from their handlers, allowing
you to run asynchronous code that does not block Scrapy. If a signal
handler returns one of these objects, Scrapy waits for that asynchronous
operation to finish.

Let's take an example using :ref:`coroutines <topics-coroutines>`:

.. skip: next
.. code-block:: python

    import json

    import scrapy
    import treq
    from scrapy import signals

    class SignalSpider(scrapy.Spider):
        name = "signals"
        start_urls = ["https://quotes.toscrape.com/page/1/"]

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            crawler.signals.connect(spider.item_scraped, signal=signals.item_scraped)
            return spider

        async def item_scraped(self, item):
            # Send the scraped item to the server
            response = await treq.post(
                "http://example.com/post",
                json.dumps(item).encode("ascii"),
                headers={b"Content-Type": [b"application/json"]},
            )

            return response

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                    "tags": quote.css("div.tags a.tag::text").getall(),
                }

See the :ref:`topics-signals-ref` below to know which signals support
:class:`~twisted.internet.defer.Deferred` and :term:`awaitable objects <awaitable>`.

.. _topics-signals-ref:

Built-in signals reference
==========================

.. module:: scrapy.signals
   :synopsis: Signals definitions

Here's the list of Scrapy built-in signals and their meaning.

Engine signals
--------------

engine_started
~~~~~~~~~~~~~~

.. signal:: engine_started
.. function:: engine_started()

    Sent when the Scrapy engine has started crawling.

    This signal supports :ref:`asynchronous handlers <signal-deferred>`.

.. note:: This signal may be fired *after* the :signal:`spider_opened` signal,
    depending on how the spider was started. So **don't** rely on this signal
    getting fired before :signal:`spider_opened`.

engine_stopped
~~~~~~~~~~~~~~

.. signal:: engine_stopped
.. function:: engine_stopped()

    Sent when the Scrapy engine is stopped (for example, when a crawling
    process has finished).

    This signal supports :ref:`asynchronous handlers <signal-deferred>`.

scheduler_empty
~~~~~~~~~~~~~~~

.. signal:: scheduler_empty
.. function:: scheduler_empty()

    Sent whenever the engine asks for a pending request from the
    :ref:`scheduler <topics-scheduler>` (i.e. calls its
    :meth:`~scrapy.core.scheduler.BaseScheduler.next_request` method) and the
    scheduler returns ``None``.

    See :ref:`start-requests-lazy` for an example.

    This signal does not support :ref:`asynchronous handlers <signal-deferred>`.

Item signals
------------

.. note::
    As at most :setting:`CONCURRENT_ITEMS` items are processed in
    parallel, many deferreds are fired together using
    :class:`~twisted.internet.defer.DeferredList`. Hence the next
    batch of scraped items waits for that
    :class:`~twisted.internet.defer.DeferredList` to fire and then
    runs the respective item signal handlers.

item_scraped
~~~~~~~~~~~~

.. signal:: item_scraped
.. function:: item_scraped(item, response, spider)

    Sent when an item has been scraped, after it has passed all the
    :ref:`topics-item-pipeline` stages (without being dropped).

    This signal supports :ref:`asynchronous handlers <signal-deferred>`.

    :param item: the scraped item
    :type item: :ref:`item object <item-types>`

    :param spider: the spider which scraped the item
    :type spider: :class:`~scrapy.Spider` object

    :param response: the response from where the item was scraped, or ``None``
        if it was yielded from :meth:`~scrapy.Spider.start`.
    :type response: :class:`~scrapy.http.Response` | ``None``

item_dropped
~~~~~~~~~~~~

.. signal:: item_dropped
.. function:: item_dropped(item, response, exception, spider)

    Sent after an item has been dropped from the :ref:`topics-item-pipeline`
    when some stage raised a :exc:`~scrapy.exceptions.DropItem` exception.

    This signal supports :ref:`asynchronous handlers <signal-deferred>`.

    :param item: the item dropped from the :ref:`topics-item-pipeline`
    :type item: :ref:`item object <item-types>`

    :param spider: the spider which scraped the item
    :type spider: :class:`~scrapy.Spider` object

    :param response: the response from where the item was dropped, or ``None``
        if it was yielded from :meth:`~scrapy.Spider.start`.
    :type response: :class:`~scrapy.http.Response` | ``None``

    :param exception: the exception (which must be a
        :exc:`~scrapy.exceptions.DropItem` subclass) which caused the item
        to be dropped
    :type exception: :exc:`~scrapy.exceptions.DropItem` exception

item_error
~~~~~~~~~~

.. signal:: item_error
.. function:: item_error(item, response, spider, failure)

    Sent when a :ref:`topics-item-pipeline` generates an error (i.e. raises
    an exception), except :exc:`~scrapy.exceptions.DropItem` exception.

    This signal supports :ref:`asynchronous handlers <signal-deferred>`.

    :param item: the item that caused the error in the :ref:`topics-item-pipeline`
    :type item: :ref:`item object <item-types>`

    :param response: the response being processed when the exception was
        raised, or ``None`` if it was yielded from
        :meth:`~scrapy.Spider.start`.
    :type response: :class:`~scrapy.http.Response` | ``None``

    :param spider: the spider which raised the exception
    :type spider: :class:`~scrapy.Spider` object

    :param failure: the exception raised
    :type failure: twisted.python.failure.Failure

Spider signals
--------------

spider_closed
~~~~~~~~~~~~~

.. signal:: spider_closed
.. function:: spider_closed(spider, reason)

    Sent after a spider has been closed. This can be used to release per-spider
    resources reserved on :signal:`spider_opened`.

    This signal supports :ref:`asynchronous handlers <signal-deferred>`.

    :param spider: the spider which has been closed
    :type spider: :class:`~scrapy.Spider` object

    :param reason: a string which describes the reason why the spider was closed. If
        it was closed because the spider has completed scraping, the reason
        is ``'finished'``. Otherwise, if the spider was manually closed by
        calling the ``close_spider`` engine method, then the reason is the one
        passed in the ``reason`` argument of that method (which defaults to
        ``'cancelled'``). If the engine was shut down (for example, by hitting
        Ctrl-C to stop it) the reason will be ``'shutdown'``.
    :type reason: str

spider_opened
~~~~~~~~~~~~~

.. signal:: spider_opened
.. function:: spider_opened(spider)

    Sent after a spider has been opened for crawling. This is typically used to
    reserve per-spider resources, but can be used for any task that needs to be
    performed when a spider is opened.

    This signal supports :ref:`asynchronous handlers <signal-deferred>`.

    :param spider: the spider which has been opened
    :type spider: :class:`~scrapy.Spider` object

spider_idle
~~~~~~~~~~~

.. signal:: spider_idle
.. function:: spider_idle(spider)

    Sent when a spider has gone idle, which means the spider has no further:

        * requests waiting to be downloaded
        * requests scheduled
        * items being processed in the item pipeline

    If the idle state persists after all handlers of this signal have finished,
    the engine starts closing the spider. After the spider has finished
    closing, the :signal:`spider_closed` signal is sent.

    You may raise a :exc:`~scrapy.exceptions.DontCloseSpider` exception to
    prevent the spider from being closed.

    Alternatively, you may raise a :exc:`~scrapy.exceptions.CloseSpider`
    exception to provide a custom spider closing reason. An
    idle handler is the perfect place to put some code that assesses
    the final spider results and updates the final closing reason
    accordingly (e.g. setting it to ``'too_few_results'`` instead of
    ``'finished'``).

    This signal does not support :ref:`asynchronous handlers <signal-deferred>`.

    :param spider: the spider which has gone idle
    :type spider: :class:`~scrapy.Spider` object

    .. note:: Scheduling some requests in your :signal:`spider_idle` handler does
        **not** guarantee that it can prevent the spider from being closed,
        although it sometimes can. That's because the spider may still remain idle
        if all the scheduled requests are rejected by the scheduler (e.g. filtered
        due to duplication).
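
    For example, a minimal sketch of an extension (the class name and the
    ``has_pending_work`` flag are hypothetical) that keeps the spider open
    while some external work is still pending:

    .. code-block:: python

        from scrapy import signals
        from scrapy.exceptions import DontCloseSpider


        class KeepAliveExtension:
            @classmethod
            def from_crawler(cls, crawler):
                ext = cls()
                crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)
                return ext

            def spider_idle(self, spider):
                # ``has_pending_work`` is a hypothetical attribute that other
                # code would set while asynchronous work is outstanding.
                if getattr(spider, "has_pending_work", False):
                    raise DontCloseSpider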

spider_error
~~~~~~~~~~~~

.. signal:: spider_error
.. function:: spider_error(failure, response, spider)

    Sent when a spider callback generates an error (i.e. raises an exception).

    This signal does not support :ref:`asynchronous handlers <signal-deferred>`.

    :param failure: the exception raised
    :type failure: twisted.python.failure.Failure

    :param response: the response being processed when the exception was raised
    :type response: :class:`~scrapy.http.Response` object

    :param spider: the spider which raised the exception
    :type spider: :class:`~scrapy.Spider` object

feed_slot_closed
~~~~~~~~~~~~~~~~

.. signal:: feed_slot_closed
.. function:: feed_slot_closed(slot)

    Sent when a :ref:`feed exports <topics-feed-exports>` slot is closed.

    This signal supports :ref:`asynchronous handlers <signal-deferred>`.

    :param slot: the slot closed
    :type slot: scrapy.extensions.feedexport.FeedSlot

feed_exporter_closed
~~~~~~~~~~~~~~~~~~~~

.. signal:: feed_exporter_closed
.. function:: feed_exporter_closed()

    Sent when the :ref:`feed exports <topics-feed-exports>` extension is closed,
    during the handling of the :signal:`spider_closed` signal by the extension,
    after all feed exporting has been handled.

    This signal supports :ref:`asynchronous handlers <signal-deferred>`.

memusage_warning_reached
~~~~~~~~~~~~~~~~~~~~~~~~

.. signal:: memusage_warning_reached

.. function:: memusage_warning_reached()

    Sent by the :class:`~scrapy.extensions.memusage.MemoryUsage` extension when the
    memory usage reaches the warning threshold (:setting:`MEMUSAGE_WARNING_MB`).

    This signal does not support :ref:`asynchronous handlers <signal-deferred>`.

Request signals
---------------

request_scheduled
~~~~~~~~~~~~~~~~~

.. signal:: request_scheduled
.. function:: request_scheduled(request, spider)

    Sent when the engine is asked to schedule a :class:`~scrapy.Request`, to be
    downloaded later, before the request reaches the :ref:`scheduler
    <topics-scheduler>`.

    Raise :exc:`~scrapy.exceptions.IgnoreRequest` to drop a request before it
    reaches the scheduler.

    This signal does not support :ref:`asynchronous handlers <signal-deferred>`.

    .. versionadded:: 2.11.2
        Allow dropping requests with :exc:`~scrapy.exceptions.IgnoreRequest`.

    :param request: the request that reached the scheduler
    :type request: :class:`~scrapy.Request` object

    :param spider: the spider that yielded the request
    :type spider: :class:`~scrapy.Spider` object

request_dropped
~~~~~~~~~~~~~~~

.. signal:: request_dropped
.. function:: request_dropped(request, spider)

    Sent when a :class:`~scrapy.Request`, scheduled by the engine to be
    downloaded later, is rejected by the scheduler.

    This signal does not support :ref:`asynchronous handlers <signal-deferred>`.

    :param request: the request that reached the scheduler
    :type request: :class:`~scrapy.Request` object

    :param spider: the spider that yielded the request
    :type spider: :class:`~scrapy.Spider` object

request_reached_downloader
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. signal:: request_reached_downloader
.. function:: request_reached_downloader(request, spider)

    Sent when a :class:`~scrapy.Request` reaches the downloader.

    This signal does not support :ref:`asynchronous handlers <signal-deferred>`.

    :param request: the request that reached the downloader
    :type request: :class:`~scrapy.Request` object

    :param spider: the spider that yielded the request
    :type spider: :class:`~scrapy.Spider` object

request_left_downloader
~~~~~~~~~~~~~~~~~~~~~~~

.. signal:: request_left_downloader
.. function:: request_left_downloader(request, spider)

    Sent when a :class:`~scrapy.Request` leaves the downloader, even in case of
    failure.

    This signal does not support :ref:`asynchronous handlers <signal-deferred>`.

    :param request: the request that reached the downloader
    :type request: :class:`~scrapy.Request` object

    :param spider: the spider that yielded the request
    :type spider: :class:`~scrapy.Spider` object

bytes_received
~~~~~~~~~~~~~~

.. signal:: bytes_received
.. function:: bytes_received(data, request, spider)

    Sent by the HTTP 1.1 and S3 download handlers when a group of bytes is
    received for a specific request. This signal might be fired multiple
    times for the same request, with partial data each time. For instance,
    a possible scenario for a 25 kb response would be two signals fired
    with 10 kb of data, and a final one with 5 kb of data.

    Handlers for this signal can stop the download of a response while it
    is in progress by raising the :exc:`~scrapy.exceptions.StopDownload`
    exception. Please refer to the :ref:`topics-stop-response-download` topic
    for additional information and examples.

    This signal does not support :ref:`asynchronous handlers <signal-deferred>`.

    :param data: the data received by the download handler
    :type data: :class:`bytes` object

    :param request: the request that generated the download
    :type request: :class:`~scrapy.Request` object

    :param spider: the spider associated with the response
    :type spider: :class:`~scrapy.Spider` object

headers_received
~~~~~~~~~~~~~~~~

.. signal:: headers_received
.. function:: headers_received(headers, body_length, request, spider)

    Sent by the HTTP 1.1 and S3 download handlers when the response headers are
    available for a given request, before downloading any additional content.

    Handlers for this signal can stop the download of a response while it
    is in progress by raising the :exc:`~scrapy.exceptions.StopDownload`
    exception. Please refer to the :ref:`topics-stop-response-download` topic
    for additional information and examples.

    This signal does not support :ref:`asynchronous handlers <signal-deferred>`.

    :param headers: the headers received by the download handler
    :type headers: :class:`scrapy.http.headers.Headers` object

    :param body_length: expected size of the response body, in bytes
    :type body_length: `int`

    :param request: the request that generated the download
    :type request: :class:`~scrapy.Request` object

    :param spider: the spider associated with the response
    :type spider: :class:`~scrapy.Spider` object
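
    For example, a minimal sketch (the extension class and the 1 MiB threshold
    are illustrative) that stops downloading oversized response bodies:

    .. code-block:: python

        from scrapy import signals
        from scrapy.exceptions import StopDownload


        class BodySizeGuard:
            @classmethod
            def from_crawler(cls, crawler):
                ext = cls()
                crawler.signals.connect(
                    ext.headers_received, signal=signals.headers_received
                )
                return ext

            def headers_received(self, headers, body_length, request, spider):
                # With fail=False, the request callback still runs, receiving
                # whatever body was downloaded so far (here: none).
                if body_length and body_length > 1024 * 1024:
                    raise StopDownload(fail=False)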

Response signals
----------------

response_received
~~~~~~~~~~~~~~~~~

.. signal:: response_received
.. function:: response_received(response, request, spider)

    Sent when the engine receives a new :class:`~scrapy.http.Response` from the
    downloader.

    This signal does not support :ref:`asynchronous handlers <signal-deferred>`.

    :param response: the response received
    :type response: :class:`~scrapy.http.Response` object

    :param request: the request that generated the response
    :type request: :class:`~scrapy.Request` object

    :param spider: the spider for which the response is intended
    :type spider: :class:`~scrapy.Spider` object

.. note:: The ``request`` argument might not contain the original request that
    reached the downloader, if a :ref:`topics-downloader-middleware` modifies
    the :class:`~scrapy.http.Response` object and sets a specific ``request``
    attribute.

response_downloaded
~~~~~~~~~~~~~~~~~~~

.. signal:: response_downloaded
.. function:: response_downloaded(response, request, spider)

    Sent by the downloader right after a :class:`~scrapy.http.Response` is downloaded.

    This signal does not support :ref:`asynchronous handlers <signal-deferred>`.

    :param response: the response downloaded
    :type response: :class:`~scrapy.http.Response` object

    :param request: the request that generated the response
    :type request: :class:`~scrapy.Request` object

    :param spider: the spider for which the response is intended
    :type spider: :class:`~scrapy.Spider` object


.. _topics-spider-middleware:

=================
Spider Middleware
=================

The spider middleware is a framework of hooks into Scrapy's spider processing
mechanism where you can plug custom functionality to process the responses that
are sent to :ref:`topics-spiders` for processing and to process the requests
and items that are generated from spiders.

.. _topics-spider-middleware-setting:

Activating a spider middleware
==============================

To activate a spider middleware component, add it to the
:setting:`SPIDER_MIDDLEWARES` setting, which is a dict whose keys are the
middleware class path and their values are the middleware orders.

Here's an example:

.. code-block:: python

    SPIDER_MIDDLEWARES = {
        "myproject.middlewares.CustomSpiderMiddleware": 543,
    }

The :setting:`SPIDER_MIDDLEWARES` setting is merged with the
:setting:`SPIDER_MIDDLEWARES_BASE` setting defined in Scrapy (and not meant to
be overridden) and then sorted by order to get the final sorted list of enabled
middlewares: the first middleware is the one closer to the engine and the last
is the one closer to the spider. In other words,
the :meth:`~scrapy.spidermiddlewares.SpiderMiddleware.process_spider_input`
method of each middleware will be invoked in increasing
middleware order (100, 200, 300, ...), and the
:meth:`~scrapy.spidermiddlewares.SpiderMiddleware.process_spider_output` method
of each middleware will be invoked in decreasing order.

To decide which order to assign to your middleware see the
:setting:`SPIDER_MIDDLEWARES_BASE` setting and pick a value according to where
you want to insert the middleware. The order does matter because each
middleware performs a different action and your middleware could depend on some
previous (or subsequent) middleware being applied.

If you want to disable a builtin middleware (the ones defined in
:setting:`SPIDER_MIDDLEWARES_BASE`, and enabled by default) you must define it
in your project :setting:`SPIDER_MIDDLEWARES` setting and assign ``None`` as its
value. For example, if you want to disable the referer middleware:

.. code-block:: python

    SPIDER_MIDDLEWARES = {
        "scrapy.spidermiddlewares.referer.RefererMiddleware": None,
        "myproject.middlewares.CustomRefererSpiderMiddleware": 700,
    }

Finally, keep in mind that some middlewares may need to be enabled through a
particular setting. See each middleware documentation for more info.

.. _custom-spider-middleware:

Writing your own spider middleware
==================================

Each spider middleware is a :ref:`component <topics-components>` that defines
one or more of these methods:

.. module:: scrapy.spidermiddlewares

.. class:: SpiderMiddleware

    .. method:: process_start(start: AsyncIterator[Any], /) -> AsyncIterator[Any]
        :async:

        Iterate over the output of :meth:`~scrapy.Spider.start` or that
        of the :meth:`process_start` method of an earlier spider middleware,
        overriding it. For example:

        .. code-block:: python

            async def process_start(self, start):
                async for item_or_request in start:
                    yield item_or_request

        You may yield the same type of objects as :meth:`~scrapy.Spider.start`.

        To write spider middlewares that work on Scrapy versions lower than
        2.13, define also a synchronous ``process_start_requests()`` method
        that returns an iterable. For example:

        .. code-block:: python

            def process_start_requests(self, start, spider):
                yield from start

    .. method:: process_spider_input(response)

        This method is called for each response that goes through the spider
        middleware and into the spider, for processing.

        :meth:`process_spider_input` should return ``None`` or raise an
        exception.

        If it returns ``None``, Scrapy will continue processing this response,
        executing all other middlewares until, finally, the response is handed
        to the spider for processing.

        If it raises an exception, Scrapy won't bother calling any other spider
        middleware :meth:`process_spider_input` and will call the request
        errback if there is one, otherwise it will start the :meth:`process_spider_exception`
        chain. The output of the errback is chained back in the other
        direction for :meth:`process_spider_output` to process it, or
        :meth:`process_spider_exception` if it raised an exception.

        :param response: the response being processed
        :type response: :class:`~scrapy.http.Response` object

    .. method:: process_spider_output(response, result)

        This method is called with the results returned from the Spider, after
        it has processed the response.

        :meth:`process_spider_output` must return an iterable of
        :class:`~scrapy.Request` objects and :ref:`item objects
        <topics-items>`.

        Consider defining this method as an :term:`asynchronous generator`,
        which will be a requirement in a future version of Scrapy. However, if
        you plan on sharing your spider middleware with other people, consider
        either :ref:`enforcing Scrapy 2.7 <enforce-component-requirements>`
        as a minimum requirement of your spider middleware, or :ref:`making
        your spider middleware universal <universal-spider-middleware>` so that
        it works with Scrapy versions earlier than Scrapy 2.7.

        :param response: the response which generated this output from the
          spider
        :type response: :class:`~scrapy.http.Response` object

        :param result: the result returned by the spider
        :type result: an iterable of :class:`~scrapy.Request` objects and
          :ref:`item objects <topics-items>`

    .. method:: process_spider_output_async(response, result)
        :async:

        If defined, this method must be an :term:`asynchronous generator`,
        which will be called instead of :meth:`process_spider_output` if
        ``result`` is an :term:`asynchronous iterable`.

    .. method:: process_spider_exception(response, exception)

        This method is called when a spider or :meth:`process_spider_output`
        method (from a previous spider middleware) raises an exception.

        :meth:`process_spider_exception` should return either ``None`` or an
        iterable of :class:`~scrapy.Request` or :ref:`item <topics-items>`
        objects.

        If it returns ``None``, Scrapy will continue processing this exception,
        executing any other :meth:`process_spider_exception` in the following
        middleware components, until no middleware components are left and the
        exception reaches the engine (where it's logged and discarded).

        If it returns an iterable, the :meth:`process_spider_output` pipeline
        kicks in, starting from the next spider middleware, and no other
        :meth:`process_spider_exception` will be called.

        :param response: the response being processed when the exception was
          raised
        :type response: :class:`~scrapy.http.Response` object

        :param exception: the exception raised
        :type exception: :exc:`Exception` object
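
For example, here is a minimal sketch of a spider middleware (the class name
and the ``title`` field are illustrative) that drops plain-``dict`` items
missing a ``title`` value, implemented with the synchronous
:meth:`~SpiderMiddleware.process_spider_output` hook:

.. code-block:: python

    import logging

    logger = logging.getLogger(__name__)


    class RequireTitleMiddleware:
        def process_spider_output(self, response, result):
            for item_or_request in result:
                # Drop dict items that lack a truthy "title" value.
                if isinstance(item_or_request, dict) and not item_or_request.get(
                    "title"
                ):
                    logger.warning("Dropping item without title: %s", response.url)
                    continue
                yield item_or_request

Such a middleware would be enabled through the :setting:`SPIDER_MIDDLEWARES`
setting, as described above.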

Base class for custom spider middlewares
----------------------------------------

Scrapy provides a base class for custom spider middlewares. It's not required
to use it but it can help with simplifying middleware implementations and
reducing the amount of boilerplate code in :ref:`universal middlewares
<universal-spider-middleware>`.

.. module:: scrapy.spidermiddlewares.base

.. autoclass:: BaseSpiderMiddleware
   :members:

.. _topics-spider-middleware-ref:

Built-in spider middleware reference
====================================

This page describes all spider middleware components that come with Scrapy. For
information on how to use them and how to write your own spider middleware, see
the :ref:`spider middleware usage guide <topics-spider-middleware>`.

For a list of the components enabled by default (and their orders) see the
:setting:`SPIDER_MIDDLEWARES_BASE` setting.

DepthMiddleware
---------------

.. module:: scrapy.spidermiddlewares.depth
   :synopsis: Depth Spider Middleware

.. class:: DepthMiddleware

   DepthMiddleware is used for tracking the depth of each Request inside the
   site being scraped. It works by setting ``request.meta['depth'] = 0`` whenever
   there is no value previously set (usually just the first Request) and
   incrementing it by 1 otherwise.

   It can be used to limit the maximum depth to scrape, control Request
   priority based on their depth, and things like that.

   The :class:`DepthMiddleware` can be configured through the following
   settings (see the settings documentation for more info):

      * :setting:`DEPTH_LIMIT` - The maximum depth that will be allowed to
        crawl for any site. If zero, no limit will be imposed.
      * :setting:`DEPTH_STATS_VERBOSE` - Whether to collect the number of
        requests for each depth.
      * :setting:`DEPTH_PRIORITY` - Whether to prioritize the requests based on
        their depth.
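
   For example, to stop following links more than three hops away from your
   start requests, you could set in your project's ``settings.py``:

   .. code-block:: python

       DEPTH_LIMIT = 3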

HttpErrorMiddleware
-------------------

.. module:: scrapy.spidermiddlewares.httperror
   :synopsis: HTTP Error Spider Middleware

.. class:: HttpErrorMiddleware

    Filter out unsuccessful (erroneous) HTTP responses so that spiders don't
    have to deal with them, which (most of the time) imposes an overhead,
    consumes more resources, and makes the spider logic more complex.

According to the `HTTP standard`_, successful responses are those whose
status codes are in the 2xx range (200-299).

.. _HTTP standard: https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

If you still want to process response codes outside that range, you can
specify which response codes the spider is able to handle using the
``handle_httpstatus_list`` spider attribute or
:setting:`HTTPERROR_ALLOWED_CODES` setting.

For example, if you want your spider to handle 404 responses you can do
this:

.. code-block:: python

    from scrapy.spiders import CrawlSpider

    class MySpider(CrawlSpider):
        handle_httpstatus_list = [404]

.. reqmeta:: handle_httpstatus_list

.. reqmeta:: handle_httpstatus_all

The ``handle_httpstatus_list`` key of :attr:`Request.meta
<scrapy.Request.meta>` can also be used to specify which response codes to
allow on a per-request basis. You can also set the meta key ``handle_httpstatus_all``
to ``True`` if you want to allow any response code for a request, and ``False`` to
disable the effects of the ``handle_httpstatus_all`` key.
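
For example, a minimal sketch that lets a single request reach its callback
even on a 404 response (the spider name and URL are illustrative):

.. code-block:: python

    import scrapy


    class NotFoundSpider(scrapy.Spider):
        name = "notfound"

        async def start(self):
            yield scrapy.Request(
                "http://www.example.com/maybe-missing",
                meta={"handle_httpstatus_list": [404]},
            )

        def parse(self, response):
            if response.status == 404:
                self.logger.info("Got a 404 from %s", response.url)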

Keep in mind, however, that it's usually a bad idea to handle non-200
responses, unless you really know what you're doing.

For more information see: `HTTP Status Code Definitions`_.

.. _HTTP Status Code Definitions: https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

HttpErrorMiddleware settings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. setting:: HTTPERROR_ALLOWED_CODES

HTTPERROR_ALLOWED_CODES
^^^^^^^^^^^^^^^^^^^^^^^

Default: ``[]``

Pass all responses with non-200 status codes contained in this list.

.. setting:: HTTPERROR_ALLOW_ALL

HTTPERROR_ALLOW_ALL
^^^^^^^^^^^^^^^^^^^

Default: ``False``

Pass all responses, regardless of their status codes.

RefererMiddleware
-----------------

.. module:: scrapy.spidermiddlewares.referer
   :synopsis: Referer Spider Middleware

.. class:: RefererMiddleware

   Populates Request ``Referer`` header, based on the URL of the Response which
   generated it.

RefererMiddleware settings
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. setting:: REFERER_ENABLED

REFERER_ENABLED
^^^^^^^^^^^^^^^

Default: ``True``

Whether to enable the referer middleware.

.. setting:: REFERRER_POLICY

REFERRER_POLICY
^^^^^^^^^^^^^^^

Default: ``'scrapy.spidermiddlewares.referer.DefaultReferrerPolicy'``

.. reqmeta:: referrer_policy

`Referrer Policy`_ to apply when populating Request "Referer" header.

.. note::
    You can also set the Referrer Policy per request,
    using the special ``"referrer_policy"`` :ref:`Request.meta <topics-request-meta>` key,
    with the same acceptable values as for the ``REFERRER_POLICY`` setting.
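
For example, a minimal sketch of a per-request policy override (the spider
name and URLs are illustrative):

.. code-block:: python

    import scrapy


    class PolicySpider(scrapy.Spider):
        name = "policy-example"
        start_urls = ["https://example.com/"]

        def parse(self, response):
            # Send referrer information for this request only when the
            # target shares the same origin.
            yield response.follow(
                "next.html",
                meta={"referrer_policy": "same-origin"},
            )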

Acceptable values for REFERRER_POLICY
*************************************

- either a path to a :class:`scrapy.spidermiddlewares.referer.ReferrerPolicy`
  subclass — a custom policy or one of the built-in ones (see classes below),
- or one or more comma-separated standard W3C-defined string values,
- or the special ``"scrapy-default"``.

=======================================  ========================================================================
String value                             Class name (as a string)
=======================================  ========================================================================
``"scrapy-default"`` (default)           :class:`scrapy.spidermiddlewares.referer.DefaultReferrerPolicy`
`"no-referrer"`_                         :class:`scrapy.spidermiddlewares.referer.NoReferrerPolicy`
`"no-referrer-when-downgrade"`_          :class:`scrapy.spidermiddlewares.referer.NoReferrerWhenDowngradePolicy`
`"same-origin"`_                         :class:`scrapy.spidermiddlewares.referer.SameOriginPolicy`
`"origin"`_                              :class:`scrapy.spidermiddlewares.referer.OriginPolicy`
`"strict-origin"`_                       :class:`scrapy.spidermiddlewares.referer.StrictOriginPolicy`
`"origin-when-cross-origin"`_            :class:`scrapy.spidermiddlewares.referer.OriginWhenCrossOriginPolicy`
`"strict-origin-when-cross-origin"`_     :class:`scrapy.spidermiddlewares.referer.StrictOriginWhenCrossOriginPolicy`
`"unsafe-url"`_                          :class:`scrapy.spidermiddlewares.referer.UnsafeUrlPolicy`
=======================================  ========================================================================

.. autoclass:: ReferrerPolicy

.. autoclass:: DefaultReferrerPolicy
.. warning::
    Scrapy's default referrer policy — just like `"no-referrer-when-downgrade"`_,
    the W3C-recommended value for browsers — will send a non-empty
    "Referer" header from any ``http(s)://`` to any ``https://`` URL,
    even if the domain is different.

    `"same-origin"`_ may be a better choice if you want to remove referrer
    information for cross-domain requests.

.. autoclass:: NoReferrerPolicy

.. autoclass:: NoReferrerWhenDowngradePolicy
.. note::
    "no-referrer-when-downgrade" policy is the W3C-recommended default,
    and is used by major web browsers.

    However, it is NOT Scrapy's default referrer policy (see :class:`DefaultReferrerPolicy`).

.. autoclass:: SameOriginPolicy

.. autoclass:: OriginPolicy

.. autoclass:: StrictOriginPolicy

.. autoclass:: OriginWhenCrossOriginPolicy

.. autoclass:: StrictOriginWhenCrossOriginPolicy

.. autoclass:: UnsafeUrlPolicy
.. warning::
    "unsafe-url" policy is NOT recommended.

.. _Referrer Policy: https://www.w3.org/TR/referrer-policy
.. _"no-referrer": https://www.w3.org/TR/referrer-policy/#referrer-policy-no-referrer
.. _"no-referrer-when-downgrade": https://www.w3.org/TR/referrer-policy/#referrer-policy-no-referrer-when-downgrade
.. _"same-origin": https://www.w3.org/TR/referrer-policy/#referrer-policy-same-origin
.. _"origin": https://www.w3.org/TR/referrer-policy/#referrer-policy-origin
.. _"strict-origin": https://www.w3.org/TR/referrer-policy/#referrer-policy-strict-origin
.. _"origin-when-cross-origin": https://www.w3.org/TR/referrer-policy/#referrer-policy-origin-when-cross-origin
.. _"strict-origin-when-cross-origin": https://www.w3.org/TR/referrer-policy/#referrer-policy-strict-origin-when-cross-origin
.. _"unsafe-url": https://www.w3.org/TR/referrer-policy/#referrer-policy-unsafe-url

.. setting:: REFERRER_POLICIES

REFERRER_POLICIES
^^^^^^^^^^^^^^^^^

.. versionadded:: 2.14.2

Default: ``{}``

A dictionary mapping policy names to import paths of
:class:`scrapy.spidermiddlewares.referer.ReferrerPolicy` subclasses, or
``None`` to disable support for a given policy name.

This allows overriding the policies triggered by the ``Referrer-Policy``
response header.

Use ``""`` to override the policy for responses with `no referrer policy
<https://www.w3.org/TR/referrer-policy/#referrer-policy-empty-string>`__.
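
For example (the custom policy path is hypothetical):

.. code-block:: python

    # settings.py
    REFERRER_POLICIES = {
        # Handle "same-origin" with a custom ReferrerPolicy subclass.
        "same-origin": "myproject.policies.CustomSameOriginPolicy",
        # Ignore the "unsafe-url" policy when a response declares it.
        "unsafe-url": None,
    }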

StartSpiderMiddleware
---------------------

.. module:: scrapy.spidermiddlewares.start

.. autoclass:: StartSpiderMiddleware

UrlLengthMiddleware
-------------------

.. module:: scrapy.spidermiddlewares.urllength
   :synopsis: URL Length Spider Middleware

.. class:: UrlLengthMiddleware

   Filters out requests with URLs longer than :setting:`URLLENGTH_LIMIT`.

   The :class:`UrlLengthMiddleware` can be configured through the following
   settings (see the settings documentation for more info):

      * :setting:`URLLENGTH_LIMIT` - The maximum URL length to allow for crawled URLs.


.. _topics-spiders:

=======
Spiders
=======

Spiders are classes which define how a certain site (or a group of sites) will be
scraped, including how to perform the crawl (i.e. follow links) and how to
extract structured data from their pages (i.e. scraping items). In other words,
Spiders are the place where you define the custom behaviour for crawling and
parsing pages for a particular site (or, in some cases, a group of sites).

For spiders, the scraping cycle goes through something like this:

1. You start by generating the initial requests to crawl the first URLs, and
   specify a callback function to be called with the response downloaded from
   those requests.

   The first requests to perform are obtained by iterating the
   :meth:`~scrapy.Spider.start` method, which by default yields a
   :class:`~scrapy.Request` object for each URL in the
   :attr:`~scrapy.Spider.start_urls` spider attribute, with the
   :attr:`~scrapy.Spider.parse` method set as :attr:`~scrapy.Request.callback`
   function to handle each :class:`~scrapy.http.Response`.

2. In the callback function, you parse the response (web page) and return
   :ref:`item objects <topics-items>`,
   :class:`~scrapy.Request` objects, or an iterable of these objects.
   Those Requests will also contain a callback (maybe the same one), will then
   be downloaded by Scrapy, and their responses will be handled by the
   specified callback.

3. In callback functions, you parse the page contents, typically using
   :ref:`topics-selectors` (but you can also use BeautifulSoup, lxml or whatever
   mechanism you prefer) and generate items with the parsed data.

4. Finally, the items returned from the spider will be typically persisted to a
   database (in some :ref:`Item Pipeline <topics-item-pipeline>`) or written to
   a file using :ref:`topics-feed-exports`.

Even though this cycle applies (more or less) to any kind of spider, there are
different kinds of default spiders bundled into Scrapy for different purposes.
We will talk about those types here.

.. _topics-spiders-ref:

scrapy.Spider
=============

.. class:: scrapy.spiders.Spider
.. autoclass:: scrapy.Spider

   .. attribute:: name

       A string which defines the name for this spider. The spider name is how
       the spider is located (and instantiated) by Scrapy, so it must be
       unique. However, nothing prevents you from instantiating more than one
       instance of the same spider. This is the most important spider attribute
       and it's required.

       If the spider scrapes a single domain, a common practice is to name the
       spider after the domain, with or without the `TLD`_. So, for example, a
       spider that crawls ``mywebsite.com`` would often be called
       ``mywebsite``.

   .. attribute:: allowed_domains

       An optional list of strings containing domains that this spider is
       allowed to crawl. Requests for URLs not belonging to the domain names
       specified in this list (or their subdomains) won't be followed if
       :class:`~scrapy.downloadermiddlewares.offsite.OffsiteMiddleware` is
       enabled.

       For example, if your target URL is ``https://www.example.com/1.html``,
       add ``'example.com'`` to the list.

   .. autoattribute:: start_urls

   .. attribute:: custom_settings

      A dictionary of settings that will be overridden from the project wide
      configuration when running this spider. It must be defined as a class
      attribute since the settings are updated before instantiation.

      For a list of available built-in settings see:
      :ref:`topics-settings-ref`.
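
      For example (the values shown are illustrative):

      .. code-block:: python

          import scrapy


          class MySpider(scrapy.Spider):
              name = "myspider"
              custom_settings = {
                  "DOWNLOAD_DELAY": 1,
                  "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
              }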

   .. attribute:: crawler

      This attribute is set by the :meth:`from_crawler` class method after
      initializing the class, and links to the
      :class:`~scrapy.crawler.Crawler` object to which this spider instance is
      bound.

      Crawlers encapsulate a lot of components in the project for their single
      entry access (such as extensions, middlewares, signals managers, etc).
      See :ref:`topics-api-crawler` to know more about them.

   .. attribute:: settings

      Configuration for running this spider. This is a
      :class:`~scrapy.settings.Settings` instance, see the
      :ref:`topics-settings` topic for a detailed introduction on this subject.

   .. attribute:: logger

      Python logger created with the Spider's :attr:`name`. You can use it to
      send log messages through it as described on
      :ref:`topics-logging-from-spiders`.

   .. attribute:: state

      A dict you can use to persist some spider state between batches.
      See :ref:`topics-keeping-persistent-state-between-batches` to know more about it.

   .. method:: from_crawler(crawler, *args, **kwargs)

       This is the class method used by Scrapy to create your spiders.

       You probably won't need to override this directly because the default
       implementation acts as a proxy to the :meth:`__init__` method, calling
       it with the given arguments ``args`` and named arguments ``kwargs``.

       Nonetheless, this method sets the :attr:`crawler` and :attr:`settings`
       attributes in the new instance so they can be accessed later inside the
       spider's code.

       .. versionchanged:: 2.11

           The settings in ``crawler.settings`` can now be modified in this
           method, which is handy if you want to modify them based on
           arguments. As a consequence, these settings aren't the final values
           as they can be modified later by e.g. :ref:`add-ons
           <topics-addons>`. For the same reason, most of the
           :class:`~scrapy.crawler.Crawler` attributes aren't initialized at
           this point.

           The final settings and the initialized
           :class:`~scrapy.crawler.Crawler` attributes are available in the
           :meth:`start` method, handlers of the
           :signal:`engine_started` signal and later.

       :param crawler: crawler to which the spider will be bound
       :type crawler: :class:`~scrapy.crawler.Crawler` instance

       :param args: arguments passed to the :meth:`__init__` method
       :type args: list

       :param kwargs: keyword arguments passed to the :meth:`__init__` method
       :type kwargs: dict

   .. classmethod:: update_settings(settings)

       The ``update_settings()`` method is used to modify the spider's settings
       and is called during initialization of a spider instance.

       It takes a :class:`~scrapy.settings.Settings` object as a parameter and
       can add or update the spider's configuration values. This method is a
       class method, meaning that it is called on the :class:`~scrapy.Spider`
       class and allows all instances of the spider to share the same
       configuration.

       While per-spider settings can be set in
       :attr:`~scrapy.Spider.custom_settings`, using ``update_settings()``
       allows you to dynamically add, remove or change settings based on other
       settings, spider attributes or other factors and use setting priorities
       other than ``'spider'``. Also, it's easy to extend ``update_settings()``
       in a subclass by overriding it, while doing the same with
       :attr:`~scrapy.Spider.custom_settings` can be hard.

       For example, suppose a spider needs to modify :setting:`FEEDS`:

       .. code-block:: python

           import scrapy

           class MySpider(scrapy.Spider):
               name = "myspider"
               custom_feed = {
                   "/home/user/documents/items.json": {
                       "format": "json",
                       "indent": 4,
                   }
               }

               @classmethod
               def update_settings(cls, settings):
                   super().update_settings(settings)
                   settings.setdefault("FEEDS", {}).update(cls.custom_feed)

   .. automethod:: start

   .. method:: parse(response)

       This is the default callback used by Scrapy to process downloaded
       responses, when their requests don't specify a callback.

       The ``parse`` method is in charge of processing the response and returning
       scraped data and/or more URLs to follow. Other Request callbacks have
       the same requirements.

       This method, as well as any other Request callback, must return a
       :class:`~scrapy.Request` object, an :ref:`item object <topics-items>`, an
       iterable of :class:`~scrapy.Request` objects and/or :ref:`item objects
       <topics-items>`, or ``None``.

       :param response: the response to parse
       :type response: :class:`~scrapy.http.Response`

   .. method:: log(message, [level, component])

       Wrapper that sends a log message through the Spider's :attr:`logger`,
       kept for backward compatibility. For more information see
       :ref:`topics-logging-from-spiders`.

   .. method:: closed(reason)

       Called when the spider closes. This method provides a shortcut to
       signals.connect() for the :signal:`spider_closed` signal.

Let's see an example:

.. code-block:: python

    import scrapy

    class MySpider(scrapy.Spider):
        name = "example.com"
        allowed_domains = ["example.com"]
        start_urls = [
            "http://www.example.com/1.html",
            "http://www.example.com/2.html",
            "http://www.example.com/3.html",
        ]

        def parse(self, response):
            self.logger.info("A response from %s just arrived!", response.url)

Return multiple Requests and items from a single callback:

.. code-block:: python

    import scrapy

    class MySpider(scrapy.Spider):
        name = "example.com"
        allowed_domains = ["example.com"]
        start_urls = [
            "http://www.example.com/1.html",
            "http://www.example.com/2.html",
            "http://www.example.com/3.html",
        ]

        def parse(self, response):
            for h3 in response.xpath("//h3").getall():
                yield {"title": h3}

            for href in response.xpath("//a/@href").getall():
                yield scrapy.Request(response.urljoin(href), self.parse)

Instead of :attr:`~.start_urls` you can use :meth:`~scrapy.Spider.start`
directly; to give data more structure you can use :class:`~scrapy.Item`
objects:

.. skip: next
.. code-block:: python

    import scrapy
    from myproject.items import MyItem

    class MySpider(scrapy.Spider):
        name = "example.com"
        allowed_domains = ["example.com"]

        async def start(self):
            yield scrapy.Request("http://www.example.com/1.html", self.parse)
            yield scrapy.Request("http://www.example.com/2.html", self.parse)
            yield scrapy.Request("http://www.example.com/3.html", self.parse)

        def parse(self, response):
            for h3 in response.xpath("//h3").getall():
                yield MyItem(title=h3)

            for href in response.xpath("//a/@href").getall():
                yield scrapy.Request(response.urljoin(href), self.parse)

.. _spiderargs:

Spider arguments
================

Spiders can receive arguments that modify their behaviour. Some common uses for
spider arguments are to define the start URLs or to restrict the crawl to
certain sections of the site, but they can be used to configure any
functionality of the spider.

Spider arguments are passed through the :command:`crawl` command using the
``-a`` option. For example::

    scrapy crawl myspider -a category=electronics

Spiders can access arguments in their ``__init__`` methods:

.. code-block:: python

    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"

        def __init__(self, category=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.start_urls = [f"http://www.example.com/categories/{category}"]
            # ...

The default ``__init__`` method will take any spider arguments
and copy them to the spider as attributes.
The above example can also be written as follows:

.. code-block:: python

    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"

        async def start(self):
            yield scrapy.Request(f"http://www.example.com/categories/{self.category}")

If you are :ref:`running Scrapy from a script <run-from-script>`, you can
specify spider arguments when calling
:meth:`CrawlerProcess.crawl <scrapy.crawler.CrawlerProcess.crawl>` or
:meth:`CrawlerRunner.crawl <scrapy.crawler.CrawlerRunner.crawl>`:

.. skip: next
.. code-block:: python

    process = CrawlerProcess()
    process.crawl(MySpider, category="electronics")

Keep in mind that spider arguments are only strings.
The spider will not do any parsing on its own.
If you were to set the ``start_urls`` attribute from the command line,
you would have to parse it on your own into a list
using something like :func:`ast.literal_eval` or :func:`json.loads`
and then set it as an attribute.
Otherwise, you would cause iteration over a ``start_urls`` string
(a very common Python pitfall),
resulting in each character being seen as a separate URL.
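
For example, a minimal sketch that parses a JSON-encoded ``start_urls``
argument, passed as ``-a start_urls='["http://www.example.com/1.html"]'``
(the spider name is illustrative):

.. code-block:: python

    import json

    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # the default __init__ stored start_urls as a plain string
            if isinstance(getattr(self, "start_urls", None), str):
                self.start_urls = json.loads(self.start_urls)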

A valid use case is to set the HTTP auth credentials
used by :class:`~scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware`::

    scrapy crawl myspider -a http_user=myuser -a http_pass=mypassword

Spider arguments can also be passed through the Scrapyd ``schedule.json`` API.
See `Scrapyd documentation`_.

.. _spiderargs-scrapy-spider-metadata:

scrapy-spider-metadata parameters
---------------------------------

Another way to pass spider arguments is through the `scrapy-spider-metadata`_
library.

It allows Scrapy spiders to define, validate, document and pre-process their
arguments as Pydantic models.

The following example shows how to define typed parameters, where a string
argument is automatically converted to an integer:

.. code-block:: python

    import scrapy
    from pydantic import BaseModel
    from scrapy_spider_metadata import Args

    class MyParams(BaseModel):
        pages: int

    class BookSpider(Args[MyParams], scrapy.Spider):
        name = "bookspider"
        start_urls = ["http://books.toscrape.com/catalogue"]

        async def start(self):
            for start_url in self.start_urls:
                for index in range(1, self.args.pages + 1):
                    yield scrapy.Request(f"{start_url}/page-{index}.html")

        def parse(self, response):
            book_links = response.css("article.product_pod h3 a::attr(href)").getall()
            for book_link in book_links:
                yield response.follow(book_link, self.parse_book)

        def parse_book(self, response):
            yield {
                "title": response.css("h1::text").get(),
                "price": response.css("p.price_color::text").get(),
            }

This spider can be called from the command line::

    scrapy crawl bookspider -a pages=2

.. _start-requests:

Start requests
==============

**Start requests** are :class:`~scrapy.Request` objects yielded from the
:meth:`~scrapy.Spider.start` method of a spider or from the
:meth:`~scrapy.spidermiddlewares.SpiderMiddleware.process_start` method of a
:ref:`spider middleware <topics-spider-middleware>`.

.. seealso:: :ref:`start-request-order`

.. _start-requests-lazy:

Delaying start request iteration
--------------------------------

You can override the :meth:`~scrapy.Spider.start` method as follows to pause
its iteration whenever there are scheduled requests:

.. code-block:: python

    async def start(self):
        async for item_or_request in super().start():
            if self.crawler.engine.needs_backout():
                await self.crawler.signals.wait_for(signals.scheduler_empty)
            yield item_or_request

This can help minimize the number of requests in the scheduler at any given
time, to minimize resource usage (memory or disk, depending on
:setting:`JOBDIR`).

.. _builtin-spiders:

Generic Spiders
===============

Scrapy comes with some useful generic spiders that you can subclass your own
spiders from. Their aim is to provide convenient functionality for a few
common scraping cases, like following all links on a site based on certain
rules, crawling from `Sitemaps`_, or parsing an XML/CSV feed.

For the examples used in the following spiders, we'll assume you have a project
with a ``TestItem`` declared in a ``myproject.items`` module:

.. code-block:: python

    import scrapy

    class TestItem(scrapy.Item):
        id = scrapy.Field()
        name = scrapy.Field()
        description = scrapy.Field()

.. currentmodule:: scrapy.spiders

CrawlSpider
-----------

.. class:: CrawlSpider

   This is the most commonly used spider for crawling regular websites, as it
   provides a convenient mechanism for following links by defining a set of rules.
   It may not be the best suited for your particular websites or project, but
   it's generic enough for several cases, so you can start from it and override it
   as needed for more custom functionality, or just implement your own spider.

   Apart from the attributes inherited from Spider (that you must
   specify), this class supports a new attribute:

   .. attribute:: rules

       A list of one or more :class:`Rule` objects. Each :class:`Rule`
       defines a certain behaviour for crawling the site. :class:`Rule` objects
       are described below. If multiple rules match the same link, the first one
       will be used, according to the order in which they're defined in this attribute.

   This spider also exposes an overridable method:

   .. method:: parse_start_url(response, **kwargs)

      This method is called for each response produced for the URLs in
      the spider's ``start_urls`` attribute. It allows parsing
      the initial responses, and must return either an
      :ref:`item object <topics-items>`, a :class:`~scrapy.Request`
      object, or an iterable containing any of them.

Crawling rules
~~~~~~~~~~~~~~

.. autoclass:: Rule

   ``link_extractor`` is a :ref:`Link Extractor <topics-link-extractors>` object which
   defines how links will be extracted from each crawled page. Each produced link will
   be used to generate a :class:`~scrapy.Request` object, which will contain the
   link's text in its ``meta`` dictionary (under the ``link_text`` key).
   If omitted, a default link extractor created with no arguments will be used,
   resulting in all links being extracted.

   ``callback`` is a callable or a string (in which case a method from the spider
   object with that name will be used) to be called for each link extracted with
   the specified link extractor. This callback receives a :class:`~scrapy.http.Response`
   as its first argument and must return either a single instance or an iterable of
   :ref:`item objects <topics-items>` and/or :class:`~scrapy.Request` objects
   (or any subclass of them). As mentioned above, the received :class:`~scrapy.http.Response`
   object will contain the text of the link that produced the :class:`~scrapy.Request`
   in its ``meta`` dictionary (under the ``link_text`` key).

   ``cb_kwargs`` is a dict containing the keyword arguments to be passed to the
   callback function.

   ``follow`` is a boolean which specifies if links should be followed from each
   response extracted with this rule. If ``callback`` is ``None``, ``follow`` defaults
   to ``True``; otherwise it defaults to ``False``.

   ``process_links`` is a callable, or a string (in which case a method from the
   spider object with that name will be used) which will be called for each list
   of links extracted from each response using the specified ``link_extractor``.
   This is mainly used for filtering purposes.

   ``process_request`` is a callable (or a string, in which case a method from
   the spider object with that name will be used) which will be called for every
   :class:`~scrapy.Request` extracted by this rule. This callable should
   take said request as first argument and the :class:`~scrapy.http.Response`
   from which the request originated as second argument. It must return a
   ``Request`` object or ``None`` (to filter out the request).

   ``errback`` is a callable or a string (in which case a method from the spider
   object with that name will be used) to be called if any exception is
   raised while processing a request generated by the rule.
   It receives a :class:`Twisted Failure <twisted.python.failure.Failure>`
   instance as first parameter.

   .. warning:: Because of its internal implementation, you must explicitly set
      callbacks for new requests when writing :class:`CrawlSpider`-based spiders;
      unexpected behaviour can occur otherwise.

CrawlSpider example
~~~~~~~~~~~~~~~~~~~

Let's now take a look at an example CrawlSpider with rules:

.. code-block:: python

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class MySpider(CrawlSpider):
        name = "example.com"
        allowed_domains = ["example.com"]
        start_urls = ["http://www.example.com"]

        rules = (
            # Extract links matching 'category.php' (but not matching 'subsection.php')
            # and follow links from them (since no callback means follow=True by default).
            Rule(LinkExtractor(allow=(r"category\.php",), deny=(r"subsection\.php",))),
            # Extract links matching 'item.php' and parse them with the spider's method parse_item
            Rule(LinkExtractor(allow=(r"item\.php",)), callback="parse_item"),
        )

        def parse_item(self, response):
            self.logger.info("Hi, this is an item page! %s", response.url)
            item = {}
            item["id"] = response.xpath('//td[@id="item_id"]/text()').re(r"ID: (\d+)")
            item["name"] = response.xpath('//td[@id="item_name"]/text()').get()
            item["description"] = response.xpath(
                '//td[@id="item_description"]/text()'
            ).get()
            item["link_text"] = response.meta["link_text"]
            url = response.xpath('//td[@id="additional_data"]/@href').get()
            return response.follow(
                url, self.parse_additional_page, cb_kwargs=dict(item=item)
            )

        def parse_additional_page(self, response, item):
            item["additional_data"] = response.xpath(
                '//p[@id="additional_data"]/text()'
            ).get()
            return item

This spider would start crawling example.com's home page, collecting category
links and item links, and parsing the latter with the ``parse_item`` method. For
each item response, some data will be extracted from the HTML using XPath, and
an item will be filled with it.
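
The ``process_links`` and ``process_request`` parameters of :class:`Rule` are
not used above. Here is a minimal sketch of how they might be used (the helper
method names and the filtering condition are illustrative):

.. code-block:: python

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class FilteringSpider(CrawlSpider):
        name = "filtering.example.com"
        allowed_domains = ["example.com"]
        start_urls = ["http://www.example.com"]

        rules = (
            Rule(
                LinkExtractor(allow=(r"item\.php",)),
                callback="parse_item",
                process_links="drop_archived",
                process_request="tag_request",
            ),
        )

        def drop_archived(self, links):
            # filter out links pointing at archived items
            return [link for link in links if "archived" not in link.url]

        def tag_request(self, request, response):
            # record the page that produced the request; return None to drop it
            request.meta["origin_url"] = response.url
            return request

        def parse_item(self, response):
            yield {"url": response.url, "origin": response.meta.get("origin_url")}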

XMLFeedSpider
-------------

.. class:: XMLFeedSpider

    XMLFeedSpider is designed for parsing XML feeds by iterating through them by a
    certain node name.  The iterator can be chosen from: ``iternodes``, ``xml``,
    and ``html``.  It's recommended to use the ``iternodes`` iterator for
    performance reasons, since the ``xml`` and ``html`` iterators generate the
    whole DOM at once in order to parse it.  However, using ``html`` as the
    iterator may be useful when parsing XML with bad markup.

    To set the iterator and the tag name, you must define the following class
    attributes:

    .. attribute:: iterator

        A string which defines the iterator to use. It can be either:

           - ``'iternodes'`` - a fast iterator based on regular expressions

           - ``'html'`` - an iterator which uses :class:`~scrapy.Selector`.
             Keep in mind this uses DOM parsing and must load the whole DOM in
             memory, which could be a problem for big feeds

           - ``'xml'`` - an iterator which uses :class:`~scrapy.Selector`.
             Keep in mind this uses DOM parsing and must load the whole DOM in
             memory, which could be a problem for big feeds

        It defaults to ``'iternodes'``.

    .. attribute:: itertag

        A string with the name of the node (or element) to iterate in. Example::

            itertag = 'product'

    .. attribute:: namespaces

        A list of ``(prefix, uri)`` tuples which define the namespaces
        available in that document that will be processed with this spider. The
        ``prefix`` and ``uri`` will be used to automatically register
        namespaces using the
        :meth:`~scrapy.Selector.register_namespace` method.

        You can then specify nodes with namespaces in the :attr:`itertag`
        attribute.

        Example::

            class YourSpider(XMLFeedSpider):

                namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
                itertag = 'n:url'
                # ...

    Apart from these new attributes, this spider has the following overridable
    methods too:

    .. method:: adapt_response(response)

        A method that receives the response as soon as it arrives from the spider
        middleware, before the spider starts parsing it. It can be used to modify
        the response body before parsing it. This method receives a response and
        also returns a response (it could be the same or another one).

    .. method:: parse_node(response, selector)

        This method is called for the nodes matching the provided tag name
        (``itertag``).  Receives the response and an
        :class:`~scrapy.Selector` for each node.  Overriding this
        method is mandatory. Otherwise, your spider won't work.  This method
        must return an :ref:`item object <topics-items>`, a
        :class:`~scrapy.Request` object, or an iterable containing any of
        them.

    .. method:: process_results(response, results)

        This method is called for each result (item or request) returned by the
        spider, and it's intended to perform any last-minute processing required
        before returning the results to the framework core, for example setting the
        item IDs. It receives a list of results and the response which originated
        those results. It must return a list of results (items or requests).

    .. warning:: Because of its internal implementation, you must explicitly set
       callbacks for new requests when writing :class:`XMLFeedSpider`-based spiders;
       unexpected behaviour can occur otherwise.

XMLFeedSpider example
~~~~~~~~~~~~~~~~~~~~~

These spiders are pretty easy to use, let's have a look at one example:

.. skip: next
.. code-block:: python

    from scrapy.spiders import XMLFeedSpider
    from myproject.items import TestItem

    class MySpider(XMLFeedSpider):
        name = "example.com"
        allowed_domains = ["example.com"]
        start_urls = ["http://www.example.com/feed.xml"]
        iterator = "iternodes"  # This is actually unnecessary, since it's the default value
        itertag = "item"

        def parse_node(self, response, node):
            self.logger.info(
                "Hi, this is a <%s> node!: %s", self.itertag, "".join(node.getall())
            )

            item = TestItem()
            item["id"] = node.xpath("@id").get()
            item["name"] = node.xpath("name").get()
            item["description"] = node.xpath("description").get()
            return item

Basically what we did up there was create a spider that downloads a feed from
the given ``start_urls``, iterates through each of its ``item`` tags, logs
them, and stores some data in an :class:`~scrapy.Item`.
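
The ``adapt_response`` and ``process_results`` hooks are not used above. Here
is a minimal sketch of what overriding them might look like (the byte cleanup
and the extra ``feed_url`` field are illustrative):

.. code-block:: python

    from scrapy.spiders import XMLFeedSpider

    class CleaningFeedSpider(XMLFeedSpider):
        name = "cleaningfeed"
        start_urls = ["http://www.example.com/feed.xml"]
        itertag = "item"

        def adapt_response(self, response):
            # drop a stray UTF-8 BOM before the feed is parsed
            return response.replace(body=response.body.lstrip(b"\xef\xbb\xbf"))

        def parse_node(self, response, node):
            return {"name": node.xpath("name/text()").get()}

        def process_results(self, response, results):
            # stamp each result with the feed URL before it reaches the core
            processed = []
            for result in results:
                result["feed_url"] = response.url
                processed.append(result)
            return processed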

CSVFeedSpider
-------------

.. class:: CSVFeedSpider

   This spider is very similar to the XMLFeedSpider, except that it iterates
   over rows, instead of nodes. The method that gets called in each iteration
   is :meth:`parse_row`.

   .. attribute:: delimiter

       A string with the separator character for each field in the CSV file.
       Defaults to ``','`` (comma).

   .. attribute:: quotechar

       A string with the enclosure character for each field in the CSV file.
       Defaults to ``'"'`` (quotation mark).

   .. attribute:: headers

       A list of the column names in the CSV file.

   .. method:: parse_row(response, row)

       Receives a response and a dict (representing each row) with a key for each
       provided (or detected) header of the CSV file.  This spider also gives the
       opportunity to override ``adapt_response`` and ``process_results`` methods
       for pre- and post-processing purposes.

CSVFeedSpider example
~~~~~~~~~~~~~~~~~~~~~

Let's see an example similar to the previous one, but using a
:class:`CSVFeedSpider`:

.. skip: next
.. code-block:: python

    from scrapy.spiders import CSVFeedSpider
    from myproject.items import TestItem

    class MySpider(CSVFeedSpider):
        name = "example.com"
        allowed_domains = ["example.com"]
        start_urls = ["http://www.example.com/feed.csv"]
        delimiter = ";"
        quotechar = "'"
        headers = ["id", "name", "description"]

        def parse_row(self, response, row):
            self.logger.info("Hi, this is a row!: %r", row)

            item = TestItem()
            item["id"] = row["id"]
            item["name"] = row["name"]
            item["description"] = row["description"]
            return item

SitemapSpider
-------------

.. class:: SitemapSpider

    SitemapSpider allows you to crawl a site by discovering the URLs using
    `Sitemaps`_.

    It supports nested sitemaps and discovering sitemap urls from
    `robots.txt`_.

    .. attribute:: sitemap_urls

        A list of urls pointing to the sitemaps whose urls you want to crawl.

        You can also point to a `robots.txt`_ and it will be parsed to extract
        sitemap urls from it.

    .. attribute:: sitemap_rules

        A list of tuples ``(regex, callback)`` where:

        * ``regex`` is a regular expression to match urls extracted from sitemaps.
          ``regex`` can be either a str or a compiled regex object.

        * ``callback`` is the callback to use for processing the urls that match
          the regular expression. ``callback`` can be a string (indicating the
          name of a spider method) or a callable.

        For example::

            sitemap_rules = [('/product/', 'parse_product')]

        Rules are applied in order, and only the first one that matches will be
        used.

        If you omit this attribute, all urls found in sitemaps will be
        processed with the ``parse`` callback.

    .. attribute:: sitemap_follow

        A list of regexes matching the sitemaps that should be followed. This is
        only useful for sites that use `Sitemap index files`_ that point to other
        sitemap files.

        By default, all sitemaps are followed.

    .. attribute:: sitemap_alternate_links

        Specifies if alternate links for one ``url`` should be followed. These
        are links for the same website in another language passed within
        the same ``url`` block.

        For example::

            <url>
                <loc>http://example.com/</loc>
                <xhtml:link rel="alternate" hreflang="de" href="http://example.com/de"/>
            </url>

        With ``sitemap_alternate_links`` set, this would retrieve both URLs. With
        ``sitemap_alternate_links`` disabled, only ``http://example.com/`` would be
        retrieved.

        ``sitemap_alternate_links`` is disabled by default.

    .. method:: sitemap_filter(entries)

        This is a filter function that can be overridden to select sitemap entries
        based on their attributes.

        For example::

            <url>
                <loc>http://example.com/</loc>
                <lastmod>2005-01-01</lastmod>
            </url>

        We can define a ``sitemap_filter`` function to filter ``entries`` by date:

        .. code-block:: python

            from datetime import datetime
            from scrapy.spiders import SitemapSpider

            class FilteredSitemapSpider(SitemapSpider):
                name = "filtered_sitemap_spider"
                allowed_domains = ["example.com"]
                sitemap_urls = ["http://example.com/sitemap.xml"]

                def sitemap_filter(self, entries):
                    for entry in entries:
                        date_time = datetime.strptime(entry["lastmod"], "%Y-%m-%d")
                        if date_time.year >= 2005:
                            yield entry

        This would retrieve only ``entries`` modified in 2005 and the following
        years.

        Entries are dict objects extracted from the sitemap document.
        Usually, the key is the tag name and the value is the text inside it.

        It's important to notice that:

        - as the ``loc`` element is required, entries without it are discarded
        - alternate links are stored in a list with the key ``alternate``
          (see ``sitemap_alternate_links``)
        - namespaces are removed, so lxml tags named as ``{namespace}tagname`` become only ``tagname``

        If you omit this method, all entries found in sitemaps will be
        processed, observing other attributes and their settings.

SitemapSpider examples
~~~~~~~~~~~~~~~~~~~~~~

Simplest example: process all urls discovered through sitemaps using the
``parse`` callback:

.. code-block:: python

    from scrapy.spiders import SitemapSpider

    class MySpider(SitemapSpider):
        sitemap_urls = ["http://www.example.com/sitemap.xml"]

        def parse(self, response):
            pass  # ... scrape item here ...

Process some urls with certain callback and other urls with a different
callback:

.. code-block:: python

    from scrapy.spiders import SitemapSpider

    class MySpider(SitemapSpider):
        sitemap_urls = ["http://www.example.com/sitemap.xml"]
        sitemap_rules = [
            ("/product/", "parse_product"),
            ("/category/", "parse_category"),
        ]

        def parse_product(self, response):
            pass  # ... scrape product ...

        def parse_category(self, response):
            pass  # ... scrape category ...

Follow sitemaps defined in the `robots.txt`_ file and only follow sitemaps
whose url contains ``/sitemap_shop``:

.. code-block:: python

    from scrapy.spiders import SitemapSpider

    class MySpider(SitemapSpider):
        sitemap_urls = ["http://www.example.com/robots.txt"]
        sitemap_rules = [
            ("/shop/", "parse_shop"),
        ]
        sitemap_follow = ["/sitemap_shops"]

        def parse_shop(self, response):
            pass  # ... scrape shop here ...

Combine SitemapSpider with other sources of urls:

.. code-block:: python

    from scrapy import Request
    from scrapy.spiders import SitemapSpider

    class MySpider(SitemapSpider):
        sitemap_urls = ["http://www.example.com/robots.txt"]
        sitemap_rules = [
            ("/shop/", "parse_shop"),
        ]

        other_urls = ["http://www.example.com/about"]

        async def start(self):
            async for item_or_request in super().start():
                yield item_or_request
            for url in self.other_urls:
                yield Request(url, self.parse_other)

        def parse_shop(self, response):
            pass  # ... scrape shop here ...

        def parse_other(self, response):
            pass  # ... scrape other here ...

.. _scrapy-spider-metadata: https://scrapy-spider-metadata.readthedocs.io/en/latest/params.html
.. _Sitemaps: https://www.sitemaps.org/index.html
.. _Sitemap index files: https://www.sitemaps.org/protocol.html#index
.. _robots.txt: https://www.robotstxt.org/
.. _TLD: https://en.wikipedia.org/wiki/Top-level_domain
.. _Scrapyd documentation: https://scrapyd.readthedocs.io/en/latest/


.. _topics-stats:

================
Stats Collection
================

Scrapy provides a convenient facility for collecting stats in the form of
key/value pairs, where values are often counters. The facility is called the Stats
Collector, and can be accessed through the :attr:`~scrapy.crawler.Crawler.stats`
attribute of the :ref:`topics-api-crawler`, as illustrated by the examples in
the :ref:`topics-stats-usecases` section below.

However, the Stats Collector is always available, so you can always import it
in your module and use its API (to increment or set new stat keys), regardless
of whether the stats collection is enabled or not. If it's disabled, the API
will still work but it won't collect anything. This is aimed at simplifying
stats collector usage: collecting stats should take no more than one line of
code in your spider, Scrapy extension, or whatever code you're using the
Stats Collector from.

Another feature of the Stats Collector is that it's very efficient (when
enabled) and extremely efficient (almost unnoticeable) when disabled.

The Stats Collector keeps a stats table per open spider which is automatically
opened when the spider is opened, and closed when the spider is closed.

.. _topics-stats-usecases:

Common Stats Collector uses
===========================

Access the stats collector through the :attr:`~scrapy.crawler.Crawler.stats`
attribute. Here is an example of an extension that accesses stats:

.. code-block:: python

    class ExtensionThatAccessStats:
        def __init__(self, stats):
            self.stats = stats

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.stats)
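
From a spider, the same object can be reached through ``self.crawler.stats``.
For example, a minimal sketch (the spider name, URL and stat key are
illustrative):

.. code-block:: python

    import scrapy

    class StatsAwareSpider(scrapy.Spider):
        name = "stats_aware"
        start_urls = ["http://www.example.com"]

        def parse(self, response):
            # count pages as they are parsed
            self.crawler.stats.inc_value("pages_parsed")
            yield {"url": response.url}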

.. skip: start

Set stat value:

.. code-block:: python

    stats.set_value("hostname", socket.gethostname())

Increment stat value:

.. code-block:: python

    stats.inc_value("custom_count")

Set stat value only if greater than previous:

.. code-block:: python

    stats.max_value("max_items_scraped", value)

Set stat value only if lower than previous:

.. code-block:: python

    stats.min_value("min_free_memory_percent", value)

Get stat value:

.. code-block:: pycon

    >>> stats.get_value("custom_count")
    1

Get all stats:

.. code-block:: pycon

    >>> stats.get_stats()
    {'custom_count': 1, 'start_time': datetime.datetime(2009, 7, 14, 21, 47, 28, 977139)}

.. skip: end

Available Stats Collectors
==========================

Besides the basic :class:`StatsCollector` there are other Stats Collectors
available in Scrapy which extend the basic Stats Collector. You can select
which Stats Collector to use through the :setting:`STATS_CLASS` setting. The
default Stats Collector used is the :class:`MemoryStatsCollector`.

.. currentmodule:: scrapy.statscollectors

MemoryStatsCollector
--------------------

.. class:: MemoryStatsCollector

    A simple stats collector that keeps the stats of the last scraping run (for
    each spider) in memory, after the spider is closed. The stats can be accessed
    through the :attr:`spider_stats` attribute, which is a dict keyed by spider
    name.

    This is the default Stats Collector used in Scrapy.

    .. attribute:: spider_stats

       A dict of dicts (keyed by spider name) containing the stats of the last
       scraping run for each spider.

DummyStatsCollector
-------------------

.. class:: DummyStatsCollector

    A Stats collector which does nothing but is very efficient (because it does
    nothing). This stats collector can be set via the :setting:`STATS_CLASS`
    setting, to disable stats collection in order to improve performance. However,
    the performance penalty of stats collection is usually marginal compared to
    other Scrapy workload like parsing pages.


.. currentmodule:: scrapy.extensions.telnet

.. _topics-telnetconsole:

==============
Telnet Console
==============

Scrapy comes with a built-in telnet console for inspecting and controlling a
Scrapy running process. The telnet console is just a regular Python shell
running inside the Scrapy process, so you can do literally anything from it.

The telnet console is a :ref:`built-in Scrapy extension
<topics-extensions-ref>` which comes enabled by default, but you can also
disable it if you want. For more information about the extension itself see
:ref:`topics-extensions-ref-telnetconsole`.

.. warning::
    It is not secure to use telnet console via public networks, as telnet
    doesn't provide any transport-layer security. Having username/password
    authentication doesn't change that.

    Intended usage is connecting to a running Scrapy spider locally
    (spider process and telnet client are on the same machine)
    or over a secure connection (VPN, SSH tunnel).
    Please avoid using telnet console over insecure connections,
    or disable it completely using the :setting:`TELNETCONSOLE_ENABLED` setting.

.. note::
    This feature is not supported when :setting:`TWISTED_REACTOR_ENABLED` is ``False``.

.. highlight:: none

How to access the telnet console
================================

The telnet console listens on the TCP port defined in the
:setting:`TELNETCONSOLE_PORT` setting, which defaults to ``6023``. To access
the console you need to type::

    telnet localhost 6023
    Trying localhost...
    Connected to localhost.
    Escape character is '^]'.
    Username:
    Password:
    >>>

By default, the username is ``scrapy`` and the password is autogenerated. The
autogenerated password can be seen in the Scrapy log, as in the example below::

    2018-10-16 14:35:21 [scrapy.extensions.telnet] INFO: Telnet Password: 16f92501e8a59326

The default username and password can be overridden through the
:setting:`TELNETCONSOLE_USERNAME` and :setting:`TELNETCONSOLE_PASSWORD` settings.

.. warning::
    Username and password provide only limited protection, as telnet
    does not use secure transport: by default, traffic is not encrypted
    even if username and password are set.

You need a telnet client, which comes installed by default on most Linux
distros and is available as an optional feature on Windows.

.. _telnet-vars:

Available variables in the telnet console
=========================================

The telnet console is like a regular Python shell running inside the Scrapy
process, so you can do anything from it including importing new modules, etc.

However, the telnet console comes with some default variables defined for
convenience:

+----------------+-------------------------------------------------------------------+
| Shortcut       | Description                                                       |
+================+===================================================================+
| ``crawler``    | the Scrapy Crawler (:class:`scrapy.crawler.Crawler` object)       |
+----------------+-------------------------------------------------------------------+
| ``engine``     | Crawler.engine attribute                                          |
+----------------+-------------------------------------------------------------------+
| ``spider``     | the active spider                                                 |
+----------------+-------------------------------------------------------------------+
| ``extensions`` | the Extension Manager (Crawler.extensions attribute)              |
+----------------+-------------------------------------------------------------------+
| ``stats``      | the Stats Collector (Crawler.stats attribute)                     |
+----------------+-------------------------------------------------------------------+
| ``settings``   | the Scrapy settings object (Crawler.settings attribute)           |
+----------------+-------------------------------------------------------------------+
| ``est``        | print a report of the engine status                               |
+----------------+-------------------------------------------------------------------+
| ``prefs``      | for memory debugging (see :ref:`topics-leaks`)                    |
+----------------+-------------------------------------------------------------------+
| ``p``          | a shortcut to the :func:`pprint.pprint` function                  |
+----------------+-------------------------------------------------------------------+
| ``hpy``        | for memory debugging (see :ref:`topics-leaks`)                    |
+----------------+-------------------------------------------------------------------+

Telnet console usage examples
=============================

.. skip: start

Here are some example tasks you can do with the telnet console:

View engine status
------------------

You can use the ``est()`` shortcut to quickly show the state of the Scrapy
engine from the telnet console::

    telnet localhost 6023
    >>> est()
    Execution engine status

    time()-engine.start_time                        : 8.62972998619
    len(engine.downloader.active)                   : 16
    engine.scraper.is_idle()                        : False
    engine.spider.name                              : followall
    engine.spider_is_idle()                         : False
    engine._slot.closing                            : False
    len(engine._slot.inprogress)                    : 16
    len(engine._slot.scheduler.dqs or [])           : 0
    len(engine._slot.scheduler.mqs)                 : 92
    len(engine.scraper.slot.queue)                  : 0
    len(engine.scraper.slot.active)                 : 0
    engine.scraper.slot.active_size                 : 0
    engine.scraper.slot.itemproc_size               : 0
    engine.scraper.slot.needs_backout()             : False

Pause, resume and stop the Scrapy engine
----------------------------------------

To pause::

    telnet localhost 6023
    >>> engine.pause()
    >>>

To resume::

    telnet localhost 6023
    >>> engine.unpause()
    >>>

To stop::

    telnet localhost 6023
    >>> engine.stop()
    Connection closed by foreign host.

.. skip: end

Telnet Console signals
======================

.. signal:: update_telnet_vars
.. function:: update_telnet_vars(telnet_vars)

    Sent just before the telnet console is opened. You can hook up to this
    signal to add, remove or update the variables that will be available in the
    telnet local namespace. In order to do that, you need to update the
    ``telnet_vars`` dict in your handler.

    :param telnet_vars: the dict of telnet variables
    :type telnet_vars: dict
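
    For example, a minimal extension sketch that adds a variable to the telnet
    namespace (the extension and variable names are illustrative; remember to
    enable the extension through the :setting:`EXTENSIONS` setting):

    .. code-block:: python

        from scrapy.extensions.telnet import update_telnet_vars

        class TelnetVarsExtension:
            @classmethod
            def from_crawler(cls, crawler):
                ext = cls()
                crawler.signals.connect(ext.add_vars, signal=update_telnet_vars)
                return ext

            def add_vars(self, telnet_vars):
                # expose an extra shortcut in the telnet namespace
                telnet_vars["answer"] = 42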

Telnet settings
===============

These are the settings that control the telnet console's behaviour:

.. setting:: TELNETCONSOLE_PORT

TELNETCONSOLE_PORT
------------------

Default: ``[6023, 6073]``

The port range to use for the telnet console. If set to ``None``, a dynamically
assigned port is used.

.. setting:: TELNETCONSOLE_HOST

TELNETCONSOLE_HOST
------------------

Default: ``'127.0.0.1'``

The interface the telnet console should listen on.

.. setting:: TELNETCONSOLE_USERNAME

TELNETCONSOLE_USERNAME
----------------------

Default: ``'scrapy'``

The username used for the telnet console.

.. setting:: TELNETCONSOLE_PASSWORD

TELNETCONSOLE_PASSWORD
----------------------

Default: ``None``

The password used for the telnet console. The default behaviour is to have it
autogenerated.


.. _faq:

Frequently Asked Questions
==========================

.. _faq-scrapy-bs-cmp:

How does Scrapy compare to BeautifulSoup or lxml?
-------------------------------------------------

`BeautifulSoup`_ and `lxml`_ are libraries for parsing HTML and XML. Scrapy is
an application framework for writing web spiders that crawl web sites and
extract data from them.

Scrapy provides a built-in mechanism for extracting data (called
:ref:`selectors <topics-selectors>`) but you can easily use `BeautifulSoup`_
(or `lxml`_) instead, if you feel more comfortable working with them. After
all, they're just parsing libraries which can be imported and used from any
Python code.

In other words, comparing `BeautifulSoup`_ (or `lxml`_) to Scrapy is like
comparing `jinja2`_ to `Django`_.

.. _BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
.. _lxml: https://lxml.de/
.. _jinja2: https://palletsprojects.com/projects/jinja/
.. _Django: https://www.djangoproject.com/

Can I use Scrapy with BeautifulSoup?
------------------------------------

Yes, you can.
As mentioned :ref:`above <faq-scrapy-bs-cmp>`, `BeautifulSoup`_ can be used
for parsing HTML responses in Scrapy callbacks.
You just have to feed the response's body into a ``BeautifulSoup`` object
and extract whatever data you need from it.

Here's an example spider using BeautifulSoup API, with ``lxml`` as the HTML parser:

.. skip: next
.. code-block:: python

    from bs4 import BeautifulSoup
    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ["example.com"]
        start_urls = ("http://www.example.com/",)

        def parse(self, response):
            # use lxml to get decent HTML parsing speed
            soup = BeautifulSoup(response.text, "lxml")
            yield {"url": response.url, "title": soup.h1.string}

.. note::

    ``BeautifulSoup`` supports several HTML/XML parsers.
    See `BeautifulSoup's official documentation`_ on which ones are available.

.. _BeautifulSoup's official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use

Did Scrapy "steal" X from Django?
---------------------------------

Probably, but we don't like that word. We think Django_ is a great open source
project and an example to follow, so we've used it as an inspiration for
Scrapy.

We believe that, if something is already done well, there's no need to reinvent
it. This concept, besides being one of the foundations for open source and free
software, not only applies to software but also to documentation, procedures,
policies, etc. So, instead of going through each problem ourselves, we choose
to copy ideas from those projects that have already solved them properly, and
focus on the real problems we need to solve.

We'd be proud if Scrapy serves as an inspiration for other projects. Feel free
to steal from us!

Does Scrapy work with HTTP proxies?
-----------------------------------

Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP
Proxy downloader middleware. See
:class:`~scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware`.

How can I scrape an item with attributes in different pages?
------------------------------------------------------------

See :ref:`topics-request-response-ref-request-callback-arguments`.

How can I simulate a user login in my spider?
---------------------------------------------

See :ref:`topics-request-response-ref-request-userlogin`.

.. _faq-bfo-dfo:

Does Scrapy crawl in breadth-first or depth-first order?
--------------------------------------------------------

:ref:`DFO by default, but other orders are possible <request-order>`.

My Scrapy crawler has memory leaks. What can I do?
--------------------------------------------------

See :ref:`topics-leaks`.

Also, Python has a builtin memory leak issue which is described in
:ref:`topics-leaks-without-leaks`.

How can I make Scrapy consume less memory?
------------------------------------------

See previous question.

How can I prevent memory errors due to many allowed domains?
------------------------------------------------------------

If you have a spider with a long list of :attr:`~scrapy.Spider.allowed_domains`
(e.g. 50,000+), consider replacing the default
:class:`~scrapy.downloadermiddlewares.offsite.OffsiteMiddleware` downloader
middleware with a :ref:`custom downloader middleware
<topics-downloader-middleware-custom>` that requires less memory. For example:

-   If your domain names are similar enough, use your own regular expression
    instead of joining the strings in :attr:`~scrapy.Spider.allowed_domains` into
    a complex regular expression, as in the sketch after the note below.

-   If you can meet the installation requirements, use pyre2_ instead of
    Python’s re_ to compile your URL-filtering regular expression. See
    :issue:`1908`.

See also `other suggestions at StackOverflow
<https://stackoverflow.com/q/36440681>`__.

.. note:: Remember to disable
   :class:`scrapy.downloadermiddlewares.offsite.OffsiteMiddleware` when you
   enable your custom implementation:

   .. code-block:: python

       DOWNLOADER_MIDDLEWARES = {
           "scrapy.downloadermiddlewares.offsite.OffsiteMiddleware": None,
           "myproject.middlewares.CustomOffsiteMiddleware": 50,
       }
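
A minimal sketch of such a regex-based middleware (the URL pattern and class
name are illustrative; depending on your Scrapy version, middleware methods may
also receive a ``spider`` argument):

.. code-block:: python

    import re

    from scrapy.exceptions import IgnoreRequest

    class CustomOffsiteMiddleware:
        # one compact pattern instead of thousands of joined domain strings
        allowed = re.compile(r"^https?://shop\d+\.example\.com/")

        def process_request(self, request):
            if not self.allowed.match(request.url):
                raise IgnoreRequest(f"Filtered offsite request: {request.url}")
            return None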

.. _pyre2: https://github.com/andreasvc/pyre2
.. _re: https://docs.python.org/3/library/re.html

Can I use Basic HTTP Authentication in my spiders?
--------------------------------------------------

Yes, see :class:`~scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware`.

Why does Scrapy download pages in English instead of my native language?
------------------------------------------------------------------------

Try changing the default `Accept-Language`_ request header by overriding the
:setting:`DEFAULT_REQUEST_HEADERS` setting.
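
For example, in your project's ``settings.py`` (the language code is
illustrative):

.. code-block:: python

    DEFAULT_REQUEST_HEADERS = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "es",
    }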

.. _Accept-Language: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.4

Where can I find some example Scrapy projects?
----------------------------------------------

See :ref:`intro-examples`.

Can I run a spider without creating a project?
----------------------------------------------

Yes. You can use the :command:`runspider` command. For example, if you have a
spider written in a ``my_spider.py`` file you can run it with::

    scrapy runspider my_spider.py

See :command:`runspider` command for more info.

I get "Filtered offsite request" messages. How can I fix them?
--------------------------------------------------------------

Those messages (logged with ``DEBUG`` level) don't necessarily mean there is a
problem, so you may not need to fix them.

Those messages are logged by
:class:`~scrapy.downloadermiddlewares.offsite.OffsiteMiddleware`, which is a
downloader middleware (enabled by default) whose purpose is to filter out
requests to domains outside the ones covered by the spider.

What is the recommended way to deploy a Scrapy crawler in production?
---------------------------------------------------------------------

See :ref:`topics-deploy`.

Can I use JSON for large exports?
---------------------------------

It'll depend on how large your output is. See :ref:`this warning
<json-with-large-data>` in :class:`~scrapy.exporters.JsonItemExporter`
documentation.

Can I return (Twisted) deferreds from signal handlers?
------------------------------------------------------

Some signals support returning deferreds from their handlers, others don't. See
the :ref:`topics-signals-ref` to know which ones.

What does the response status code 999 mean?
--------------------------------------------

999 is a custom response status code used by Yahoo sites to throttle requests.
Try slowing down the crawling speed by using a download delay of ``2`` (or
higher) in your spider:

.. code-block:: python

    from scrapy.spiders import CrawlSpider

    class MySpider(CrawlSpider):
        name = "myspider"

        download_delay = 2

        # [ ... rest of the spider code ... ]

Or by setting a global download delay in your project with the
:setting:`DOWNLOAD_DELAY` setting.

Can I call ``pdb.set_trace()`` from my spiders to debug them?
-------------------------------------------------------------

Yes, but you can also use the Scrapy shell which allows you to quickly analyze
(and even modify) the response being processed by your spider, which is, quite
often, more useful than plain old ``pdb.set_trace()``.

For more info see :ref:`topics-shell-inspect-response`.

Simplest way to dump all my scraped items into a JSON/CSV/XML file?
-------------------------------------------------------------------

To dump into a JSON file::

    scrapy crawl myspider -O items.json

To dump into a CSV file::

    scrapy crawl myspider -O items.csv

To dump into an XML file::

    scrapy crawl myspider -O items.xml

For more information, see :ref:`topics-feed-exports`.

What's this huge cryptic ``__VIEWSTATE`` parameter used in some forms?
----------------------------------------------------------------------

The ``__VIEWSTATE`` parameter is used in sites built with ASP.NET/VB.NET. For
more info on how it works see `this page`_. Also, here's an `example spider`_
which scrapes one of these sites.

.. _this page: https://metacpan.org/release/ECARROLL/HTML-TreeBuilderX-ASP_NET-0.09/view/lib/HTML/TreeBuilderX/ASP_NET.pm
.. _example spider: https://github.com/AmbientLighter/rpn-fas/blob/master/fas/spiders/rnp.py

What's the best way to parse big XML/CSV data feeds?
----------------------------------------------------

Parsing big feeds with XPath selectors can be problematic since they need to
build the DOM of the entire feed in memory, and this can be quite slow and
consume a lot of memory.

To avoid parsing the entire feed at once in memory, you can use
the :func:`~scrapy.utils.iterators.xmliter_lxml` and
:func:`~scrapy.utils.iterators.csviter` functions. In fact, this is what
:class:`~scrapy.spiders.XMLFeedSpider` uses.

.. autofunction:: scrapy.utils.iterators.xmliter_lxml

.. autofunction:: scrapy.utils.iterators.csviter
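
For example, a minimal sketch of using
:func:`~scrapy.utils.iterators.xmliter_lxml` in a spider callback (the node
name and fields are illustrative):

.. code-block:: python

    import scrapy
    from scrapy.utils.iterators import xmliter_lxml

    class ProductFeedSpider(scrapy.Spider):
        name = "productfeed"
        start_urls = ["http://www.example.com/feed.xml"]

        def parse(self, response):
            # iterate <product> nodes without building the DOM of the whole feed
            for product in xmliter_lxml(response, "product"):
                yield {"name": product.xpath("name/text()").get()}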

Does Scrapy manage cookies automatically?
-----------------------------------------

Yes, Scrapy receives and keeps track of cookies sent by servers, and sends them
back on subsequent requests, like any regular web browser does.

For more info see :ref:`topics-request-response` and :ref:`cookies-mw`.

How can I see the cookies being sent and received from Scrapy?
--------------------------------------------------------------

Enable the :setting:`COOKIES_DEBUG` setting.

How can I instruct a spider to stop itself?
-------------------------------------------

Raise the :exc:`~scrapy.exceptions.CloseSpider` exception from a callback. For
more info, see :exc:`~scrapy.exceptions.CloseSpider`.
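
For example, a minimal sketch (the stop condition is illustrative):

.. code-block:: python

    import scrapy
    from scrapy.exceptions import CloseSpider

    class StoppingSpider(scrapy.Spider):
        name = "stopping"
        start_urls = ["http://www.example.com"]

        def parse(self, response):
            if b"Access denied" in response.body:
                # stop the whole spider, not just this callback
                raise CloseSpider("access_denied")
            yield {"url": response.url}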

How can I prevent my Scrapy bot from getting banned?
----------------------------------------------------

See :ref:`bans`.

Should I use spider arguments or settings to configure my spider?
-----------------------------------------------------------------

Both :ref:`spider arguments <spiderargs>` and :ref:`settings <topics-settings>`
can be used to configure your spider. There is no strict rule that mandates
using one or the other, but settings are more suited for parameters that, once
set, don't change much, while spider arguments are meant to change more often,
even on each spider run, and are sometimes required for the spider to run at all
(for example, to set the start URL of a spider).

To illustrate with an example, assume you have a spider that needs to log
into a site to scrape data, and you only want to scrape data from a certain
section of the site (which varies each time). In that case, the credentials to
log in would be settings, while the URL of the section to scrape would be a
spider argument.

I'm scraping an XML document and my XPath selector doesn't return any items
----------------------------------------------------------------------------

You may need to remove namespaces. See :ref:`removing-namespaces`.

.. _faq-split-item:

How to split an item into multiple items in an item pipeline?
-------------------------------------------------------------

:ref:`Item pipelines <topics-item-pipeline>` cannot yield multiple items per
input item. :ref:`Create a spider middleware <custom-spider-middleware>`
instead, and use its
:meth:`~scrapy.spidermiddlewares.SpiderMiddleware.process_spider_output`
method for this purpose. For example:

.. code-block:: python

    from copy import deepcopy

    from itemadapter import ItemAdapter
    from scrapy import Request

    class MultiplyItemsMiddleware:
        def process_spider_output(self, response, result):
            for item_or_request in result:
                if isinstance(item_or_request, Request):
                    # pass requests through unchanged
                    yield item_or_request
                    continue
                adapter = ItemAdapter(item_or_request)
                for _ in range(adapter["multiply_by"]):
                    yield deepcopy(item_or_request)

Does Scrapy support IPv6 addresses?
-----------------------------------

Yes, by setting :setting:`TWISTED_DNS_RESOLVER` to ``scrapy.resolver.CachingHostnameResolver``.
Note that by doing so, you lose the ability to set a specific timeout for DNS requests
(the value of the :setting:`DNS_TIMEOUT` setting is ignored).
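
For example, in your project's ``settings.py``:

.. code-block:: python

    TWISTED_DNS_RESOLVER = "scrapy.resolver.CachingHostnameResolver"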

.. _faq-specific-reactor:

How to deal with ``<class 'ValueError'>: filedescriptor out of range in select()`` exceptions?
----------------------------------------------------------------------------------------------

This issue `has been reported`_ to appear when running broad crawls on macOS, where the default
Twisted reactor is :class:`twisted.internet.selectreactor.SelectReactor`. Switching to a
different reactor is possible by using the :setting:`TWISTED_REACTOR` setting.

.. _faq-stop-response-download:

How can I cancel the download of a given response?
--------------------------------------------------

In some situations, it might be useful to stop the download of a certain response.
For instance, sometimes you can determine whether or not you need the full contents
of a response by inspecting its headers or the first bytes of its body. In that case,
you could save resources by attaching a handler to the :class:`~scrapy.signals.bytes_received`
or :class:`~scrapy.signals.headers_received` signals and raising a
:exc:`~scrapy.exceptions.StopDownload` exception. Please refer to the
:ref:`topics-stop-response-download` topic for additional information and examples.
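
For example, a minimal sketch of a spider that keeps response headers but
skips response bodies:

.. code-block:: python

    import scrapy
    from scrapy.exceptions import StopDownload

    class HeadersOnlySpider(scrapy.Spider):
        name = "headers_only"
        start_urls = ["https://www.example.com"]

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            crawler.signals.connect(
                spider.on_headers_received, signal=scrapy.signals.headers_received
            )
            return spider

        def on_headers_received(self, headers, body_length, request, spider):
            # stop the download; parse() still receives a body-less response
            raise StopDownload(fail=False)

        def parse(self, response):
            self.logger.info("Content-Type: %s", response.headers.get("Content-Type"))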

.. _faq-blank-request:

How can I make a blank request?
-------------------------------

.. code-block:: python

    from scrapy import Request

    blank_request = Request("data:,")

In this case, the URL uses the ``data:`` URI scheme. Data URLs allow you to
include data inline within web pages, as an alternative to referencing external
resources. The ``data:`` scheme with empty content (``,``) essentially creates
a request to a data URL without any specific content.

Running ``runspider`` I get ``error: No spider found in file: <filename>``
--------------------------------------------------------------------------

This may happen if your Scrapy project has a spider module with a name that
conflicts with the name of one of the `Python standard library modules`_, such
as ``csv.py`` or ``os.py``, or any `Python package`_ that you have installed.
See :issue:`2680`.

.. _has been reported: https://github.com/scrapy/scrapy/issues/2905
.. _Python standard library modules: https://docs.python.org/3/py-modindex.html
.. _Python package: https://pypi.org/


.. _news:

Release notes
=============

.. _release-2.15.0:

Scrapy 2.15.0 (2026-04-09)
--------------------------

Highlights:

-   Experimental support for running without a Twisted reactor

-   Experimental ``httpx``-based download handler

Backward-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-   The built-in HTTP :ref:`download handlers <download-handlers-ref>` now
    raise Scrapy-specific exceptions instead of implementation-specific ones,
    see :ref:`download-handlers-exceptions`. This can affect user code that
    handles downloader exceptions, such as ``process_exception()`` methods of
    custom :ref:`downloader middlewares <topics-downloader-middleware-custom>`.
    (:issue:`7208`)

-   In order to fix a long-standing bug with handling of asynchronous storages,
    the following changes were made to media pipeline classes, which can impact
    some of the user code that subclasses them or calls their methods directly:

    - overrides of :meth:`scrapy.pipelines.media.MediaPipeline.media_downloaded`
      and :meth:`~scrapy.pipelines.files.FilesPipeline.file_downloaded` can now
      return coroutines

    - :meth:`~scrapy.pipelines.files.FilesPipeline.media_downloaded`,
      :meth:`~scrapy.pipelines.files.FilesPipeline.file_downloaded` and
      :meth:`~scrapy.pipelines.images.ImagesPipeline.image_downloaded` now
      return coroutines

    (:issue:`2183`, :issue:`6369`, :issue:`7182`)

-   ``Request`` and ``Response`` objects: ``__slots__`` and setter changes:

    -   :class:`scrapy.http.Request` and :class:`scrapy.http.Response` now
        define ``__slots__``. Assigning arbitrary attributes to instances (for
        example, ``response.foo = 1``) will raise ``AttributeError``. Store
        per-request/response data in the request/response ``meta`` mapping
        instead of attaching new attributes to the objects.

    -   If you maintain custom ``Request`` or ``Response`` subclasses that
        relied on dynamic instance attributes, either add ``'__dict__'`` to
        your subclass ``__slots__`` to allow dynamic attributes, or migrate
        per-instance state to ``meta`` or explicit documented attributes.

    -   The setters for ``headers``, ``flags`` and ``cookies`` no longer coerce
        falsy values into ``None``. For example, ``request.headers = {}`` now
        stores an empty :class:`scrapy.http.headers.Headers` instance (not
        ``None``), and ``request.flags = []`` remains an empty list instead of
        being set to ``None``. Update code that relied on ``is None`` checks or
        the previous coercion behaviour.

    (:issue:`7036`, :issue:`7367`, :issue:`7374`)

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

-   The context factory class set as the value of the
    ``DOWNLOADER_CLIENTCONTEXTFACTORY`` setting is now required to support the
    ``method`` argument of ``__init__()``, recommended since Scrapy 1.2.0.
    (:issue:`7353`)

Deprecations
~~~~~~~~~~~~

-   ``scrapy.mail.MailSender`` is deprecated. Please use :mod:`smtplib`,
    :mod:`twisted.mail.smtp` or a third-party email library.
    (:issue:`7249`, :issue:`7263`)

-   The ``scrapy.extensions.statsmailer.StatsMailer`` extension is deprecated.
    You can instead implement your own notifications by handling the
    :signal:`spider_closed` signal.
    (:issue:`7249`, :issue:`7263`)

-   The ``MEMUSAGE_NOTIFY_MAIL`` setting is deprecated. You can instead
    implement your own notifications by handling the
    :signal:`memusage_warning_reached` and :signal:`spider_closed` signals.
    (:issue:`7249`, :issue:`7263`)

-   The ``DNS_RESOLVER`` setting was renamed to :setting:`TWISTED_DNS_RESOLVER`
    and the old name is deprecated.
    (:issue:`7350`, :issue:`7361`)

-   The ``DOWNLOADER_CLIENTCONTEXTFACTORY`` setting is deprecated. If you were
    using it to switch to
    ``scrapy.core.downloader.contextfactory.BrowserLikeContextFactory``, please
    use the new :setting:`DOWNLOAD_VERIFY_CERTIFICATES` setting instead. If you
    cannot use the default context factory for some other reason, please
    subclass the :ref:`download handler <download-handlers-ref>` instead.
    (:issue:`7352`, :issue:`7379`)

-   ``scrapy.core.downloader.contextfactory.BrowserLikeContextFactory`` is
    deprecated. You can set the new :setting:`DOWNLOAD_VERIFY_CERTIFICATES`
    setting to ``True`` instead.
    (:issue:`7379`)

-   The following implementation details of the context factory handling code
    are deprecated:

    - ``scrapy.core.downloader.contextfactory.AcceptableProtocolsContextFactory``

    - ``scrapy.core.downloader.contextfactory.load_context_factory_from_settings()``

    - ``scrapy.core.downloader.contextfactory.ScrapyClientContextFactory``

    - ``scrapy.core.downloader.tls.ScrapyClientTLSOptions``

    (:issue:`7353`, :issue:`7391`)

-   Passing :class:`str` instead of :class:`bytes` to
    :class:`scrapy.utils.sitemap.Sitemap` and
    :func:`scrapy.utils.sitemap.sitemap_urls_from_robots` is deprecated.
    (:issue:`7007`)

-   ``scrapy.utils.misc.walk_modules()`` is deprecated. You can use
    :func:`scrapy.utils.misc.walk_modules_iter` instead.
    (:issue:`7388`)

-   ``scrapy.shell.Shell.inthread`` is deprecated. You can use
    :attr:`scrapy.shell.Shell.fetch_available` instead to check if
    :func:`~scrapy.shell.Shell.fetch` can be used.
    (:issue:`7395`)

-   ``scrapy.commands.ScrapyCommand.set_crawler()`` is deprecated.
    (:issue:`7276`)

New features
~~~~~~~~~~~~

-   Added an *experimental* mode for running Scrapy without installing a
    Twisted reactor: set :setting:`TWISTED_REACTOR_ENABLED` to ``False`` to
    enable it. This mode has limitations; refer to :ref:`its documentation
    <asyncio-without-reactor>` for details. As long as it's experimental, its
    behavior and related features and APIs may change in future Scrapy releases
    in a breaking way.
    (:issue:`6219`,
    :issue:`7185`,
    :issue:`7186`,
    :issue:`7187`,
    :issue:`7188`,
    :issue:`7190`,
    :issue:`7197`,
    :issue:`7199`,
    :issue:`7209`,
    :issue:`7228`,
    :issue:`7355`,
    :issue:`7366`,
    :issue:`7385`,
    :issue:`7395`)

-   Added the :func:`scrapy.utils.reactorless.is_reactorless` function that
    checks if there is a running asyncio event loop but no Twisted reactor.
    (:issue:`7185`, :issue:`7199`)
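
    A minimal usage sketch:

    .. code-block:: python

        from scrapy.utils.reactorless import is_reactorless

        if is_reactorless():
            ...  # running under asyncio without a Twisted reactor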

-   Changed :func:`scrapy.utils.asyncio.is_asyncio_available` to return
    ``True`` if there is a running asyncio loop, even if no Twisted reactor is
    installed.
    (:issue:`7185`, :issue:`7199`)

-   Added an *experimental* download handler that uses the httpx_ library and
    doesn't require a Twisted reactor:
    :class:`~scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler`.
    While it remains experimental, its behavior may change in breaking ways in
    future Scrapy releases.
    (:issue:`6805`, :issue:`7239`, :issue:`7368`, :issue:`7384`)

    .. _httpx: https://www.python-httpx.org/
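
    A minimal configuration sketch, assuming the usual
    :setting:`DOWNLOAD_HANDLERS` mechanism:

    .. code-block:: python

        DOWNLOAD_HANDLERS = {
            "http": "scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler",
            "https": "scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler",
        }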

-   Added the :setting:`DOWNLOAD_BIND_ADDRESS` setting as a global counterpart
    to the per-request :reqmeta:`bindaddress` meta key.
    (:issue:`7266`, :issue:`7283`)
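
    For example, in ``settings.py`` (the address is hypothetical):

    .. code-block:: python

        DOWNLOAD_BIND_ADDRESS = "192.168.0.10"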

-   Added the :setting:`DOWNLOAD_VERIFY_CERTIFICATES` setting that can be set
    to ``True`` to make Scrapy abort HTTPS requests when the server certificate
    is invalid or doesn't match the domain.
    (:issue:`7379`)

-   The built-in HTTP :ref:`download handlers <download-handlers-ref>` now
    raise Scrapy-specific exceptions instead of implementation-specific ones,
    to allow unified handling of similar problems caused by different
    implementations. The default value of the :setting:`RETRY_EXCEPTIONS`
    setting was updated, replacing Twisted-specific exceptions with these new
    ones. The new exceptions are:

    - :exc:`~scrapy.exceptions.CannotResolveHostError`

    - :exc:`~scrapy.exceptions.DownloadCancelledError`

    - :exc:`~scrapy.exceptions.DownloadConnectionRefusedError`

    - :exc:`~scrapy.exceptions.DownloadFailedError`

    - :exc:`~scrapy.exceptions.DownloadTimeoutError`

    - :exc:`~scrapy.exceptions.ResponseDataLossError`

    - :exc:`~scrapy.exceptions.UnsupportedURLSchemeError`

    (:issue:`7208`)
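
    For example, a spider errback can now handle timeouts uniformly across
    download handlers (a minimal sketch):

    .. code-block:: python

        import scrapy
        from scrapy.exceptions import DownloadTimeoutError


        class MySpider(scrapy.Spider):
            name = "my_spider"

            async def start(self):
                yield scrapy.Request("https://example.com", errback=self.errback)

            def errback(self, failure):
                if failure.check(DownloadTimeoutError):
                    self.logger.info("Timed out: %s", failure.request.url)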

-   Added the :signal:`memusage_warning_reached` signal emitted by the
    :class:`~scrapy.extensions.memusage.MemoryUsage` extension when the memory
    usage reaches :setting:`MEMUSAGE_WARNING_MB`.
    (:issue:`7249`, :issue:`7263`)

-   Added
    :meth:`Headers.to_tuple_list() <scrapy.http.headers.Headers.to_tuple_list>`
    that returns headers as a list of ``(key, value)`` tuples.
    (:issue:`7239`)
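
    A minimal usage sketch (the exact output format is assumed):

    .. code-block:: python

        from scrapy.http.headers import Headers

        headers = Headers({"Accept": "text/html"})
        headers.to_tuple_list()  # e.g. [(b"Accept", b"text/html")]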

-   :class:`~scrapy.core.downloader.handlers.s3.S3DownloadHandler` now uses the
    download handler configured for the ``"https"`` scheme to make requests
    instead of always using
    :class:`~scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler`.
    (:issue:`7369`, :issue:`7370`)

-   Added :func:`scrapy.utils.misc.walk_modules_iter` as a replacement for
    ``scrapy.utils.misc.walk_modules()`` that returns an iterable instead of a
    list.
    (:issue:`7388`)

Improvements
~~~~~~~~~~~~

-   :func:`asyncio.to_thread` is now used instead of
    :func:`twisted.internet.threads.deferToThread` in the built-in feed
    storages, media pipeline storages and the
    :func:`scrapy.utils.decorators.inthread` decorator when available.
    (:issue:`7183`, :issue:`7184`, :issue:`7349`)

-   Improved memory footprint of :class:`~scrapy.Request` and
    :class:`~scrapy.http.Response` objects by adding ``__slots__`` and omitting
    empty lists and dicts in some internal attributes.
    (:issue:`7036`, :issue:`7367`, :issue:`7374`)

-   :class:`~scrapy.core.downloader.contextfactory._ScrapyClientContextFactory`
    no longer mutates the SSL context, to avoid the behavior that was
    deprecated in pyOpenSSL 25.1.0.
    (:issue:`6859`, :issue:`7353`)

-   Improved memory usage of :class:`~scrapy.spiders.sitemap.SitemapSpider` and
    :class:`scrapy.utils.sitemap.Sitemap`.
    (:issue:`3529`, :issue:`7007`)

-   Improved the scheduling behavior of
    :class:`~scrapy.pqueues.DownloaderAwarePriorityQueue` when crawling
    multiple domains.
    (:issue:`7293`, :issue:`7351`)

-   :class:`~scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler` and
    :class:`~scrapy.core.downloader.handlers.http2.H2DownloadHandler` now handle
    TLS verbose logging (see :setting:`DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING`)
    directly instead of relying on
    :class:`~scrapy.core.downloader.contextfactory._ScrapyClientContextFactory`.
    (:issue:`7387`)

-   The server certificate verification code now correctly handles certificates
    with IP addresses in ``subjectAltName``.
    (:issue:`7353`)

-   Improved reliability of :func:`scrapy.utils.trackref.get_oldest`.
    (:issue:`1758`, :issue:`7375`)

-   Other code refactoring and improvements.
    (:issue:`7210`, :issue:`7238`, :issue:`7376`, :issue:`7386`, :issue:`7395`,
    :issue:`7405`, :issue:`7410`)

Bug fixes
~~~~~~~~~

-   :ref:`Media pipelines <topics-media-pipeline>` should now wait for uploads
    to asynchronous storages (e.g.
    :class:`~scrapy.pipelines.files.S3FilesStore`) to complete.
    (:issue:`2183`, :issue:`6369`, :issue:`7182`)

-   Fixed merging ``*_BASE`` settings (e.g. merging
    :setting:`DOWNLOADER_MIDDLEWARES` with
    :setting:`DOWNLOADER_MIDDLEWARES_BASE`) when a component is referred to by
    a class object in one setting and by a string import path in the other one.
    (:issue:`6912`, :issue:`6993`)

-   ``scrapy runspider`` and ``scrapy crawl`` now set the exit code to 1 if an
    exception occurs early (this had been broken since Scrapy 2.13.0).
    (:issue:`6820`, :issue:`7255`)

-   Fixed repeated warnings about data loss (see
    :setting:`DOWNLOAD_FAIL_ON_DATALOSS`) not being suppressed in
    :class:`~scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler`.
    (:issue:`7222`)

-   Improved FTP connection management in
    :class:`scrapy.pipelines.files.FTPFilesStore`.
    (:issue:`7256`)

-   Fixed the ``spider`` variable in the :ref:`shell <topics-shell>`, which
    wasn't available since Scrapy 2.13.0.
    (:issue:`7395`)

Documentation
~~~~~~~~~~~~~

-   The ``llms.txt`` and ``llms-full.txt`` files and Markdown versions of pages
    are now generated when the HTML documentation is built.
    (:issue:`7380`)

-   Added a "Copy as Markdown" button to the HTML documentation.
    (:issue:`7380`)

-   Added :ref:`docs for using Pydantic models as items <pydantic-items>`.
    (:issue:`6955`, :issue:`6966`)

-   Documented :ref:`job directory contents <job-dir-contents>`.
    (:issue:`4842`, :issue:`5260`)

-   Improved docs for :attr:`~scrapy.Request.dont_filter`.
    (:issue:`6398`, :issue:`7245`)

-   Clarified that settings related to :setting:`TWISTED_DNS_RESOLVER` are only
    taken into account if the selected resolver supports them.
    (:issue:`7385`)

-   Other documentation improvements and fixes.
    (:issue:`7248`, :issue:`7274`, :issue:`7406`, :issue:`7408`)

Quality assurance
~~~~~~~~~~~~~~~~~

-   Added the ``no-reactor`` test environment that doesn't install a Twisted
    reactor and uses ``pytest-asyncio`` instead of ``pytest-twisted`` to run
    asynchronous test functions.
    (:issue:`6952`, :issue:`7189`, :issue:`7233`, :issue:`7234`, :issue:`7254`,
    :issue:`7259`)

-   Fixed running tests with ``pytest-xdist``.
    (:issue:`7216`, :issue:`7257`)

-   Type hints improvements and fixes.
    (:issue:`7300`, :issue:`7331`)

-   CI and test improvements and fixes.
    (:issue:`7060`,
    :issue:`7223`,
    :issue:`7232`,
    :issue:`7241`,
    :issue:`7250`,
    :issue:`7256`,
    :issue:`7276`,
    :issue:`7277`,
    :issue:`7279`,
    :issue:`7329`,
    :issue:`7363`,
    :issue:`7381`,
    :issue:`7402`)

.. _release-2.14.2:

Scrapy 2.14.2 (2026-03-12)
--------------------------

Security bug fixes
~~~~~~~~~~~~~~~~~~

-   Values from the ``Referrer-Policy`` header of HTTP responses are no longer
    executed as Python callables. See the `cwxj-rr6w-m6w7`_ security advisory
    for details.

    .. _cwxj-rr6w-m6w7: https://github.com/scrapy/scrapy/security/advisories/GHSA-cwxj-rr6w-m6w7

-   In line with the `standard
    <https://fetch.spec.whatwg.org/#http-redirect-fetch>`__, 301 redirects of
    ``POST`` requests are converted into ``GET`` requests.

    Converting to a ``GET`` request implies not only a method change, but also
    omitting the body and ``Content-*`` headers in the redirect request. On
    cross-origin redirects (for example, cross-domain redirects), this is
    effectively a security bug fix for scenarios where the body contains
    secrets.

Deprecations
~~~~~~~~~~~~

-   Passing a response URL string as the first positional argument to
    :meth:`scrapy.spidermiddlewares.referer.RefererMiddleware.policy` is
    deprecated. Pass a :class:`~scrapy.http.Response` instead.

    The parameter has also been renamed to ``response`` to reflect this change.
    The old parameter name (``resp_or_url``) is deprecated.

New features
~~~~~~~~~~~~

-   Added a new setting, :setting:`REFERER_POLICIES`, to allow customizing
    supported referrer policies.

Bug fixes
~~~~~~~~~

-   Made additional redirect scenarios convert to ``GET`` in line with the
    `standard <https://fetch.spec.whatwg.org/#http-redirect-fetch>`__:

    -   Only ``POST`` 302 redirects are converted into ``GET`` requests; other
        methods are preserved.

    -   ``HEAD`` 303 redirects are not converted into ``GET`` requests.

    -   ``GET`` 303 redirects do not have their body or standard ``Content-*``
        headers removed.

-   Redirects where the original request body is dropped now also have their
    ``Content-Encoding``, ``Content-Language`` and ``Content-Location`` headers
    removed, in addition to the ``Content-Type`` and ``Content-Length`` headers
    that were already being removed.

-   Redirects now preserve the source URL fragment if the redirect URL does not
    include one. This is useful when using browser-based download handlers,
    such as `scrapy-playwright`_ or `scrapy-zyte-api`_, while letting Scrapy
    handle redirects.

    .. _scrapy-playwright: https://github.com/scrapy-plugins/scrapy-playwright
    .. _scrapy-zyte-api: https://scrapy-zyte-api.readthedocs.io/en/latest/

-   The ``Referer`` header is now removed on redirect if
    :class:`~scrapy.spidermiddlewares.referer.RefererMiddleware` is disabled.

-   The handling of the ``Referer`` header on redirects now takes into account
    the ``Referrer-Policy`` header of the response that triggers the redirect.

.. _release-2.14.1:

Scrapy 2.14.1 (2026-01-12)
--------------------------

Deprecations
~~~~~~~~~~~~

-   ``scrapy.utils.defer.maybeDeferred_coro()`` is deprecated. (:issue:`7212`)

Bug fixes
~~~~~~~~~

-   Fixed the engine not passing the ``spider`` argument to the
    ``open_spider()`` and ``close_spider()`` methods of custom stats
    collectors that require it.

    Note, however, that the ``spider`` argument is now deprecated and will stop
    being passed in a future version of Scrapy.

    (:issue:`7213`)

Quality assurance
~~~~~~~~~~~~~~~~~

-   Replaced deprecated ``codecov/test-results-action@v1`` GitHub Action with
    ``codecov/codecov-action@v5``.
    (:issue:`7180`, :issue:`7215`)

.. _release-2.14.0:

Scrapy 2.14.0 (2026-01-05)
--------------------------

Highlights:

-   More coroutine-based replacements for Deferred-based APIs

-   The default priority queue is now ``DownloaderAwarePriorityQueue``

-   Dropped support for Python 3.9 and PyPy 3.10

-   Improved and documented the API for custom download handlers

Modified requirements
~~~~~~~~~~~~~~~~~~~~~

-   Dropped support for Python 3.9.
    (:issue:`7121`)

-   Dropped support for PyPy 3.10.
    (:issue:`7050`)

-   Increased the minimum versions of the following dependencies:

    - lxml_: 4.6.0 → 4.6.4

    - Pillow_ (optional dependency): 8.0.0 → 8.3.2

    - botocore_ (optional dependency): 1.4.87 → 1.13.45

-   Restored support for ``brotlicffi``, which was dropped in Scrapy 2.13.4.
    Its minimum supported version is now ``1.2.0.0``.
    (:issue:`7160`)

Backward-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-   If you set the :setting:`TWISTED_REACTOR` setting to a :ref:`non-asyncio
    value <disable-asyncio>` at the :ref:`spider level <spider-settings>`, you
    may now need to set the :setting:`FORCE_CRAWLER_PROCESS` setting to
    ``True`` when running Scrapy via :ref:`its command-line tool
    <topics-commands-crawlerprocess>` to avoid a reactor mismatch exception.
    (:issue:`6845`)

-   The ``log_count/*`` stats no longer count some of the early messages that
    they counted before. While the earliest log messages, emitted before the
    counter is initialized, were never counted, the counter initialization now
    happens later than in previous Scrapy versions. You may need to adjust
    expected values if you retrieve and compare values of these stats in your
    code.
    (:issue:`7046`)

-   The classes listed below are now :term:`abstract base classes <abstract
    base class>`. They cannot be instantiated directly, and their subclasses
    must override the abstract methods listed below before they can be
    instantiated. If you previously instantiated these classes directly, you
    will now need to subclass them and provide trivial (e.g. empty)
    implementations of the abstract methods.

    - :class:`scrapy.commands.ScrapyCommand`

        - :meth:`~scrapy.commands.ScrapyCommand.run`

        - :meth:`~scrapy.commands.ScrapyCommand.short_desc`

    - :class:`scrapy.exporters.BaseItemExporter`

        - :meth:`~scrapy.exporters.BaseItemExporter.export_item`

    - :class:`scrapy.extensions.feedexport.BlockingFeedStorage`

        - :meth:`~scrapy.extensions.feedexport.BlockingFeedStorage._store_in_thread`

    - :class:`scrapy.middleware.MiddlewareManager`

        - :meth:`~scrapy.middleware.MiddlewareManager._get_mwlist_from_settings`

    - :class:`scrapy.spidermiddlewares.referer.ReferrerPolicy`

        - :meth:`~scrapy.spidermiddlewares.referer.ReferrerPolicy.referrer`

    (:issue:`6930`)

-   Scrapy no longer passes a ``spider`` argument to any methods of the
    :setting:`stats collector <STATS_CLASS>`. It wasn't passed in many of the
    calls even in older Scrapy versions, so we don't expect existing custom
    stats collector implementations to require a ``spider`` argument. If your
    implementation needs a :class:`~scrapy.Spider` instance, you can get it
    from the :class:`~scrapy.crawler.Crawler` instance passed to the
    constructor.
    (:issue:`7011`)

-   :class:`scrapy.middleware.MiddlewareManager` no longer includes code for
    handling ``open_spider()`` and ``close_spider()`` component methods. As
    this code was only used for pipelines, it was moved into
    :class:`scrapy.pipelines.ItemPipelineManager`. This change should only
    affect custom subclasses of :class:`~scrapy.middleware.MiddlewareManager`.
    The following code was moved:

    - ``scrapy.middleware.MiddlewareManager.open_spider()``

    - ``scrapy.middleware.MiddlewareManager.close_spider()``

    - Code in ``scrapy.middleware.MiddlewareManager._add_middleware()`` that
      processes ``open_spider()`` and ``close_spider()`` component methods.

    (:issue:`7006`)

-   :meth:`scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware.process_request`
    now returns a coroutine; previously it returned a
    :class:`~twisted.internet.defer.Deferred` object or ``None``. The
    ``robot_parser()`` method was also changed to return a coroutine. This
    change only impacts code that subclasses
    :class:`~scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware` or
    calls its methods directly.
    (:issue:`6802`)

-   The built-in :ref:`download handlers <download-handlers-ref>` have been
    refactored, changing the signatures of their methods. This change should
    only affect user code that subclasses any of these handlers or calls their
    methods directly.
    (:issue:`6778`, :issue:`7164`)

-   :meth:`scrapy.pipelines.media.MediaPipeline.process_item` now returns a
    coroutine; previously it returned a
    :class:`~twisted.internet.defer.Deferred` object. This change only impacts
    code that calls this method directly.
    (:issue:`7177`)

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

-   The ``from_settings()`` method of the following components, deprecated in
    Scrapy 2.12.0, is removed. You should use ``from_crawler()`` instead.

    - :class:`scrapy.dupefilters.RFPDupeFilter`
    - :class:`scrapy.mail.MailSender`
    - :class:`scrapy.middleware.MiddlewareManager`
    - :class:`scrapy.core.downloader.contextfactory.ScrapyClientContextFactory`
    - :class:`scrapy.pipelines.files.FilesPipeline`
    - :class:`scrapy.pipelines.images.ImagesPipeline`

    (:issue:`7126`)

-   Scrapy no longer calls ``from_settings()`` methods of 3rd-party
    :ref:`components <topics-components>`, deprecated in Scrapy 2.12.0. You
    should define a ``from_crawler()`` method instead.
    (:issue:`7126`)

-   The initialization flow of :class:`scrapy.pipelines.media.MediaPipeline`
    and its subclasses was simplified; it now mandates ``from_crawler()``
    methods and ``crawler`` arguments in ``__init__()`` methods. Not using
    these was deprecated in Scrapy 2.12.0.
    (:issue:`7126`)

-   The ``REQUEST_FINGERPRINTER_IMPLEMENTATION`` setting, deprecated in Scrapy
    2.12.0, is removed.
    (:issue:`7126`)

-   The ``scrapy.utils.misc.create_instance()`` function, deprecated in Scrapy
    2.12.0, is removed. Use :func:`scrapy.utils.misc.build_from_crawler`
    instead.
    (:issue:`7126`)

-   The ``scrapy.core.downloader.Downloader._get_slot_key()`` function,
    deprecated in Scrapy 2.12.0, is removed. Use
    :meth:`scrapy.core.downloader.Downloader.get_slot_key` instead.
    (:issue:`7126`)

-   The ``scrapy.twisted_version`` attribute, deprecated in Scrapy 2.12.0, is
    removed. You should instead use the :attr:`twisted.version` attribute
    directly.
    (:issue:`7126`)

-   The following utility functions, deprecated in Scrapy 2.12.0, are removed:

    - ``scrapy.utils.defer.process_chain_both()``
    - ``scrapy.utils.python.equal_attributes()``
    - ``scrapy.utils.python.flatten()``
    - ``scrapy.utils.python.iflatten()``
    - ``scrapy.utils.request.request_authenticate()``
    - ``scrapy.utils.test.assert_samelines()``

    (:issue:`7126`)

-   ``scrapy.utils.serialize.ScrapyJSONDecoder``, deprecated in Scrapy 2.12.0,
    is removed.
    (:issue:`7126`)

-   The ``scrapy.extensions.feedexport.build_storage()`` function, deprecated
    in Scrapy 2.12.0, is removed; you can instead call the builder callable
    directly.
    (:issue:`7126`)

-   ``scrapy.spidermiddlewares.offsite.OffsiteMiddleware``, deprecated in
    Scrapy 2.11.2, is removed.
    :class:`scrapy.downloadermiddlewares.offsite.OffsiteMiddleware` should be
    used instead.
    (:issue:`6926`)

Deprecations
~~~~~~~~~~~~

-   The following methods that return a
    :class:`~twisted.internet.defer.Deferred` are deprecated in favor of their
    coroutine-based replacements:

    - :class:`scrapy.core.downloader.handlers.DownloadHandlers`

        - ``download_request()`` (use
          :meth:`~scrapy.core.downloader.handlers.DownloadHandlers.download_request_async`)

    - :class:`scrapy.core.downloader.middleware.DownloaderMiddlewareManager`

        - ``download()`` (use
          :meth:`~scrapy.core.downloader.middleware.DownloaderMiddlewareManager.download_async`)

    - :class:`scrapy.core.engine.ExecutionEngine`

        - ``start()`` (use
          :meth:`~scrapy.core.engine.ExecutionEngine.start_async`)

        - ``stop()`` (use
          :meth:`~scrapy.core.engine.ExecutionEngine.stop_async`)

        - ``close()`` (use
          :meth:`~scrapy.core.engine.ExecutionEngine.close_async`)

        - ``open_spider()`` (use
          :meth:`~scrapy.core.engine.ExecutionEngine.open_spider_async`)

        - ``close_spider()`` (use
          :meth:`~scrapy.core.engine.ExecutionEngine.close_spider_async`)

        - ``download()`` (use
          :meth:`~scrapy.core.engine.ExecutionEngine.download_async`)

    - :class:`scrapy.core.scraper.Scraper`

        - ``open_spider()`` (use
          :meth:`~scrapy.core.scraper.Scraper.open_spider_async`)

        - ``call_spider()`` (use
          :meth:`~scrapy.core.scraper.Scraper.call_spider_async`)

        - ``close_spider()`` (use
          :meth:`~scrapy.core.scraper.Scraper.close_spider_async`)

        - ``handle_spider_output()`` (use
          :meth:`~scrapy.core.scraper.Scraper.handle_spider_output_async`)

        - ``start_itemproc()`` (use
          :meth:`~scrapy.core.scraper.Scraper.start_itemproc_async`)

    - :class:`scrapy.core.spidermw.SpiderMiddlewareManager`

        - ``scrape_response()`` (use
          :meth:`~scrapy.core.spidermw.SpiderMiddlewareManager.scrape_response_async`)

    - :class:`scrapy.crawler.Crawler`

        - ``stop()`` (use :meth:`~scrapy.crawler.Crawler.stop_async`)

    - :class:`scrapy.pipelines.ItemPipelineManager`

        - ``process_item()`` (use
          :meth:`~scrapy.pipelines.ItemPipelineManager.process_item_async`)

        - ``open_spider()`` (use
          :meth:`~scrapy.pipelines.ItemPipelineManager.open_spider_async`)

        - ``close_spider()`` (use
          :meth:`~scrapy.pipelines.ItemPipelineManager.close_spider_async`)

    - :class:`scrapy.signalmanager.SignalManager`

        - ``send_catch_log_deferred()`` (use
          :meth:`~scrapy.signalmanager.SignalManager.send_catch_log_async`)

    - ``scrapy.utils.signal.send_catch_log_deferred()`` (use
      :func:`scrapy.utils.signal.send_catch_log_async`)

    (:issue:`6791`, :issue:`6842`, :issue:`6979`, :issue:`6997`, :issue:`6999`,
    :issue:`7005`, :issue:`7043`, :issue:`7069`, :issue:`7161`, :issue:`7164`)

-   The following spider attributes are deprecated in favor of settings:

    - ``download_maxsize`` (use :setting:`DOWNLOAD_MAXSIZE`)

    - ``download_timeout`` (use :setting:`DOWNLOAD_TIMEOUT`)

    - ``download_warnsize`` (use :setting:`DOWNLOAD_WARNSIZE`)

    - ``max_concurrent_requests`` (use :setting:`CONCURRENT_REQUESTS`)

    - ``user_agent`` (use :setting:`USER_AGENT`)

    (:issue:`6988`, :issue:`6994`, :issue:`7038`, :issue:`7039`, :issue:`7117`,
    :issue:`7176`)
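
    For example, instead of setting a ``download_timeout`` attribute on the
    spider:

    .. code-block:: python

        import scrapy


        class MySpider(scrapy.Spider):
            name = "my_spider"
            custom_settings = {
                "DOWNLOAD_TIMEOUT": 30,
            }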

-   Returning a :class:`~twisted.internet.defer.Deferred` from the following
    user-defined functions is deprecated in favor of defining them as coroutine
    functions:

    - spider callbacks and errbacks (which was never officially supported and
      may work incorrectly)

    - the ``process_request()``, ``process_response()`` and
      ``process_exception()`` methods of custom downloader middlewares

    - the ``process_item()``, ``open_spider()`` and ``close_spider()`` methods
      of custom pipelines

    - signal handlers

    - the ``download_request()`` and ``close()`` methods of custom download
      handlers

    (:issue:`6718`, :issue:`6778`, :issue:`7069`, :issue:`7147`, :issue:`7148`,
    :issue:`7149`, :issue:`7150`, :issue:`7151`, :issue:`7161`, :issue:`7164`,
    :issue:`7179`)
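
    For example, a downloader middleware method defined as a coroutine
    function instead of returning a Deferred (a minimal sketch; note that
    requiring a ``spider`` argument is also deprecated, see below):

    .. code-block:: python

        class MyDownloaderMiddleware:
            async def process_request(self, request):
                ...  # e.g. await an asynchronous client here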

-   Passing a ``spider`` argument to the following methods is deprecated:

    - :meth:`scrapy.core.spidermw.SpiderMiddlewareManager.process_start`

    - :meth:`scrapy.core.downloader.Downloader.fetch`

    - :meth:`scrapy.core.downloader.Downloader._get_slot`

    - :meth:`scrapy.core.downloader.handlers.DownloadHandlers.download_request`

    - all public methods of :class:`scrapy.statscollectors.StatsCollector`

    - :meth:`scrapy.spidermiddlewares.base.BaseSpiderMiddleware.process_spider_output`

    - :meth:`scrapy.spidermiddlewares.base.BaseSpiderMiddleware.process_spider_output_async`

    - all ``process_*()`` methods of built-in downloader middlewares

    - all ``process_*()`` methods of built-in spider middlewares

    - :meth:`scrapy.pipelines.media.MediaPipeline.open_spider`

    - :meth:`scrapy.pipelines.media.MediaPipeline.process_item`

    (:issue:`6750`, :issue:`6927`, :issue:`6984`, :issue:`7006`, :issue:`7011`,
    :issue:`7033`, :issue:`7037`, :issue:`7045`, :issue:`7178`)

-   Instantiating subclasses of :class:`scrapy.middleware.MiddlewareManager`
    without a :class:`~scrapy.crawler.Crawler` instance is deprecated.
    (:issue:`6984`)

-   Requiring a ``spider`` argument in the following user-defined functions
    and methods is deprecated. If you need a :class:`~scrapy.Spider` instance
    inside them, get it from the :class:`~scrapy.crawler.Crawler` instance
    (you may need to refactor your code to save that instance, e.g. in the
    ``from_crawler()`` method):

    - the ``process_request()``, ``process_response()`` and
      ``process_exception()`` methods of custom downloader middlewares

    - the ``process_spider_input()``, ``process_spider_output()``,
      ``process_spider_output_async()`` and ``process_spider_exception()``
      methods of custom spider middlewares

    - the ``process_item()`` method of custom pipelines

    - the ``fetch()`` method of a custom :setting:`DOWNLOADER`

    (:issue:`6927`, :issue:`6984`, :issue:`7006`, :issue:`7037`)

-   The following things in custom download handlers are deprecated:

    - not having a ``lazy`` attribute (you should define it as ``True`` if you
      want to keep the current behavior)

    - returning a :class:`~twisted.internet.defer.Deferred` from the
      ``download_request()`` method (you should refactor it to return a
      coroutine; you also need to remove the ``spider`` argument when doing
      this)

    - not having a ``close()`` method, having a synchronous one or one that
      returns a :class:`~twisted.internet.defer.Deferred` (you should refactor
      it to return a coroutine or add an empty one if you don't have it)

    (:issue:`6778`, :issue:`7164`)
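
    A skeleton that follows the new expectations (a minimal sketch):

    .. code-block:: python

        class MyDownloadHandler:
            lazy = True  # keep the current lazy instantiation behavior

            async def download_request(self, request):
                ...  # download and return a Response

            async def close(self):
                ...  # release resources, if any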

-   Custom implementations of :setting:`ITEM_PROCESSOR` should now define
    ``process_item_async()``, ``open_spider_async()`` and
    ``close_spider_async()`` methods instead of, or in addition to,
    ``process_item()``, ``open_spider()`` and ``close_spider()``.
    (:issue:`7005`, :issue:`7043`)

-   The ``CONCURRENT_REQUESTS_PER_IP`` setting is deprecated, use
    :setting:`CONCURRENT_REQUESTS_PER_DOMAIN` instead.
    (:issue:`6917`, :issue:`6921`)

-   The ``scrapy.core.downloader.handlers.http`` module is deprecated. You
    should import
    :class:`scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler`
    directly instead of importing the
    ``scrapy.core.downloader.handlers.http.HTTPDownloadHandler`` alias.
    (:issue:`7079`)

-   The ``scrapy.utils.decorators.defers()`` decorator is deprecated, you can
    use :func:`twisted.internet.defer.maybeDeferred` directly or reimplement
    this decorator in your code.
    (:issue:`7164`)

-   ``scrapy.spiders.CrawlSpider._parse_response()`` is deprecated, use
    :meth:`scrapy.spiders.CrawlSpider.parse_with_rules` instead.
    (:issue:`4463`, :issue:`6804`)

-   The following functions, which add a delay to a Deferred, are deprecated.
    Their underlying Twisted functions can be used instead: directly if no
    delay is needed, or combined with an explicit way to add a delay if one is
    needed:

    - ``scrapy.utils.defer.mustbe_deferred()`` (you can use
      :func:`twisted.internet.defer.maybeDeferred`)

    - ``scrapy.utils.defer.defer_succeed()`` (you can use
      :func:`twisted.internet.defer.succeed`)

    - ``scrapy.utils.defer.defer_fail()`` (you can use
      :func:`twisted.internet.defer.fail`)

    - ``scrapy.utils.defer.defer_result()`` (you can use
      :func:`twisted.internet.defer.succeed` and
      :func:`twisted.internet.defer.fail`)

    (:issue:`6937`)

New features
~~~~~~~~~~~~

-   Added :class:`scrapy.crawler.AsyncCrawlerProcess` and
    :class:`scrapy.crawler.AsyncCrawlerRunner` as counterparts to
    :class:`~scrapy.crawler.CrawlerProcess` and
    :class:`~scrapy.crawler.CrawlerRunner` that offer coroutine-based APIs.
    (:issue:`6789`, :issue:`6790`, :issue:`6796`, :issue:`6817`, :issue:`6845`,
    :issue:`7034`)
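
    A minimal usage sketch, assuming an API analogous to
    :class:`~scrapy.crawler.CrawlerProcess` (``MySpider`` is hypothetical):

    .. code-block:: python

        from scrapy.crawler import AsyncCrawlerProcess

        process = AsyncCrawlerProcess()
        process.crawl(MySpider)
        process.start()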

-   Added coroutine counterparts to some of the Deferred-based APIs:

    - :class:`scrapy.core.downloader.handlers.DownloadHandlers`

        - :meth:`~scrapy.core.downloader.handlers.DownloadHandlers.download_request_async`
          (to ``download_request()``)

    - :class:`scrapy.core.downloader.middleware.DownloaderMiddlewareManager`

        - :meth:`~scrapy.core.downloader.middleware.DownloaderMiddlewareManager.download_async`
          (to ``download()``)

    - :class:`scrapy.core.engine.ExecutionEngine`

        - :meth:`~scrapy.core.engine.ExecutionEngine.start_async` (to
          ``start()``)

        - :meth:`~scrapy.core.engine.ExecutionEngine.stop_async` (to
          ``stop()``)

        - :meth:`~scrapy.core.engine.ExecutionEngine.close_async` (to
          ``close()``)

        - :meth:`~scrapy.core.engine.ExecutionEngine.open_spider_async` (to
          ``open_spider()``)

        - :meth:`~scrapy.core.engine.ExecutionEngine.close_spider_async` (to
          ``close_spider()``)

        - :meth:`~scrapy.core.engine.ExecutionEngine.download_async` (to
          ``download()``)

    - :class:`scrapy.core.scraper.Scraper`

        - :meth:`~scrapy.core.scraper.Scraper.open_spider_async` (to
          ``open_spider()``)

        - :meth:`~scrapy.core.scraper.Scraper.close_spider_async` (to
          ``close_spider()``)

        - :meth:`~scrapy.core.scraper.Scraper.start_itemproc_async` (to
          ``start_itemproc()``)

    - :class:`scrapy.crawler.Crawler`

        - :meth:`~scrapy.crawler.Crawler.crawl_async` (to ``crawl()``)

        - :meth:`~scrapy.crawler.Crawler.stop_async` (to ``stop()``)

    - :class:`scrapy.pipelines.ItemPipelineManager`

        - :meth:`~scrapy.pipelines.ItemPipelineManager.process_item_async` (to
          ``process_item()``)

        - :meth:`~scrapy.pipelines.ItemPipelineManager.open_spider_async` (to
          ``open_spider()``)

        - :meth:`~scrapy.pipelines.ItemPipelineManager.close_spider_async` (to
          ``close_spider()``)

    - :class:`scrapy.signalmanager.SignalManager`

        - :meth:`~scrapy.signalmanager.SignalManager.send_catch_log_async` (to
          ``send_catch_log_deferred()``)

    (:issue:`6781`, :issue:`6791`, :issue:`6792`, :issue:`6795`, :issue:`6801`,
    :issue:`6817`, :issue:`6842`, :issue:`6997`, :issue:`7005`, :issue:`7043`,
    :issue:`7069`, :issue:`7164`, :issue:`7202`)

-   The default value of the :setting:`SCHEDULER_PRIORITY_QUEUE` setting is now
    ``'scrapy.pqueues.DownloaderAwarePriorityQueue'``.
    (:issue:`6924`, :issue:`6940`)
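
    To restore the previous default, set the old value in ``settings.py``:

    .. code-block:: python

        SCHEDULER_PRIORITY_QUEUE = "scrapy.pqueues.ScrapyPriorityQueue"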

-   Added :class:`scrapy.extensions.logcount.LogCount`, an enabled-by-default
    extension that is responsible for the ``log_count/*`` stats. Previously,
    this code was in :class:`scrapy.crawler.Crawler` and couldn't be disabled.
    (:issue:`7046`)

-   Added :meth:`scrapy.spiders.CrawlSpider.parse_with_rules` as a public
    replacement for ``_parse_response()``.
    (:issue:`4463`, :issue:`6804`)

-   Added :func:`scrapy.utils.asyncio.is_asyncio_available` as an alternative
    to :func:`scrapy.utils.reactor.is_asyncio_reactor_installed` with a
    future-proof name and semantics.
    (:issue:`6827`)

-   The API for :ref:`download handlers <topics-download-handlers>`, previously
    undocumented, has been modernized and documented. An optional base class,
    :class:`scrapy.core.downloader.handlers.base.BaseDownloadHandler`, has been
    added to simplify writing custom download handlers that conform to the
    current API.
    (:issue:`4944`, :issue:`6778`, :issue:`7164`)

-   Added :func:`scrapy.utils.defer.ensure_awaitable`, which can be helpful to
    call user-defined functions that can return coroutines, Deferreds or
    values directly.
    (:issue:`7005`)
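
    A minimal usage sketch (``handler`` and ``request`` are hypothetical):

    .. code-block:: python

        from scrapy.utils.defer import ensure_awaitable


        async def run(handler, request):
            # works whether handler() returns a coroutine, a Deferred or a
            # plain value
            return await ensure_awaitable(handler(request))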

-   The ``requests.seen`` file, written by
    :class:`~scrapy.dupefilters.RFPDupeFilter` when :ref:`job persistence
    <topics-jobs>` is enabled, now uses line buffering to reduce data loss
    when a spider crashes.
    (:issue:`6019`, :issue:`7094`)

-   Images downloaded by :class:`~scrapy.pipelines.images.ImagesPipeline` are
    now automatically transposed based on EXIF data.
    (:issue:`6525`, :issue:`6975`)

Improvements
~~~~~~~~~~~~

-   Refactored internal functions to use coroutines instead of Deferreds.
    (:issue:`6795`, :issue:`6852`, :issue:`6855`, :issue:`6858`, :issue:`7159`)

-   Commands that don't need a :class:`~scrapy.crawler.CrawlerProcess` instance
    no longer create it.
    (:issue:`6824`)

-   Improved :command:`shell` help formatting when using IPython 9+.
    (:issue:`6915`, :issue:`6980`)

Bug fixes
~~~~~~~~~

-   Setting :setting:`FILES_STORE` or :setting:`IMAGES_STORE` to ``None`` now
    correctly disables the respective pipeline.
    (:issue:`6964`, :issue:`6969`)

-   :class:`~scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware` now
    uses the URL set in the ``<base>`` tag as the base URL when redirecting to
    a relative URL.
    (:issue:`7042`, :issue:`7047`)

-   Passing ``None`` as a value of the :reqmeta:`download_slot` request meta
    key is now handled in the same way as not setting this meta key at all.
    (:issue:`7172`)

-   Fixed parsing of the first line of ``robots.txt`` files that have a BOM.
    (:issue:`6195`, :issue:`7095`)

Documentation
~~~~~~~~~~~~~

-   Added :ref:`documentation <topics-download-handlers>` about download
    handlers, their API and built-in handlers.
    (:issue:`4944`, :issue:`7164`)

-   Added a section about the `scrapy-spider-metadata`_ library to the
    :ref:`spider argument docs <spiderargs-scrapy-spider-metadata>`.
    (:issue:`6676`, :issue:`6957`, :issue:`7116`)

    .. _scrapy-spider-metadata: https://scrapy-spider-metadata.readthedocs.io/en/latest/

-   Improved :ref:`the docs <coroutine-deferred-apis>` about coroutine-based
    and Deferred-based APIs.
    (:issue:`6800`, :issue:`7146`)

-   Other documentation improvements and fixes.
    (:issue:`7058`, :issue:`7076`, :issue:`7109`, :issue:`7195`, :issue:`7198`)

Quality assurance
~~~~~~~~~~~~~~~~~

-   Switched from ``twisted.trial`` to ``pytest-twisted`` and replaced
    remaining ``unittest`` and ``twisted.trial`` features with ``pytest`` ones.
    (:issue:`6658`, :issue:`6873`, :issue:`6884`, :issue:`6938`)

-   Enabled fancy ``pytest`` asserts.
    (:issue:`6888`)

-   Added `Sphinx Lint`_ to the ``pre-commit`` configuration.
    (:issue:`6920`)

    .. _Sphinx Lint: https://github.com/sphinx-contrib/sphinx-lint

-   CI and test improvements and fixes.
    (:issue:`6649`,
    :issue:`6769`,
    :issue:`6821`,
    :issue:`6835`,
    :issue:`6836`,
    :issue:`6846`,
    :issue:`6883`,
    :issue:`6885`,
    :issue:`6889`,
    :issue:`6905`,
    :issue:`6928`,
    :issue:`6933`,
    :issue:`6941`,
    :issue:`6942`,
    :issue:`6945`,
    :issue:`6947`,
    :issue:`6960`,
    :issue:`6968`,
    :issue:`6972`,
    :issue:`6974`,
    :issue:`6996`,
    :issue:`7003`,
    :issue:`7012`,
    :issue:`7013`,
    :issue:`7050`,
    :issue:`7059`,
    :issue:`7070`,
    :issue:`7073`,
    :issue:`7118`,
    :issue:`7127`,
    :issue:`7141`,
    :issue:`7143`,
    :issue:`7145`,
    :issue:`7173`)

-   Code cleanups.
    (:issue:`6803`,
    :issue:`6838`,
    :issue:`6849`,
    :issue:`6875`,
    :issue:`6876`,
    :issue:`6892`,
    :issue:`6930`,
    :issue:`6949`,
    :issue:`6970`,
    :issue:`6977`,
    :issue:`6986`,
    :issue:`7008`,
    :issue:`7177`)

.. _release-2.13.4:

Scrapy 2.13.4 (2025-11-17)
--------------------------

Security bug fixes
~~~~~~~~~~~~~~~~~~

-   Improved protection against decompression bombs in
    :class:`~scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware`
    for responses compressed using the ``br`` and ``deflate`` methods: if a
    single compressed chunk would be larger than the response size limit (see
    :setting:`DOWNLOAD_MAXSIZE`) when decompressed, decompression is no longer
    carried out. This is especially important for the ``br`` (Brotli) method
    that can provide a very high compression ratio. Please, see the
    `CVE-2025-6176`_ and `GHSA-2qfp-q593-8484`_ security advisories for more
    information.
    (:issue:`7134`)

    .. _CVE-2025-6176: https://nvd.nist.gov/vuln/detail/CVE-2025-6176
    .. _GHSA-2qfp-q593-8484: https://github.com/advisories/GHSA-2qfp-q593-8484

Modified requirements
~~~~~~~~~~~~~~~~~~~~~

-   The minimum supported version of the optional ``brotli`` package is now
    ``1.2.0``.
    (:issue:`7134`)

-   The ``brotlicffi`` and ``brotlipy`` packages can no longer be used to
    decompress Brotli-compressed responses. Please install the ``brotli``
    package instead.
    (:issue:`7134`)

Other changes
~~~~~~~~~~~~~

-   Restricted the maximum supported Twisted version to ``25.5.0``, as Scrapy
    currently uses some private APIs changed in later Twisted versions.
    (:issue:`7142`)

-   Stopped setting the ``COVERAGE_CORE`` environment variable in tests; it
    had no effect but caused the ``coverage`` module to produce a warning or
    an error.
    (:issue:`7137`)

-   Removed the documentation build dependency on the deprecated
    ``sphinx-hoverxref`` module.
    (:issue:`6786`, :issue:`6922`)

.. _release-2.13.3:

Scrapy 2.13.3 (2025-07-02)
--------------------------

-   Changed the values for :setting:`DOWNLOAD_DELAY` (from ``0`` to ``1``) and
    :setting:`CONCURRENT_REQUESTS_PER_DOMAIN` (from ``8`` to ``1``) in the
    default project template.
    (:issue:`6597`, :issue:`6918`, :issue:`6923`)

-   Improved :class:`scrapy.core.engine.ExecutionEngine` logic related to
    initialization and exception handling, fixing several cases where the
    spider would crash, hang or log an unhandled exception.
    (:issue:`6783`, :issue:`6784`, :issue:`6900`, :issue:`6908`, :issue:`6910`,
    :issue:`6911`)

-   Fixed a Windows issue with :ref:`feed exports <topics-feed-exports>` using
    :class:`scrapy.extensions.feedexport.FileFeedStorage` that caused the file
    to be created on the wrong drive.
    (:issue:`6894`, :issue:`6897`)

-   Allowed running tests with Twisted 25.5.0+ again. Running tests in
    non-pinned environments now requires pytest 8.4.1+, which added support
    for the new Twisted version.
    (:issue:`6893`)

-   Fixed running tests with lxml 6.0.0+.
    (:issue:`6919`)

-   Added a deprecation notice for
    ``scrapy.spidermiddlewares.offsite.OffsiteMiddleware`` to :ref:`the Scrapy
    2.11.2 release notes <release-2.11.2>`.
    (:issue:`6926`)

-   Updated :ref:`contribution docs <topics-contributing>` to refer to ruff_
    instead of black_.
    (:issue:`6903`)

-   Added ``.venv/`` and ``.vscode/`` to ``.gitignore``.
    (:issue:`6901`, :issue:`6907`)

.. _release-2.13.2:

Scrapy 2.13.2 (2025-06-09)
--------------------------

-   Fixed a bug introduced in Scrapy 2.13.0 that caused results of request
    errbacks to be ignored when the errback was called because of a downloader
    error.
    (:issue:`6861`, :issue:`6863`)

-   Added a note about the behavior change of
    :func:`scrapy.utils.reactor.is_asyncio_reactor_installed` to its docs and
    to the "Backward-incompatible changes" section of :ref:`the Scrapy 2.13.0
    release notes <release-2.13.0>`.
    (:issue:`6866`)

-   Improved the message in the exception raised by
    :func:`scrapy.utils.test.get_reactor_settings` when there is no reactor
    installed.
    (:issue:`6866`)

-   Updated the :class:`scrapy.crawler.CrawlerRunner` examples in
    :ref:`topics-practices` to install the reactor explicitly, to fix
    reactor-related errors with Scrapy 2.13.0 and later.
    (:issue:`6865`)

-   Fixed ``scrapy fetch`` not working with scrapy-poet_.
    (:issue:`6872`)

-   Fixed an exception produced by :class:`scrapy.core.engine.ExecutionEngine`
    when it's closed before being fully initialized.
    (:issue:`6857`, :issue:`6867`)

-   Improved the README and updated the Scrapy logo in it.
    (:issue:`6831`, :issue:`6833`, :issue:`6839`)

-   Restricted the Twisted version used in tests to below 25.5.0, as some tests
    fail with 25.5.0.
    (:issue:`6878`, :issue:`6882`)

-   Updated type hints for Twisted 25.5.0 changes.
    (:issue:`6882`)

-   Removed the old artwork.
    (:issue:`6874`)

.. _release-2.13.1:

Scrapy 2.13.1 (2025-05-28)
--------------------------

-   Gave callback requests precedence over start requests when priority values
    are the same.

    This makes the start request handling changes from 2.13.0 more intuitive
    and backward compatible. When all requests have the same priority, in
    2.13.0 all start requests were sent before the first callback request. In
    2.13.1, as in 2.12 and lower, start requests are only sent when there are
    not enough pending callback requests to reach concurrency limits.

    (:issue:`6828`)

-   Added a deepwiki_ badge to the README. (:issue:`6793`)

    .. _deepwiki: https://deepwiki.com/scrapy/scrapy

-   Fixed a typo in the code example of :ref:`start-requests-lazy`.
    (:issue:`6812`, :issue:`6815`)

-   Fixed a typo in the :ref:`coroutine-support` section of the documentation.
    (:issue:`6822`)

-   Listed this page more prominently among the PyPI project links.
    (:issue:`6826`)

.. _release-2.13.0:

Scrapy 2.13.0 (2025-05-08)
--------------------------

Highlights:

-   The asyncio reactor is now enabled by default

-   Replaced ``start_requests()`` (sync) with :meth:`~scrapy.Spider.start`
    (async) and changed how it is iterated

-   Added the :reqmeta:`allow_offsite` request meta key

-   :ref:`Spider middlewares that don't support asynchronous spider output
    <sync-async-spider-middleware>` are deprecated

-   Added a base class for :ref:`universal spider middlewares
    <universal-spider-middleware>`

Modified requirements
~~~~~~~~~~~~~~~~~~~~~

-   Dropped support for PyPy 3.9.
    (:issue:`6613`)

-   Added support for PyPy 3.11.
    (:issue:`6697`)

Backward-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-   The default value of the :setting:`TWISTED_REACTOR` setting was changed
    from ``None`` to
    ``"twisted.internet.asyncioreactor.AsyncioSelectorReactor"``. This value
    was used in newly generated projects since Scrapy 2.7.0 but now existing
    projects that don't explicitly set this setting will also use the asyncio
    reactor. You can :ref:`change this setting in your project
    <disable-asyncio>` to use a different reactor.
    (:issue:`6659`, :issue:`6713`)

-   The iteration of start requests and items no longer stops once there are
    requests in the scheduler, and instead runs continuously until all start
    requests have been scheduled.

    To reproduce the previous behavior, see :ref:`start-requests-lazy`.
    (:issue:`6729`)

-   An unhandled exception from the
    :meth:`~scrapy.spidermiddlewares.SpiderMiddleware.open_spider` method of a
    :ref:`spider middleware <topics-spider-middleware>` no longer stops the
    crawl.
    (:issue:`6729`)

-   In ``scrapy.core.engine.ExecutionEngine``:

    -   The second parameter of ``open_spider()``, ``start_requests``, has been
        removed. The start requests are determined by the ``spider`` parameter
        instead (see :meth:`~scrapy.Spider.start`).

    -   The ``slot`` attribute has been renamed to ``_slot`` and should not be
        used.

    (:issue:`6729`)

-   In ``scrapy.core.engine``, the ``Slot`` class has been renamed to ``_Slot``
    and should not be used.
    (:issue:`6729`)

-   The ``slot`` :ref:`telnet variable <telnet-vars>` has been removed.
    (:issue:`6729`)

-   In ``scrapy.core.spidermw.SpiderMiddlewareManager``,
    ``process_start_requests()`` has been replaced by ``process_start()``.
    (:issue:`6729`)

-   The now-deprecated ``start_requests()`` method, when it returns an iterable
    instead of being defined as a generator, is now executed *after* the
    :ref:`scheduler <topics-scheduler>` instance has been created.
    (:issue:`6729`)

-   When using :setting:`JOBDIR`, :ref:`start requests <start-requests>` are
    now serialized into their own, ``s``-suffixed priority folders. You can set
    :setting:`SCHEDULER_START_DISK_QUEUE` to ``None`` or ``""`` to change that,
    but the side effects may be undesirable. See
    :setting:`SCHEDULER_START_DISK_QUEUE` for details.
    (:issue:`6729`)

-   The URL length limit, set by the :setting:`URLLENGTH_LIMIT` setting, is now
    also enforced for start requests.
    (:issue:`6777`)

-   Calling :func:`scrapy.utils.reactor.is_asyncio_reactor_installed` without
    an installed reactor now raises an exception instead of installing a
    reactor. This shouldn't affect normal Scrapy use cases, but it may affect
    3rd-party test suites that use Scrapy internals such as
    :class:`~scrapy.crawler.Crawler` and don't install a reactor explicitly. If
    you are affected by this change, you most likely need to install the
    reactor before running Scrapy code that expects it to be installed.
    (:issue:`6732`, :issue:`6735`)

-   The ``from_settings()`` method of
    :class:`~scrapy.spidermiddlewares.urllength.UrlLengthMiddleware`,
    deprecated in Scrapy 2.12.0, is removed earlier than the usual deprecation
    period (this was needed because after the introduction of the
    :class:`~scrapy.spidermiddlewares.base.BaseSpiderMiddleware` base class and
    switching built-in spider middlewares to it those middlewares need the
    :class:`~scrapy.crawler.Crawler` instance at run time). Please use
    ``from_crawler()`` instead.
    (:issue:`6693`)

-   ``scrapy.utils.url.escape_ajax()`` is no longer called when a
    :class:`~scrapy.Request` instance is created. It was only useful for
    websites supporting the ``_escaped_fragment_`` feature which most modern
    websites don't support. If you still need this you can modify the URLs
    before passing them to :class:`~scrapy.Request`.
    (:issue:`6523`, :issue:`6651`)

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

-   Removed old deprecated name aliases for some signals:

    - ``stats_spider_opened`` (use ``spider_opened`` instead)

    - ``stats_spider_closing`` and ``stats_spider_closed`` (use
      ``spider_closed`` instead)

    - ``item_passed`` (use ``item_scraped`` instead)

    - ``request_received`` (use ``request_scheduled`` instead)

    (:issue:`6654`, :issue:`6655`)

Deprecations
~~~~~~~~~~~~

-   The ``start_requests()`` method of :class:`~scrapy.Spider` is deprecated,
    use :meth:`~scrapy.Spider.start` instead, or both to maintain support for
    lower Scrapy versions.
    (:issue:`456`, :issue:`3477`, :issue:`4467`, :issue:`5627`, :issue:`6729`)
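
    A minimal migration sketch:

    .. code-block:: python

        import scrapy


        class MySpider(scrapy.Spider):
            name = "my_spider"

            async def start(self):
                yield scrapy.Request("https://example.com")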

-   The ``process_start_requests()`` method of :ref:`spider middlewares
    <topics-spider-middleware>` is deprecated, use
    :meth:`~scrapy.spidermiddlewares.SpiderMiddleware.process_start` instead,
    or both to maintain support for lower Scrapy versions.
    (:issue:`456`, :issue:`3477`, :issue:`4467`, :issue:`5627`, :issue:`6729`)

-   The ``__init__`` method of priority queue classes (see
    :setting:`SCHEDULER_PRIORITY_QUEUE`) should now support a keyword-only
    ``start_queue_cls`` parameter.
    (:issue:`6752`)

-   :ref:`Spider middlewares that don't support asynchronous spider output
    <sync-async-spider-middleware>` are deprecated. The async iterable
    downgrading feature, needed for using such middlewares with asynchronous
    callbacks and with other spider middlewares that produce asynchronous
    iterables, is also deprecated. Please update all such middlewares to
    support asynchronous spider output.
    (:issue:`6664`)

-   Functions that were imported from :mod:`w3lib.url` and re-exported in
    :mod:`scrapy.utils.url` are now deprecated, you should import them from
    :mod:`w3lib.url` directly. They are:

    - ``scrapy.utils.url.add_or_replace_parameter()``

    - ``scrapy.utils.url.add_or_replace_parameters()``

    - ``scrapy.utils.url.any_to_uri()``

    - ``scrapy.utils.url.canonicalize_url()``

    - ``scrapy.utils.url.file_uri_to_path()``

    - ``scrapy.utils.url.is_url()``

    - ``scrapy.utils.url.parse_data_uri()``

    - ``scrapy.utils.url.parse_url()``

    - ``scrapy.utils.url.path_to_file_uri()``

    - ``scrapy.utils.url.safe_download_url()``

    - ``scrapy.utils.url.safe_url_string()``

    - ``scrapy.utils.url.url_query_cleaner()``

    - ``scrapy.utils.url.url_query_parameter()``

    (:issue:`4577`, :issue:`6583`, :issue:`6586`)

-   HTTP/1.0 support code is deprecated. It was disabled by default and
    couldn't be used together with HTTP/1.1. If you still need it, you should
    write your own download handler or copy the code from Scrapy. The
    deprecations include:

    - ``scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler``

    - ``scrapy.core.downloader.webclient.ScrapyHTTPClientFactory``

    - ``scrapy.core.downloader.webclient.ScrapyHTTPPageGetter``

    - Overriding
      ``scrapy.core.downloader.contextfactory.ScrapyClientContextFactory.getContext()``

    (:issue:`6634`)

-   The following modules and functions used only in tests are deprecated:

    - the ``scrapy.utils.testproc`` module

    - the ``scrapy.utils.testsite`` module

    - ``scrapy.utils.test.assert_gcs_environ()``

    - ``scrapy.utils.test.get_ftp_content_and_delete()``

    - ``scrapy.utils.test.get_gcs_content_and_delete()``

    - ``scrapy.utils.test.mock_google_cloud_storage()``

    - ``scrapy.utils.test.skip_if_no_boto()``

    If you need to use them in your tests or code, you can copy the code from Scrapy.
    (:issue:`6696`)

-   ``scrapy.utils.test.TestSpider`` is deprecated. If you need an empty spider
    class you can use :class:`scrapy.utils.spider.DefaultSpider` or create your
    own subclass of :class:`scrapy.Spider`.
    (:issue:`6678`)

-   ``scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware`` is
    deprecated. It was disabled by default and isn't useful for most existing
    websites.
    (:issue:`6523`, :issue:`6651`, :issue:`6656`)

-   ``scrapy.utils.url.escape_ajax()`` is deprecated.
    (:issue:`6523`, :issue:`6651`)

-   ``scrapy.spiders.init.InitSpider`` is deprecated. If you find it useful,
    you can copy its code from Scrapy.
    (:issue:`6708`, :issue:`6714`)

-   ``scrapy.utils.versions.scrapy_components_versions()`` is deprecated, use
    :func:`scrapy.utils.versions.get_versions` instead.
    (:issue:`6582`)

-   ``BaseDupeFilter.log()`` is deprecated. It does nothing and shouldn't be
    called.
    (:issue:`4151`)

-   Passing the ``spider`` argument to the following methods of
    :class:`~scrapy.core.scraper.Scraper` is deprecated:

    - ``close_spider()``

    - ``enqueue_scrape()``

    - ``handle_spider_error()``

    - ``handle_spider_output()``

    (:issue:`6764`)

New features
~~~~~~~~~~~~

-   You can now yield the start requests and items of a spider from the
    :meth:`~scrapy.Spider.start` spider method and from the
    :meth:`~scrapy.spidermiddlewares.SpiderMiddleware.process_start` spider
    middleware method, both :term:`asynchronous generators <python:asynchronous
    generator>`.

    This makes it possible to use asynchronous code to generate those start
    requests and items, e.g. reading them from a queue service or database
    using an asynchronous client, without workarounds.
    (:issue:`456`, :issue:`3477`, :issue:`4467`, :issue:`5627`, :issue:`6729`)

-   Start requests are now :ref:`scheduled <topics-scheduler>` as soon as
    possible.

    As a result, their :attr:`~scrapy.Request.priority` is now taken into
    account as soon as :setting:`CONCURRENT_REQUESTS` is reached.
    (:issue:`456`, :issue:`3477`, :issue:`4467`, :issue:`5627`, :issue:`6729`)

-   :class:`Crawler.signals <scrapy.signalmanager.SignalManager>` has a new
    :meth:`~scrapy.signalmanager.SignalManager.wait_for` method.
    (:issue:`6729`)

-   Added a new :signal:`scheduler_empty` signal.
    (:issue:`6729`)

-   Added new settings: :setting:`SCHEDULER_START_DISK_QUEUE` and
    :setting:`SCHEDULER_START_MEMORY_QUEUE`.
    (:issue:`6729`)

-   Added :class:`~scrapy.spidermiddlewares.start.StartSpiderMiddleware`, which
    sets :reqmeta:`is_start_request` to ``True`` on :ref:`start requests
    <start-requests>`.
    (:issue:`6729`)

-   Exposed a new method of :class:`Crawler.engine
    <scrapy.core.engine.ExecutionEngine>`:
    :meth:`~scrapy.core.engine.ExecutionEngine.needs_backout`.
    (:issue:`6729`)

-   Added the :reqmeta:`allow_offsite` request meta key that can be used
    instead of the more general :attr:`~scrapy.Request.dont_filter` request
    attribute to skip processing of the request by
    :class:`~scrapy.downloadermiddlewares.offsite.OffsiteMiddleware` (but not
    by other code that checks :attr:`~scrapy.Request.dont_filter`).
    (:issue:`3690`, :issue:`6151`, :issue:`6366`)
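
    For example, inside a spider callback (the URL is hypothetical):

    .. code-block:: python

        def parse(self, response):
            yield scrapy.Request(
                "https://other-domain.example", meta={"allow_offsite": True}
            )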

-   Added an optional base class for spider middlewares,
    :class:`~scrapy.spidermiddlewares.base.BaseSpiderMiddleware`, which can be
    helpful for writing :ref:`universal spider middlewares
    <universal-spider-middleware>` without boilerplate and code duplication.
    The built-in spider middlewares now inherit from this class.
    (:issue:`6693`, :issue:`6777`)

-   :ref:`Scrapy add-ons <topics-addons>` can now define a class method called
    ``update_pre_crawler_settings()`` to update :ref:`pre-crawler settings
    <pre-crawler-settings>`.
    (:issue:`6544`, :issue:`6568`)

-   Added :ref:`helpers <priority-dict-helpers>` for modifying :ref:`component
    priority dictionary <component-priority-dictionaries>` settings.
    (:issue:`6614`)

-   Responses that use an unknown/unsupported encoding now produce a warning.
    If Scrapy knows that installing an additional package (such as brotli_)
    will allow decoding the response, that will be mentioned in the warning.
    (:issue:`4697`, :issue:`6618`)

-   Added the ``spider_exceptions/count`` stat which tracks the total count of
    exceptions (tracked also by per-type ``spider_exceptions/*`` stats).
    (:issue:`6739`, :issue:`6740`)

-   Added the :setting:`DEFAULT_DROPITEM_LOG_LEVEL` setting and the
    :attr:`scrapy.exceptions.DropItem.log_level` attribute that allow
    customizing the log level of the message that is logged when an item is
    dropped.
    (:issue:`6603`, :issue:`6608`)
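
    A minimal sketch, assuming the attribute can be overridden on a subclass:

    .. code-block:: python

        from scrapy.exceptions import DropItem


        class QuietDropItem(DropItem):
            log_level = "INFO"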

-   Added support for the ``-b, --cookie`` curl argument to
    :meth:`scrapy.Request.from_curl`.
    (:issue:`6684`)
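
    For example (the cookie value is hypothetical):

    .. code-block:: python

        import scrapy

        request = scrapy.Request.from_curl(
            "curl https://example.com -b 'session=abc123'"
        )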

-   Added the :setting:`LOG_VERSIONS` setting that allows customizing the
    list of software whose versions are logged when the spider starts.
    (:issue:`6582`)

-   Added the :setting:`WARN_ON_GENERATOR_RETURN_VALUE` setting that allows
    disabling run time analysis of callback code used to warn about incorrect
    ``return`` statements in generator-based callbacks. You may need to disable
    this setting if this analysis breaks on your callback code.
    (:issue:`6731`, :issue:`6738`)

Improvements
~~~~~~~~~~~~

-   Removed or postponed some calls of :func:`itemadapter.is_item` to increase
    performance.
    (:issue:`6719`)

-   Improved the error message when running a ``scrapy`` command that requires
    a project (such as ``scrapy crawl``) outside of a project directory.
    (:issue:`2349`, :issue:`3426`)

-   Added an empty :setting:`ADDONS` setting to the ``settings.py`` template
    for new projects.
    (:issue:`6587`)

Bug fixes
~~~~~~~~~

-   Yielding an item from :meth:`Spider.start <scrapy.Spider.start>` or from
    :meth:`SpiderMiddleware.process_start
    <scrapy.spidermiddlewares.SpiderMiddleware.process_start>` no longer delays
    the next iteration of starting requests and items by up to 5 seconds.
    (:issue:`6729`)

-   Fixed calculation of ``items_per_minute`` and ``responses_per_minute``
    stats.
    (:issue:`6599`)

-   Fixed an error initializing
    :class:`scrapy.extensions.feedexport.GCSFeedStorage`.
    (:issue:`6617`, :issue:`6628`)

-   Fixed an error running ``scrapy bench``.
    (:issue:`6632`, :issue:`6633`)

-   Fixed duplicated log messages about the reactor and the event loop.
    (:issue:`6636`, :issue:`6657`)

-   Fixed resolving type annotations of ``SitemapSpider._parse_sitemap()`` at
    run time, required by tools such as scrapy-poet_.
    (:issue:`6665`, :issue:`6671`)

    .. _scrapy-poet: https://github.com/scrapinghub/scrapy-poet

-   Calling :func:`scrapy.utils.reactor.is_asyncio_reactor_installed` without
    an installed reactor now raises an exception instead of installing a
    reactor.
    (:issue:`6732`, :issue:`6735`)

-   Restored support for the ``x-gzip`` content encoding.
    (:issue:`6618`)

Documentation
~~~~~~~~~~~~~

-   Documented the setting values set in the default project template.
    (:issue:`6762`, :issue:`6775`)

-   Improved the :ref:`docs <sync-async-spider-middleware>` about asynchronous
    iterable support in spider middlewares.
    (:issue:`6688`)

-   Improved the :ref:`docs <coroutine-deferred-apis>` about using
    :class:`~twisted.internet.defer.Deferred`-based APIs in coroutine-based
    code and included a list of such APIs.
    (:issue:`6677`, :issue:`6734`, :issue:`6776`)

-   Improved the :ref:`contribution docs <topics-contributing>`.
    (:issue:`6561`, :issue:`6575`)

-   Removed ``Splash`` from the :ref:`headless browser
    <topics-headless-browsing>` recommendations. We no longer recommend using
    ``Splash``; use other headless browser solutions instead.
    (:issue:`6642`, :issue:`6701`)

-   Added a dark mode to the HTML documentation.
    (:issue:`6653`)

-   Other documentation improvements and fixes.
    (:issue:`4151`,
    :issue:`6526`,
    :issue:`6620`,
    :issue:`6621`,
    :issue:`6622`,
    :issue:`6623`,
    :issue:`6624`,
    :issue:`6721`,
    :issue:`6723`,
    :issue:`6780`)

Packaging
~~~~~~~~~

-   Switched from ``setup.py`` to ``pyproject.toml``.
    (:issue:`6514`, :issue:`6547`)

-   Switched the build backend from setuptools_ to hatchling_.
    (:issue:`6771`)

    .. _hatchling: https://pypi.org/project/hatchling/

Quality assurance
~~~~~~~~~~~~~~~~~

-   Replaced most linters with ruff_.
    (:issue:`6565`,
    :issue:`6576`,
    :issue:`6577`,
    :issue:`6581`,
    :issue:`6584`,
    :issue:`6595`,
    :issue:`6601`,
    :issue:`6631`)

    .. _ruff: https://docs.astral.sh/ruff/

-   Improved accuracy and performance of collecting test coverage.
    (:issue:`6255`, :issue:`6610`)

-   Fixed an error that prevented running tests from directories other than
    the top-level source directory.
    (:issue:`6567`)

-   Reduced the number of ``mockserver`` calls in tests to improve the
    overall test run time.
    (:issue:`6637`, :issue:`6648`)

-   Fixed tests that were running the same test code more than once.
    (:issue:`6646`, :issue:`6647`, :issue:`6650`)

-   Refactored tests to use more ``pytest`` features instead of ``unittest``
    ones where possible.
    (:issue:`6678`,
    :issue:`6680`,
    :issue:`6695`,
    :issue:`6699`,
    :issue:`6700`,
    :issue:`6702`,
    :issue:`6709`,
    :issue:`6710`,
    :issue:`6711`,
    :issue:`6712`,
    :issue:`6725`)

-   Type hints improvements and fixes.
    (:issue:`6578`,
    :issue:`6579`,
    :issue:`6593`,
    :issue:`6605`,
    :issue:`6694`)

-   CI and test improvements and fixes.
    (:issue:`5360`,
    :issue:`6271`,
    :issue:`6547`,
    :issue:`6560`,
    :issue:`6602`,
    :issue:`6607`,
    :issue:`6609`,
    :issue:`6613`,
    :issue:`6619`,
    :issue:`6626`,
    :issue:`6679`,
    :issue:`6703`,
    :issue:`6704`,
    :issue:`6716`,
    :issue:`6720`,
    :issue:`6722`,
    :issue:`6724`,
    :issue:`6741`,
    :issue:`6743`,
    :issue:`6766`,
    :issue:`6770`,
    :issue:`6772`,
    :issue:`6773`)

-   Code cleanups.
    (:issue:`6600`,
    :issue:`6606`,
    :issue:`6635`,
    :issue:`6764`)

.. _release-2.12.0:

Scrapy 2.12.0 (2024-11-18)
--------------------------

Highlights:

-   Dropped support for Python 3.8, added support for Python 3.13

-   ``scrapy.Spider.start_requests()`` can now yield items

-   Added :class:`~scrapy.http.JsonResponse`

-   Added :setting:`CLOSESPIDER_PAGECOUNT_NO_ITEM`

Modified requirements
~~~~~~~~~~~~~~~~~~~~~

-   Dropped support for Python 3.8.
    (:issue:`6466`, :issue:`6472`)

-   Added support for Python 3.13.
    (:issue:`6166`)

-   Minimum versions increased for these dependencies:

    -   Twisted_: 18.9.0 → 21.7.0

    -   cryptography_: 36.0.0 → 37.0.0

    -   pyOpenSSL_: 21.0.0 → 22.0.0

    -   lxml_: 4.4.1 → 4.6.0

-   Removed ``setuptools`` from the dependency list.
    (:issue:`6487`)

Backward-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-   User-defined cookies for HTTPS requests will have the ``secure`` flag set
    to ``True`` unless it is explicitly set to ``False``. This is important
    when these cookies are reused in HTTP requests, e.g. after a redirect to
    an HTTP URL.
    (:issue:`6357`)

-   The Reppy-based ``robots.txt`` parser,
    ``scrapy.robotstxt.ReppyRobotParser``, was removed, as it doesn't support
    Python 3.9+.
    (:issue:`5230`, :issue:`6099`, :issue:`6499`)

-   The initialization API of :class:`scrapy.pipelines.media.MediaPipeline`
    and its subclasses was improved, and some previously working usage
    scenarios may no longer work. This only affects you if you define custom
    subclasses of ``MediaPipeline`` or create instances of these pipelines
    via ``from_settings()`` or ``__init__()`` calls instead of
    ``from_crawler()`` calls.

    Previously, ``MediaPipeline.from_crawler()`` called the
    ``from_settings()`` method if it existed or the ``__init__()`` method
    otherwise, and then did some additional initialization using the
    ``crawler`` instance. If the ``from_settings()`` method existed (as in
    ``FilesPipeline``), it called ``__init__()`` to create the instance. It
    wasn't possible to override ``from_crawler()`` without calling
    ``MediaPipeline.from_crawler()`` from it, which, in turn, couldn't be
    called in some cases (including subclasses of ``FilesPipeline``).

    Now, in line with the general usage of ``from_crawler()`` and
    ``from_settings()``, and with the deprecation of the latter, the
    recommended initialization order is the following (see the sketch after
    this list):

    - All ``__init__()`` methods should take a ``crawler`` argument. If they
      also take a ``settings`` argument they should ignore it, using
      ``crawler.settings`` instead. When they call ``__init__()`` of the base
      class they should pass the ``crawler`` argument to it too.
    - A ``from_settings()`` method shouldn't be defined. Class-specific
      initialization code should go into either an overridden
      ``from_crawler()`` method or into ``__init__()``.
    - It's now possible to override ``from_crawler()``, and it's not necessary
      to call ``MediaPipeline.from_crawler()`` in it if the other
      recommendations are followed.
    - If pipeline instances were created with ``from_settings()`` or
      ``__init__()`` calls (which wasn't properly supported even before, as
      those calls skipped important initialization code), they should now be
      created with ``from_crawler()`` calls.

    (:issue:`6540`)
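
    For example, a subclass following these recommendations might look like
    this (a minimal sketch; ``MYPIPELINE_MAX_SIZE`` is a hypothetical
    setting, and the exact base ``__init__()`` signature may vary between
    Scrapy versions):

    .. code-block:: python

        from scrapy.pipelines.files import FilesPipeline


        class MyFilesPipeline(FilesPipeline):
            def __init__(
                self, store_uri, download_func=None, settings=None, *, crawler=None
            ):
                # Ignore any settings argument, use crawler.settings instead,
                # and pass the crawler on to the base class.
                super().__init__(store_uri, download_func, crawler=crawler)
                self.max_size = crawler.settings.getint("MYPIPELINE_MAX_SIZE", 0)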

-   The ``response_body`` argument of :meth:`ImagesPipeline.convert_image
    <scrapy.pipelines.images.ImagesPipeline.convert_image>` is now
    positional-only, as it was changed from optional to required.
    (:issue:`6500`)

-   The ``convert`` argument of :func:`scrapy.utils.conf.build_component_list`
    is now positional-only, as the preceding argument (``custom``) was removed.
    (:issue:`6500`)

-   The ``overwrite_output`` argument of
    :func:`scrapy.utils.conf.feed_process_params_from_cli` is now
    positional-only, as the preceding argument (``output_format``) was removed.
    (:issue:`6500`)

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

-   Removed the ``scrapy.utils.request.request_fingerprint()`` function,
    deprecated in Scrapy 2.7.0.
    (:issue:`6212`, :issue:`6213`)

-   Removed support for value ``"2.6"`` of setting
    ``REQUEST_FINGERPRINTER_IMPLEMENTATION``, deprecated in Scrapy 2.7.0.
    (:issue:`6212`, :issue:`6213`)

-   :class:`~scrapy.dupefilters.RFPDupeFilter` subclasses must now support
    the ``fingerprinter`` parameter, introduced in Scrapy 2.7.0, in their
    ``__init__`` method.
    (:issue:`6102`, :issue:`6113`)

-   Removed the ``scrapy.downloadermiddlewares.decompression`` module,
    deprecated in Scrapy 2.7.0.
    (:issue:`6100`, :issue:`6113`)

-   Removed the ``scrapy.utils.response.response_httprepr()`` function,
    deprecated in Scrapy 2.6.0.
    (:issue:`6111`, :issue:`6116`)

-   Spiders with spider-level HTTP authentication, i.e. with the ``http_user``
    or ``http_pass`` attributes, must now define ``http_auth_domain`` as well,
    which was introduced in Scrapy 2.5.1.
    (:issue:`6103`, :issue:`6113`)

-   :ref:`Media pipelines <topics-media-pipeline>` methods ``file_path()``,
    ``file_downloaded()``, ``get_images()``, ``image_downloaded()``,
    ``media_downloaded()``, ``media_to_download()``, and ``thumb_path()`` must
    now support an ``item`` parameter, added in Scrapy 2.4.0.
    (:issue:`6107`, :issue:`6113`)

-   The ``__init__()`` and ``from_crawler()`` methods of :ref:`feed storage
    backend classes <topics-feed-storage>` must now support the keyword-only
    ``feed_options`` parameter, introduced in Scrapy 2.4.0.
    (:issue:`6105`, :issue:`6113`)

-   Removed the ``scrapy.loader.common`` and ``scrapy.loader.processors``
    modules, deprecated in Scrapy 2.3.0.
    (:issue:`6106`, :issue:`6113`)

-   Removed the ``scrapy.utils.misc.extract_regex()`` function, deprecated in
    Scrapy 2.3.0.
    (:issue:`6106`, :issue:`6113`)

-   Removed the ``scrapy.http.JSONRequest`` class, replaced with
    ``JsonRequest`` in Scrapy 1.8.0.
    (:issue:`6110`, :issue:`6113`)

-   ``scrapy.utils.log.logformatter_adapter`` no longer supports missing
    ``args``, ``level``, or ``msg`` parameters, and no longer supports a
    ``format`` parameter, all scenarios that were deprecated in Scrapy 1.0.0.
    (:issue:`6109`, :issue:`6116`)

-   A custom class assigned to the :setting:`SPIDER_LOADER_CLASS` setting that
    does not implement the :class:`~scrapy.interfaces.ISpiderLoader` interface
    will now raise a :exc:`zope.interface.verify.DoesNotImplement` exception at
    run time. Non-compliant classes have been triggering a deprecation warning
    since Scrapy 1.0.0.
    (:issue:`6101`, :issue:`6113`)

-   Removed the ``--output-format``/``-t`` command line option, deprecated in
    Scrapy 2.1.0. ``-O <URI>:<FORMAT>`` should be used instead.
    (:issue:`6500`)

-   Running :meth:`~scrapy.crawler.Crawler.crawl` more than once on the same
    :class:`~scrapy.crawler.Crawler` instance, deprecated in Scrapy 2.11.0, now
    raises an exception.
    (:issue:`6500`)

-   Subclassing
    :class:`~scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware`
    without support for the ``crawler`` argument in ``__init__()`` and without
    a custom ``from_crawler()`` method, deprecated in Scrapy 2.5.0, is no
    longer allowed.
    (:issue:`6500`)

-   Removed the ``EXCEPTIONS_TO_RETRY`` attribute of
    :class:`~scrapy.downloadermiddlewares.retry.RetryMiddleware`, deprecated in
    Scrapy 2.10.0.
    (:issue:`6500`)

-   Removed support for :ref:`S3 feed exports <topics-feed-storage-s3>` without
    the boto3_ package installed, deprecated in Scrapy 2.10.0.
    (:issue:`6500`)

-   Removed the ``scrapy.extensions.feedexport._FeedSlot`` class, deprecated in
    Scrapy 2.10.0.
    (:issue:`6500`)

-   Removed the ``scrapy.pipelines.images.NoimagesDrop`` exception, deprecated
    in Scrapy 2.8.0.
    (:issue:`6500`)

-   The ``response_body`` argument of :meth:`ImagesPipeline.convert_image
    <scrapy.pipelines.images.ImagesPipeline.convert_image>` is now required,
    not passing it was deprecated in Scrapy 2.8.0.
    (:issue:`6500`)

-   Removed the ``custom`` argument of
    :func:`scrapy.utils.conf.build_component_list`, deprecated in Scrapy
    2.10.0.
    (:issue:`6500`)

-   Removed the ``scrapy.utils.reactor.get_asyncio_event_loop_policy()``
    function, deprecated in Scrapy 2.9.0. Use :func:`asyncio.get_event_loop`
    and related standard library functions instead.
    (:issue:`6500`)

Deprecations
~~~~~~~~~~~~

-   The ``from_settings()`` methods of the :ref:`Scrapy components
    <topics-components>` that have them are now deprecated;
    ``from_crawler()`` should be used instead. Affected components:

    - :class:`scrapy.dupefilters.RFPDupeFilter`
    - :class:`scrapy.mail.MailSender`
    - :class:`scrapy.middleware.MiddlewareManager`
    - :class:`scrapy.core.downloader.contextfactory.ScrapyClientContextFactory`
    - :class:`scrapy.pipelines.files.FilesPipeline`
    - :class:`scrapy.pipelines.images.ImagesPipeline`
    - :class:`scrapy.spidermiddlewares.urllength.UrlLengthMiddleware`

    (:issue:`6540`)

-   Defining a ``from_settings()`` method but no ``from_crawler()`` method in
    3rd-party :ref:`Scrapy components <topics-components>` is now deprecated.
    If you don't want to refactor the code, you can fix this by defining a
    simple ``from_crawler()`` method that calls
    ``cls.from_settings(crawler.settings)``, as shown below. Note that if you
    have a ``from_crawler()`` method, Scrapy will not call the
    ``from_settings()`` method, so the latter can be removed.
    (:issue:`6540`)
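
    For example (a minimal sketch; ``MYCOMPONENT_ENABLED`` is a hypothetical
    setting):

    .. code-block:: python

        class MyComponent:
            def __init__(self, enabled):
                self.enabled = enabled

            @classmethod
            def from_settings(cls, settings):
                return cls(settings.getbool("MYCOMPONENT_ENABLED"))

            @classmethod
            def from_crawler(cls, crawler):
                # Delegate to the legacy method to avoid refactoring.
                return cls.from_settings(crawler.settings)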

-   The initialization API of :class:`scrapy.pipelines.media.MediaPipeline` and
    its subclasses was improved and some old usage scenarios are now deprecated
    (see also the "Backward-incompatible changes" section). Specifically:

    - It's deprecated to define an ``__init__()`` method that doesn't take a
      ``crawler`` argument.
    - It's deprecated to call an ``__init__()`` method without passing a
      ``crawler`` argument. If it's passed, it's also deprecated to pass a
      ``settings`` argument, which will be ignored anyway.
    - Calling ``from_settings()`` is deprecated, use ``from_crawler()``
      instead.
    - Overriding ``from_settings()`` is deprecated, override ``from_crawler()``
      instead.

    (:issue:`6540`)

-   The ``REQUEST_FINGERPRINTER_IMPLEMENTATION`` setting is now deprecated.
    (:issue:`6212`, :issue:`6213`)

-   The ``scrapy.utils.misc.create_instance()`` function is now deprecated, use
    :func:`scrapy.utils.misc.build_from_crawler` instead.
    (:issue:`5523`, :issue:`5884`, :issue:`6162`, :issue:`6169`, :issue:`6540`)

-   ``scrapy.core.downloader.Downloader._get_slot_key()`` is deprecated, use
    :meth:`scrapy.core.downloader.Downloader.get_slot_key` instead.
    (:issue:`6340`, :issue:`6352`)

-   ``scrapy.utils.defer.process_chain_both()`` is now deprecated.
    (:issue:`6397`)

-   ``scrapy.twisted_version`` is now deprecated; use :attr:`twisted.version`
    directly instead (but note that it's an ``incremental.Version`` object,
    not a tuple).
    (:issue:`6509`, :issue:`6512`)

-   ``scrapy.utils.python.flatten()`` and ``scrapy.utils.python.iflatten()``
    are now deprecated.
    (:issue:`6517`, :issue:`6519`)

-   ``scrapy.utils.python.equal_attributes()`` is now deprecated.
    (:issue:`6517`, :issue:`6519`)

-   ``scrapy.utils.request.request_authenticate()`` is now deprecated; set
    the ``Authorization`` header directly instead, as shown below.
    (:issue:`6517`, :issue:`6519`)
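
    For example (a minimal sketch; the credentials are illustrative):

    .. code-block:: python

        import base64

        from scrapy import Request

        token = base64.b64encode(b"user:password").decode()
        request = Request(
            "https://example.com",
            headers={"Authorization": f"Basic {token}"},
        )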

-   ``scrapy.utils.serialize.ScrapyJSONDecoder`` is now deprecated; it hasn't
    contained any code since Scrapy 1.0.0.
    (:issue:`6517`, :issue:`6519`)

-   ``scrapy.utils.test.assert_samelines()`` is now deprecated.
    (:issue:`6517`, :issue:`6519`)

-   ``scrapy.extensions.feedexport.build_storage()`` is now deprecated. You can
    instead call the builder callable directly.
    (:issue:`6540`)

New features
~~~~~~~~~~~~

-   ``scrapy.Spider.start_requests()`` can now yield items.
    (:issue:`5289`, :issue:`6417`)

    .. note:: Some spider middlewares may need to be updated for Scrapy 2.12
        support before you can use them in combination with the ability to
        yield items from ``start_requests()``.
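
    For example (a minimal sketch):

    .. code-block:: python

        import scrapy


        class ExampleSpider(scrapy.Spider):
            name = "example"

            def start_requests(self):
                # Items and requests can now be mixed here.
                yield {"source": "seed"}
                yield scrapy.Request("https://example.com")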

-   Added a new :class:`~scrapy.http.Response` subclass,
    :class:`~scrapy.http.JsonResponse`, for responses with a `JSON MIME type
    <https://mimesniff.spec.whatwg.org/#json-mime-type>`_.
    (:issue:`6069`, :issue:`6171`, :issue:`6174`)

-   The :class:`~scrapy.extensions.logstats.LogStats` extension now adds
    ``items_per_minute`` and ``responses_per_minute`` to the :ref:`stats
    <topics-stats>` when the spider closes.
    (:issue:`4110`, :issue:`4111`)

-   Added :setting:`CLOSESPIDER_PAGECOUNT_NO_ITEM`, which allows closing the
    spider if no items were scraped within a given number of crawled pages.
    (:issue:`6434`)

-   User-defined cookies can now include the ``secure`` field.
    (:issue:`6357`)

-   Added component getters to :class:`~scrapy.crawler.Crawler`:
    :meth:`~scrapy.crawler.Crawler.get_addon`,
    :meth:`~scrapy.crawler.Crawler.get_downloader_middleware`,
    :meth:`~scrapy.crawler.Crawler.get_extension`,
    :meth:`~scrapy.crawler.Crawler.get_item_pipeline`,
    :meth:`~scrapy.crawler.Crawler.get_spider_middleware`.
    (:issue:`6181`)
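
    For example, from code with access to a ``crawler`` object (a minimal
    sketch; it assumes that the
    :class:`~scrapy.extensions.logstats.LogStats` extension is enabled):

    .. code-block:: python

        from scrapy.extensions.logstats import LogStats

        # Returns the run-time instance of the given extension class.
        logstats = crawler.get_extension(LogStats)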

-   Slot delay updates by the :ref:`AutoThrottle extension
    <topics-autothrottle>` based on response latencies can now be disabled for
    specific requests via the :reqmeta:`autothrottle_dont_adjust_delay` meta
    key.
    (:issue:`6246`, :issue:`6527`)

-   If :setting:`SPIDER_LOADER_WARN_ONLY` is set to ``True``,
    :class:`~scrapy.spiderloader.SpiderLoader` does not raise
    :exc:`SyntaxError` but emits a warning instead.
    (:issue:`6483`, :issue:`6484`)

-   Added support for responses compressed multiple times (with several
    encodings in the ``Content-Encoding`` header).
    (:issue:`5143`, :issue:`5964`, :issue:`6063`)

-   Added support for multiple standard values in :setting:`REFERRER_POLICY`.
    (:issue:`6381`)

-   Added support for brotlicffi_ (previously named brotlipy_). brotli_ is
    still recommended but only brotlicffi_ works on PyPy.
    (:issue:`6263`, :issue:`6269`)

    .. _brotlicffi: https://github.com/python-hyper/brotlicffi

-   Added :class:`~scrapy.contracts.default.MetadataContract` that sets the
    request meta.
    (:issue:`6468`, :issue:`6469`)

Improvements
~~~~~~~~~~~~

-   Extended the list of file extensions that
    :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
    ignores by default.
    (:issue:`6074`, :issue:`6125`)

-   :func:`scrapy.utils.httpobj.urlparse_cached` is now used in more places
    instead of :func:`urllib.parse.urlparse`.
    (:issue:`6228`, :issue:`6229`)

Bug fixes
~~~~~~~~~

-   :class:`~scrapy.pipelines.media.MediaPipeline` is now an abstract class and
    its methods that were expected to be overridden in subclasses are now
    abstract methods.
    (:issue:`6365`, :issue:`6368`)

-   Fixed handling of invalid ``@``-prefixed lines in contract extraction.
    (:issue:`6383`, :issue:`6388`)

-   Importing ``scrapy.extensions.telnet`` no longer installs the default
    reactor.
    (:issue:`6432`)

-   Reduced the log verbosity for dropped requests, which had been increased
    in 2.11.2.
    (:issue:`6433`, :issue:`6475`)

Documentation
~~~~~~~~~~~~~

-   Added ``SECURITY.md`` that documents the security policy.
    (:issue:`5364`, :issue:`6051`)

-   Example code for :ref:`running Scrapy from a script <run-from-script>` no
    longer imports ``twisted.internet.reactor`` at the top level, which caused
    problems with non-default reactors when this code was used unmodified.
    (:issue:`6361`, :issue:`6374`)

-   Documented the :class:`~scrapy.extensions.spiderstate.SpiderState`
    extension.
    (:issue:`6278`, :issue:`6522`)

-   Other documentation improvements and fixes.
    (:issue:`5920`,
    :issue:`6094`,
    :issue:`6177`,
    :issue:`6200`,
    :issue:`6207`,
    :issue:`6216`,
    :issue:`6223`,
    :issue:`6317`,
    :issue:`6328`,
    :issue:`6389`,
    :issue:`6394`,
    :issue:`6402`,
    :issue:`6411`,
    :issue:`6427`,
    :issue:`6429`,
    :issue:`6440`,
    :issue:`6448`,
    :issue:`6449`,
    :issue:`6462`,
    :issue:`6497`,
    :issue:`6506`,
    :issue:`6507`,
    :issue:`6524`)

Quality assurance
~~~~~~~~~~~~~~~~~

-   Added ``py.typed``, in line with `PEP 561
    <https://peps.python.org/pep-0561/>`_.
    (:issue:`6058`, :issue:`6059`)

-   Fully covered the code with type hints (except for the most complicated
    parts, mostly related to ``twisted.web.http`` and other Twisted parts
    without type hints).
    (:issue:`5989`,
    :issue:`6097`,
    :issue:`6127`,
    :issue:`6129`,
    :issue:`6130`,
    :issue:`6133`,
    :issue:`6143`,
    :issue:`6191`,
    :issue:`6268`,
    :issue:`6274`,
    :issue:`6275`,
    :issue:`6276`,
    :issue:`6279`,
    :issue:`6325`,
    :issue:`6326`,
    :issue:`6333`,
    :issue:`6335`,
    :issue:`6336`,
    :issue:`6337`,
    :issue:`6341`,
    :issue:`6353`,
    :issue:`6356`,
    :issue:`6370`,
    :issue:`6371`,
    :issue:`6384`,
    :issue:`6385`,
    :issue:`6387`,
    :issue:`6391`,
    :issue:`6395`,
    :issue:`6414`,
    :issue:`6422`,
    :issue:`6460`,
    :issue:`6466`,
    :issue:`6472`,
    :issue:`6494`,
    :issue:`6498`,
    :issue:`6516`)

-   Improved Bandit_ checks.
    (:issue:`6260`, :issue:`6264`, :issue:`6265`)

-   Added pyupgrade_ to the ``pre-commit`` configuration.
    (:issue:`6392`)

    .. _pyupgrade: https://github.com/asottile/pyupgrade

-   Added ``flake8-bugbear``, ``flake8-comprehensions``, ``flake8-debugger``,
    ``flake8-docstrings``, ``flake8-string-format`` and
    ``flake8-type-checking`` to the ``pre-commit`` configuration.
    (:issue:`6406`, :issue:`6413`)

-   CI and test improvements and fixes.
    (:issue:`5285`,
    :issue:`5454`,
    :issue:`5997`,
    :issue:`6078`,
    :issue:`6084`,
    :issue:`6087`,
    :issue:`6132`,
    :issue:`6153`,
    :issue:`6154`,
    :issue:`6201`,
    :issue:`6231`,
    :issue:`6232`,
    :issue:`6235`,
    :issue:`6236`,
    :issue:`6242`,
    :issue:`6245`,
    :issue:`6253`,
    :issue:`6258`,
    :issue:`6259`,
    :issue:`6270`,
    :issue:`6272`,
    :issue:`6286`,
    :issue:`6290`,
    :issue:`6296`,
    :issue:`6367`,
    :issue:`6372`,
    :issue:`6403`,
    :issue:`6416`,
    :issue:`6435`,
    :issue:`6489`,
    :issue:`6501`,
    :issue:`6504`,
    :issue:`6511`,
    :issue:`6543`,
    :issue:`6545`)

-   Code cleanups.
    (:issue:`6196`,
    :issue:`6197`,
    :issue:`6198`,
    :issue:`6199`,
    :issue:`6254`,
    :issue:`6257`,
    :issue:`6285`,
    :issue:`6305`,
    :issue:`6343`,
    :issue:`6349`,
    :issue:`6386`,
    :issue:`6415`,
    :issue:`6463`,
    :issue:`6470`,
    :issue:`6499`,
    :issue:`6505`,
    :issue:`6510`,
    :issue:`6531`,
    :issue:`6542`)

Other
~~~~~

-   Issue tracker improvements. (:issue:`6066`)

.. _release-2.11.2:

Scrapy 2.11.2 (2024-05-14)
--------------------------

Security bug fixes
~~~~~~~~~~~~~~~~~~

-   Redirects to non-HTTP protocols are no longer followed. Please see the
    `23j4-mw76-5v7h security advisory`_ for more information. (:issue:`457`)

    .. _23j4-mw76-5v7h security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-23j4-mw76-5v7h

-   The ``Authorization`` header is now dropped on redirects to a different
    scheme (``http://`` or ``https://``) or port, even if the domain is the
    same. Please see the `4qqq-9vqf-3h3f security advisory`_ for more
    information.

    .. _4qqq-9vqf-3h3f security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-4qqq-9vqf-3h3f

-   When using system proxy settings that are different for ``http://`` and
    ``https://``, redirects to a different URL scheme will now also trigger
    the corresponding change in proxy settings for the redirected request.
    Please see the `jm3v-qxmh-hxwv security advisory`_ for more information.
    (:issue:`767`)

    .. _jm3v-qxmh-hxwv security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-jm3v-qxmh-hxwv

-   :attr:`Spider.allowed_domains <scrapy.Spider.allowed_domains>` is now
    enforced for all requests, and not only requests from spider callbacks.
    (:issue:`1042`, :issue:`2241`, :issue:`6358`)

-   :func:`~scrapy.utils.iterators.xmliter_lxml` no longer resolves XML
    entities. (:issue:`6265`)

-   defusedxml_ is now used to make
    :class:`scrapy.http.request.rpc.XmlRpcRequest` more secure.
    (:issue:`6250`, :issue:`6251`)

    .. _defusedxml: https://github.com/tiran/defusedxml

Deprecations
~~~~~~~~~~~~

-   ``scrapy.spidermiddlewares.offsite.OffsiteMiddleware`` (a spider
    middleware) is now deprecated and not enabled by default. The new
    downloader middleware with the same functionality,
    :class:`scrapy.downloadermiddlewares.offsite.OffsiteMiddleware`, is enabled
    instead.
    (:issue:`2241`, :issue:`6358`)

Bug fixes
~~~~~~~~~

-   Restored support for brotlipy_, which had been dropped in Scrapy 2.11.1 in
    favor of brotli_. (:issue:`6261`)

    .. note:: brotlipy is deprecated, both in Scrapy and upstream. Use brotli
        instead if you can.

-   :setting:`METAREFRESH_IGNORE_TAGS` is now ``["noscript"]`` by default.
    This prevents
    :class:`~scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware`
    from following redirects that would not be followed by web browsers with
    JavaScript enabled. (:issue:`6342`, :issue:`6347`)

-   :ref:`Built-in post-processing plugins <builtin-plugins>` no longer close
    the underlying file during :ref:`feed export <topics-feed-exports>`.
    (:issue:`5932`, :issue:`6178`, :issue:`6239`)

-   :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
    now properly applies the ``unique`` and ``canonicalize`` parameters.
    (:issue:`3273`, :issue:`6221`)

-   The scheduler disk queue is no longer initialized if :setting:`JOBDIR` is
    an empty string. (:issue:`6121`, :issue:`6124`)

-   Fixed :attr:`Spider.logger <scrapy.Spider.logger>` not logging custom
    extra information. (:issue:`6323`, :issue:`6324`)

-   ``robots.txt`` files with a non-UTF-8 encoding no longer prevent parsing
    the UTF-8-compatible (e.g. ASCII) parts of the document.
    (:issue:`6292`, :issue:`6298`)

-   :meth:`scrapy.http.cookies.WrappedRequest.get_header` no longer raises an
    exception if ``default`` is ``None``.
    (:issue:`6308`, :issue:`6310`)

-   :class:`~scrapy.Selector` now uses
    :func:`scrapy.utils.response.get_base_url` to determine the base URL of a
    given :class:`~scrapy.http.Response`. (:issue:`6265`)

-   The :meth:`media_to_download` method of :ref:`media pipelines
    <topics-media-pipeline>` now logs exceptions before stripping them.
    (:issue:`5067`, :issue:`5068`)

-   When passing a callback to the :command:`parse` command, the callback
    callable is now built with the right signature.
    (:issue:`6182`)

Documentation
~~~~~~~~~~~~~

-   Added a FAQ entry about :ref:`creating blank requests <faq-blank-request>`.
    (:issue:`6203`, :issue:`6208`)

-   Documented that :attr:`scrapy.Selector.type` can be ``"json"``.
    (:issue:`6328`, :issue:`6334`)

Quality assurance
~~~~~~~~~~~~~~~~~

-   Made builds reproducible. (:issue:`5019`, :issue:`6322`)

-   Packaging and test fixes.
    (:issue:`6286`, :issue:`6290`, :issue:`6312`, :issue:`6316`, :issue:`6344`)

.. _release-2.11.1:

Scrapy 2.11.1 (2024-02-14)
--------------------------

Highlights:

-   Security bug fixes.

-   Support for Twisted >= 23.8.0.

-   Documentation improvements.

Security bug fixes
~~~~~~~~~~~~~~~~~~

-   Addressed `ReDoS vulnerabilities`_:

    -   ``scrapy.utils.iterators.xmliter`` is now deprecated in favor of
        :func:`~scrapy.utils.iterators.xmliter_lxml`, which
        :class:`~scrapy.spiders.XMLFeedSpider` now uses.

        To minimize the impact of this change on existing code,
        :func:`~scrapy.utils.iterators.xmliter_lxml` now supports indicating
        the node namespace with a prefix in the node name, and big files with
        highly nested trees when using libxml2 2.7+.

    -   Fixed regular expressions in the implementation of the
        :func:`~scrapy.utils.response.open_in_browser` function.

    Please see the `cc65-xxvf-f7r9 security advisory`_ for more information.

    .. _ReDoS vulnerabilities: https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
    .. _cc65-xxvf-f7r9 security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-cc65-xxvf-f7r9

-   :setting:`DOWNLOAD_MAXSIZE` and :setting:`DOWNLOAD_WARNSIZE` now also
    apply to the decompressed response body. Please see the `7j7m-v7m3-jqm7
    security advisory`_ for more information.

    .. _7j7m-v7m3-jqm7 security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-7j7m-v7m3-jqm7

-   Also in relation with the `7j7m-v7m3-jqm7 security advisory`_, the
    deprecated ``scrapy.downloadermiddlewares.decompression`` module has been
    removed.

-   The ``Authorization`` header is now dropped on redirects to a different
    domain. Please see the `cw9j-q3vf-hrrv security advisory`_ for more
    information.

    .. _cw9j-q3vf-hrrv security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-cw9j-q3vf-hrrv

Modified requirements
~~~~~~~~~~~~~~~~~~~~~

-   The Twisted dependency is no longer restricted to < 23.8.0. (:issue:`6024`,
    :issue:`6064`, :issue:`6142`)

Bug fixes
~~~~~~~~~

-   The OS signal handling code was refactored to no longer use private Twisted
    functions. (:issue:`6024`, :issue:`6064`, :issue:`6112`)

Documentation
~~~~~~~~~~~~~

-   Improved documentation for :class:`~scrapy.crawler.Crawler` initialization
    changes made in the 2.11.0 release. (:issue:`6057`, :issue:`6147`)

-   Extended documentation for :attr:`.Request.meta`.
    (:issue:`5565`)

-   Fixed the :reqmeta:`dont_merge_cookies` documentation. (:issue:`5936`,
    :issue:`6077`)

-   Added a link to Zyte's export guides to the :ref:`feed exports
    <topics-feed-exports>` documentation. (:issue:`6183`)

-   Added a missing note about backward-incompatible changes in
    :class:`~scrapy.exporters.PythonItemExporter` to the 2.11.0 release notes.
    (:issue:`6060`, :issue:`6081`)

-   Added a missing note about removing the deprecated
    ``scrapy.utils.boto.is_botocore()`` function to the 2.8.0 release notes.
    (:issue:`6056`, :issue:`6061`)

-   Other documentation improvements. (:issue:`6128`, :issue:`6144`,
    :issue:`6163`, :issue:`6190`, :issue:`6192`)

Quality assurance
~~~~~~~~~~~~~~~~~

-   Added Python 3.12 to the CI configuration and re-enabled tests that were
    disabled when the pre-release support was added. (:issue:`5985`,
    :issue:`6083`, :issue:`6098`)

-   Fixed a test issue on PyPy 7.3.14. (:issue:`6204`, :issue:`6205`)

.. _release-2.11.0:

Scrapy 2.11.0 (2023-09-18)
--------------------------

Highlights:

-   Spiders can now modify :ref:`settings <topics-settings>` in their
    :meth:`~scrapy.Spider.from_crawler` methods, e.g. based on :ref:`spider
    arguments <spiderargs>`.

-   Periodic logging of stats.

Backward-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-   Most of the initialization of :class:`scrapy.crawler.Crawler` instances is
    now done in :meth:`~scrapy.crawler.Crawler.crawl`, so the state of
    instances before that method is called is now different compared to older
    Scrapy versions. We do not recommend using the
    :class:`~scrapy.crawler.Crawler` instances before
    :meth:`~scrapy.crawler.Crawler.crawl` is called. (:issue:`6038`)

-   :meth:`scrapy.Spider.from_crawler` is now called before the initialization
    of various components previously initialized in
    :meth:`scrapy.crawler.Crawler.__init__` and before the settings are
    finalized and frozen. This change was needed to allow changing the settings
    in :meth:`scrapy.Spider.from_crawler`. If you want to access the final
    setting values and the initialized :class:`~scrapy.crawler.Crawler`
    attributes in the spider code as early as possible, you can do this in
    ``scrapy.Spider.start_requests()`` or in a handler of the
    :signal:`engine_started` signal. (:issue:`6038`)

-   The :meth:`TextResponse.json <scrapy.http.TextResponse.json>` method now
    requires the response to be in a valid JSON encoding (UTF-8, UTF-16, or
    UTF-32). If you need to deal with JSON documents in an invalid encoding,
    use ``json.loads(response.text)`` instead. (:issue:`6016`)

-   :class:`~scrapy.exporters.PythonItemExporter` used binary output by
    default; it no longer does. (:issue:`6006`, :issue:`6007`)

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

-   Removed the binary export mode of
    :class:`~scrapy.exporters.PythonItemExporter`, deprecated in Scrapy 1.1.0.
    (:issue:`6006`, :issue:`6007`)

    .. note:: If you are using this Scrapy version on Scrapy Cloud with a stack
              that includes an older Scrapy version and get a "TypeError:
              Unexpected options: binary" error, you may need to add
              ``scrapinghub-entrypoint-scrapy >= 0.14.1`` to your project
              requirements or switch to a stack that includes Scrapy 2.11.

-   Removed the ``CrawlerRunner.spiders`` attribute, deprecated in Scrapy
    1.0.0, use :attr:`CrawlerRunner.spider_loader
    <scrapy.crawler.CrawlerRunner.spider_loader>` instead. (:issue:`6010`)

-   The :func:`scrapy.utils.response.response_httprepr` function, deprecated in
    Scrapy 2.6.0, has now been removed. (:issue:`6111`)

Deprecations
~~~~~~~~~~~~

-   Running :meth:`~scrapy.crawler.Crawler.crawl` more than once on the same
    :class:`scrapy.crawler.Crawler` instance is now deprecated. (:issue:`1587`,
    :issue:`6040`)

New features
~~~~~~~~~~~~

-   Spiders can now modify settings in their
    :meth:`~scrapy.Spider.from_crawler` method, e.g. based on :ref:`spider
    arguments <spiderargs>`. (:issue:`1305`, :issue:`1580`, :issue:`2392`,
    :issue:`3663`, :issue:`6038`)
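
    For example (a minimal sketch; ``fast`` is a hypothetical spider
    argument):

    .. code-block:: python

        import scrapy


        class MySpider(scrapy.Spider):
            name = "my_spider"

            @classmethod
            def from_crawler(cls, crawler, *args, **kwargs):
                spider = super().from_crawler(crawler, *args, **kwargs)
                if getattr(spider, "fast", False):
                    crawler.settings.set(
                        "CONCURRENT_REQUESTS", 32, priority="spider"
                    )
                return spider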

-   Added the :class:`~scrapy.extensions.periodic_log.PeriodicLog` extension
    which can be enabled to log stats and/or their differences periodically.
    (:issue:`5926`)

-   Optimized the memory usage in :meth:`TextResponse.json
    <scrapy.http.TextResponse.json>` by removing unnecessary body decoding.
    (:issue:`5968`, :issue:`6016`)

-   Links to ``.webp`` files are now ignored by :ref:`link extractors
    <topics-link-extractors>`. (:issue:`6021`)

Bug fixes
~~~~~~~~~

-   Fixed the logging of enabled add-ons. (:issue:`6036`)

-   Fixed :class:`~scrapy.mail.MailSender` producing invalid message bodies
    when the ``charset`` argument is passed to
    :meth:`~scrapy.mail.MailSender.send`. (:issue:`5096`, :issue:`5118`)

-   Fixed an exception when accessing ``self.EXCEPTIONS_TO_RETRY`` from a
    subclass of :class:`~scrapy.downloadermiddlewares.retry.RetryMiddleware`.
    (:issue:`6049`, :issue:`6050`)

-   :meth:`scrapy.settings.BaseSettings.getdictorlist`, used to parse
    :setting:`FEED_EXPORT_FIELDS`, now handles tuple values. (:issue:`6011`,
    :issue:`6013`)

-   Calls to ``datetime.utcnow()``, which is no longer recommended, have been
    replaced with calls to ``datetime.now()`` with a timezone. (:issue:`6014`)

Documentation
~~~~~~~~~~~~~

-   Updated a deprecated function call in a pipeline example. (:issue:`6008`,
    :issue:`6009`)

Quality assurance
~~~~~~~~~~~~~~~~~

-   Extended typing hints. (:issue:`6003`, :issue:`6005`, :issue:`6031`,
    :issue:`6034`)

-   Pinned brotli_ to 1.0.9 for the PyPy tests as 1.1.0 breaks them.
    (:issue:`6044`, :issue:`6045`)

-   Other CI and pre-commit improvements. (:issue:`6002`, :issue:`6013`,
    :issue:`6046`)

.. _release-2.10.1:

Scrapy 2.10.1 (2023-08-30)
--------------------------

Marked ``Twisted >= 23.8.0`` as unsupported. (:issue:`6024`, :issue:`6026`)

.. _release-2.10.0:

Scrapy 2.10.0 (2023-08-04)
--------------------------

Highlights:

-   Added Python 3.12 support, dropped Python 3.7 support.

-   The new add-ons framework simplifies configuring 3rd-party components that
    support it.

-   Exceptions to retry can now be configured.

-   Many fixes and improvements for feed exports.

Modified requirements
~~~~~~~~~~~~~~~~~~~~~

-   Dropped support for Python 3.7. (:issue:`5953`)

-   Added support for the upcoming Python 3.12. (:issue:`5984`)

-   Minimum versions increased for these dependencies:

    -   lxml_: 4.3.0 → 4.4.1

    -   cryptography_: 3.4.6 → 36.0.0

-   ``pkg_resources`` is no longer used. (:issue:`5956`, :issue:`5958`)

-   boto3_ is now recommended instead of botocore_ for exporting to S3.
    (:issue:`5833`)

Backward-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-   The value of the :setting:`FEED_STORE_EMPTY` setting is now ``True``
    instead of ``False``. In earlier Scrapy versions empty files were created
    even when this setting was ``False`` (which was a bug that is now fixed),
    so the new default should keep the old behavior. (:issue:`872`,
    :issue:`5847`)

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

-   Functions assigned to the :setting:`FEED_URI_PARAMS` setting may no
    longer return ``None`` or modify the ``params`` input parameter; both
    behaviors were deprecated in Scrapy 2.6. (:issue:`5994`, :issue:`5996`)

-   The ``scrapy.utils.reqser`` module, deprecated in Scrapy 2.6, is removed.
    (:issue:`5994`, :issue:`5996`)

-   The ``scrapy.squeues`` classes ``PickleFifoDiskQueueNonRequest``,
    ``PickleLifoDiskQueueNonRequest``, ``MarshalFifoDiskQueueNonRequest``,
    and ``MarshalLifoDiskQueueNonRequest``, deprecated in
    Scrapy 2.6, are removed. (:issue:`5994`, :issue:`5996`)

-   The property ``open_spiders`` and the methods ``has_capacity`` and
    ``schedule`` of :class:`scrapy.core.engine.ExecutionEngine`,
    deprecated in Scrapy 2.6, are removed. (:issue:`5994`, :issue:`5998`)

-   Passing a ``spider`` argument to the
    :meth:`~scrapy.core.engine.ExecutionEngine.spider_is_idle`,
    :meth:`~scrapy.core.engine.ExecutionEngine.crawl` and
    :meth:`~scrapy.core.engine.ExecutionEngine.download` methods of
    :class:`scrapy.core.engine.ExecutionEngine`, deprecated in Scrapy 2.6, is
    no longer supported. (:issue:`5994`, :issue:`5998`)

Deprecations
~~~~~~~~~~~~

-   :class:`scrapy.utils.datatypes.CaselessDict` is deprecated, use
    :class:`scrapy.utils.datatypes.CaseInsensitiveDict` instead.
    (:issue:`5146`)

-   Passing the ``custom`` argument to
    :func:`scrapy.utils.conf.build_component_list` is deprecated. It was used
    in the past to merge ``FOO`` and ``FOO_BASE`` setting values, but now
    Scrapy uses :func:`scrapy.settings.BaseSettings.getwithbase` to do the
    same. Code that uses this argument and cannot be switched to
    ``getwithbase()`` can be switched to merging the values explicitly.
    (:issue:`5726`, :issue:`5923`)

New features
~~~~~~~~~~~~

-   Added support for :ref:`Scrapy add-ons <topics-addons>`. (:issue:`5950`)

-   Added the :setting:`RETRY_EXCEPTIONS` setting that configures which
    exceptions will be retried by
    :class:`~scrapy.downloadermiddlewares.retry.RetryMiddleware`.
    (:issue:`2701`, :issue:`5929`)
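
    For example, in ``settings.py`` (a minimal sketch; it assumes that both
    exception classes and import path strings are accepted):

    .. code-block:: python

        RETRY_EXCEPTIONS = [
            OSError,
            "twisted.internet.defer.TimeoutError",
        ]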

-   Added the possibility to close the spider if no items were produced in
    the specified time, configured by :setting:`CLOSESPIDER_TIMEOUT_NO_ITEM`.
    (:issue:`5979`)

-   Added support for the :setting:`AWS_REGION_NAME` setting to feed exports.
    (:issue:`5980`)

-   Added support for using :class:`pathlib.Path` objects that refer to
    absolute Windows paths in the :setting:`FEEDS` setting. (:issue:`5939`)

Bug fixes
~~~~~~~~~

-   Fixed creating empty feeds even with ``FEED_STORE_EMPTY=False``.
    (:issue:`872`, :issue:`5847`)

-   Fixed using absolute Windows paths when specifying output files.
    (:issue:`5969`, :issue:`5971`)

-   Fixed problems with uploading large files to S3 by switching to multipart
    uploads (requires boto3_). (:issue:`960`, :issue:`5735`, :issue:`5833`)

-   Fixed the JSON exporter writing extra commas when some exceptions occur.
    (:issue:`3090`, :issue:`5952`)

-   Fixed the "read of closed file" error in the CSV exporter. (:issue:`5043`,
    :issue:`5705`)

-   Fixed an error when a component added as a class object (rather than as
    an import path string) raises :exc:`~scrapy.exceptions.NotConfigured`
    with a message. (:issue:`5950`, :issue:`5992`)

-   Added the missing :meth:`scrapy.settings.BaseSettings.pop` method.
    (:issue:`5959`, :issue:`5960`, :issue:`5963`)

-   Added :class:`~scrapy.utils.datatypes.CaseInsensitiveDict` as a replacement
    for :class:`~scrapy.utils.datatypes.CaselessDict` that fixes some API
    inconsistencies. (:issue:`5146`)

Documentation
~~~~~~~~~~~~~

-   Documented :meth:`scrapy.Spider.update_settings`. (:issue:`5745`,
    :issue:`5846`)

-   Documented possible problems with early Twisted reactor installation and
    their solutions. (:issue:`5981`, :issue:`6000`)

-   Added examples of making additional requests in callbacks. (:issue:`5927`)

-   Improved the feed export docs. (:issue:`5579`, :issue:`5931`)

-   Clarified the docs about request objects on redirection. (:issue:`5707`,
    :issue:`5937`)

Quality assurance
~~~~~~~~~~~~~~~~~

-   Added support for running tests against the installed Scrapy version.
    (:issue:`4914`, :issue:`5949`)

-   Extended typing hints. (:issue:`5925`, :issue:`5977`)

-   Fixed the ``test_utils_asyncio.AsyncioTest.test_set_asyncio_event_loop``
    test. (:issue:`5951`)

-   Fixed the ``test_feedexport.BatchDeliveriesTest.test_batch_path_differ``
    test on Windows. (:issue:`5847`)

-   Enabled CI runs for Python 3.11 on Windows. (:issue:`5999`)

-   Simplified skipping tests that depend on ``uvloop``. (:issue:`5984`)

-   Fixed the ``extra-deps-pinned`` tox env. (:issue:`5948`)

-   Implemented cleanups. (:issue:`5965`, :issue:`5986`)

.. _release-2.9.0:

Scrapy 2.9.0 (2023-05-08)
-------------------------

Highlights:

-   Per-domain download settings.
-   Compatibility with new cryptography_ and new parsel_.
-   JMESPath selectors from the new parsel_.
-   Bug fixes.

Deprecations
~~~~~~~~~~~~

-   :class:`scrapy.extensions.feedexport._FeedSlot` is renamed to
    :class:`scrapy.extensions.feedexport.FeedSlot` and the old name is
    deprecated. (:issue:`5876`)

New features
~~~~~~~~~~~~

-   Settings corresponding to :setting:`DOWNLOAD_DELAY`,
    :setting:`CONCURRENT_REQUESTS_PER_DOMAIN` and
    :setting:`RANDOMIZE_DOWNLOAD_DELAY` can now be set on a per-domain basis
    via the new :setting:`DOWNLOAD_SLOTS` setting. (:issue:`5328`)
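
    For example, in ``settings.py`` (a minimal sketch; the domain and values
    are illustrative):

    .. code-block:: python

        DOWNLOAD_SLOTS = {
            "quotes.toscrape.com": {
                "concurrency": 1,
                "delay": 2,
                "randomize_delay": False,
            },
        }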

-   Added :meth:`.TextResponse.jmespath`, a shortcut for JMESPath selectors
    available since parsel_ 1.8.1. (:issue:`5894`, :issue:`5915`)
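
    For example, given a JSON response body like
    ``{"user": {"name": "Ada"}}`` (a minimal sketch):

    .. code-block:: python

        name = response.jmespath("user.name").get()  # "Ada"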

-   Added :signal:`feed_slot_closed` and :signal:`feed_exporter_closed`
    signals. (:issue:`5876`)

-   Added :func:`scrapy.utils.request.request_to_curl`, a function to produce a
    curl command from a :class:`~scrapy.Request` object. (:issue:`5892`)

-   Values of :setting:`FILES_STORE` and :setting:`IMAGES_STORE` can now be
    :class:`pathlib.Path` instances. (:issue:`5801`)

Bug fixes
~~~~~~~~~

-   Fixed a warning with Parsel 1.8.1+. (:issue:`5903`, :issue:`5918`)

-   Fixed an error when using feed postprocessing with S3 storage.
    (:issue:`5500`, :issue:`5581`)

-   Added the missing :meth:`scrapy.settings.BaseSettings.setdefault` method.
    (:issue:`5811`, :issue:`5821`)

-   Fixed an error when using cryptography_ 40.0.0+ and
    :setting:`DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING` is enabled.
    (:issue:`5857`, :issue:`5858`)

-   The checksums returned by :class:`~scrapy.pipelines.files.FilesPipeline`
    for files on Google Cloud Storage are no longer Base64-encoded.
    (:issue:`5874`, :issue:`5891`)

-   :func:`scrapy.utils.request.request_from_curl` now supports $-prefixed
    string values for the curl ``--data-raw`` argument, which are produced by
    browsers for data that includes certain symbols. (:issue:`5899`,
    :issue:`5901`)

-   The :command:`parse` command now also works with async generator callbacks.
    (:issue:`5819`, :issue:`5824`)

-   The :command:`genspider` command now properly works with HTTPS URLs.
    (:issue:`3553`, :issue:`5808`)

-   Improved handling of asyncio loops. (:issue:`5831`, :issue:`5832`)

-   :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
    now skips certain malformed URLs instead of raising an exception.
    (:issue:`5881`)

-   :func:`scrapy.utils.python.get_func_args` now supports more types of
    callables. (:issue:`5872`, :issue:`5885`)

-   Fixed an error when processing non-UTF8 values of ``Content-Type`` headers.
    (:issue:`5914`, :issue:`5917`)

-   Fixed an error breaking user handling of send failures in
    :meth:`scrapy.mail.MailSender.send`. (:issue:`1611`, :issue:`5880`)

Documentation
~~~~~~~~~~~~~

-   Expanded contributing docs. (:issue:`5109`, :issue:`5851`)

-   Added blacken-docs_ to pre-commit and reformatted the docs with it.
    (:issue:`5813`, :issue:`5816`)

-   Fixed a JS issue. (:issue:`5875`, :issue:`5877`)

-   Fixed ``make htmlview``. (:issue:`5878`, :issue:`5879`)

-   Fixed typos and other small errors. (:issue:`5827`, :issue:`5839`,
    :issue:`5883`, :issue:`5890`, :issue:`5895`, :issue:`5904`)

Quality assurance
~~~~~~~~~~~~~~~~~

-   Extended typing hints. (:issue:`5805`, :issue:`5889`, :issue:`5896`)

-   Tests for most of the examples in the docs are now run as a part of CI;
    problems found were fixed. (:issue:`5816`, :issue:`5826`, :issue:`5919`)

-   Removed usage of deprecated Python classes. (:issue:`5849`)

-   Silenced ``include-ignored`` warnings from coverage. (:issue:`5820`)

-   Fixed a random failure of the ``test_feedexport.test_batch_path_differ``
    test. (:issue:`5855`, :issue:`5898`)

-   Updated docstrings to match output produced by parsel_ 1.8.1 so that they
    don't cause test failures. (:issue:`5902`, :issue:`5919`)

-   Other CI and pre-commit improvements. (:issue:`5802`, :issue:`5823`,
    :issue:`5908`)

.. _blacken-docs: https://github.com/adamchainz/blacken-docs

.. _release-2.8.0:

Scrapy 2.8.0 (2023-02-02)
-------------------------

This is a maintenance release, with minor features, bug fixes, and cleanups.

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

-   The ``scrapy.utils.gz.read1`` function, deprecated in Scrapy 2.0, has now
    been removed. Use the :meth:`~io.BufferedIOBase.read1` method of
    :class:`~gzip.GzipFile` instead.
    (:issue:`5719`)

-   The ``scrapy.utils.python.to_native_str`` function, deprecated in Scrapy
    2.0, has now been removed. Use :func:`scrapy.utils.python.to_unicode`
    instead.
    (:issue:`5719`)

-   The ``scrapy.utils.python.MutableChain.next`` method, deprecated in Scrapy
    2.0, has now been removed. Use
    :meth:`~scrapy.utils.python.MutableChain.__next__` instead.
    (:issue:`5719`)

-   The ``scrapy.linkextractors.FilteringLinkExtractor`` class, deprecated
    in Scrapy 2.0, has now been removed. Use
    :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
    instead.
    (:issue:`5720`)

-   Support for using environment variables prefixed with ``SCRAPY_`` to
    override settings, deprecated in Scrapy 2.0, has now been removed.
    (:issue:`5724`)

-   Support for the ``noconnect`` query string argument in proxy URLs,
    deprecated in Scrapy 2.0, has now been removed. We expect proxies that used
    to need it to work fine without it.
    (:issue:`5731`)

-   The ``scrapy.utils.python.retry_on_eintr`` function, deprecated in Scrapy
    2.3, has now been removed.
    (:issue:`5719`)

-   The ``scrapy.utils.python.WeakKeyCache`` class, deprecated in Scrapy 2.4,
    has now been removed.
    (:issue:`5719`)

-   The ``scrapy.utils.boto.is_botocore()`` function, deprecated in Scrapy 2.4,
    has now been removed.
    (:issue:`5719`)

Deprecations
~~~~~~~~~~~~

-   :exc:`scrapy.pipelines.images.NoimagesDrop` is now deprecated.
    (:issue:`5368`, :issue:`5489`)

-   :meth:`ImagesPipeline.convert_image
    <scrapy.pipelines.images.ImagesPipeline.convert_image>` must now accept a
    ``response_body`` parameter.
    (:issue:`3055`, :issue:`3689`, :issue:`4753`)

New features
~~~~~~~~~~~~

-   Applied black_ coding style to files generated with the
    :command:`genspider` and :command:`startproject` commands.
    (:issue:`5809`, :issue:`5814`)

    .. _black: https://black.readthedocs.io/en/stable/

-   :setting:`FEED_EXPORT_ENCODING` is now set to ``"utf-8"`` in the
    ``settings.py`` file that the :command:`startproject` command generates.
    With this value, JSON exports won’t force the use of escape sequences for
    non-ASCII characters.
    (:issue:`5797`, :issue:`5800`)

-   The :class:`~scrapy.extensions.memusage.MemoryUsage` extension now logs the
    peak memory usage during checks, and the binary unit MiB is now used to
    avoid confusion.
    (:issue:`5717`, :issue:`5722`, :issue:`5727`)

-   The ``callback`` parameter of :class:`~scrapy.Request` can now be set
    to :func:`scrapy.http.request.NO_CALLBACK`, to distinguish it from
    ``None``, as the latter indicates that the default spider callback
    (:meth:`~scrapy.Spider.parse`) is to be used.
    (:issue:`5798`)
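
    For example (a minimal sketch):

    .. code-block:: python

        from scrapy import Request
        from scrapy.http.request import NO_CALLBACK

        # The response will not be handled by any spider callback, not even
        # the default parse() method.
        request = Request("https://example.com", callback=NO_CALLBACK)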

Bug fixes
~~~~~~~~~

-   Enabled unsafe legacy SSL renegotiation to fix access to some outdated
    websites.
    (:issue:`5491`, :issue:`5790`)

-   Fixed STARTTLS-based email delivery not working with Twisted 21.2.0 and
    later.
    (:issue:`5386`, :issue:`5406`)

-   Fixed the :meth:`finish_exporting` method of :ref:`item exporters
    <topics-exporters>` not being called for empty files.
    (:issue:`5537`, :issue:`5758`)

-   Fixed HTTP/2 responses getting only the last value for a header when
    multiple headers with the same name are received.
    (:issue:`5777`)

-   Fixed an exception raised by the :command:`shell` command in some cases
    when :ref:`using asyncio <using-asyncio>`.
    (:issue:`5740`, :issue:`5742`, :issue:`5748`, :issue:`5759`, :issue:`5760`,
    :issue:`5771`)

-   When using :class:`~scrapy.spiders.CrawlSpider`, callback keyword arguments
    (``cb_kwargs``) added to a request in the ``process_request`` callback of a
    :class:`~scrapy.spiders.Rule` will no longer be ignored.
    (:issue:`5699`)

-   The :ref:`images pipeline <images-pipeline>` no longer re-encodes JPEG
    files.
    (:issue:`3055`, :issue:`3689`, :issue:`4753`)

-   Fixed the handling of transparent WebP images by the :ref:`images pipeline
    <images-pipeline>`.
    (:issue:`3072`, :issue:`5766`, :issue:`5767`)

-   :func:`scrapy.shell.inspect_response` no longer inhibits ``SIGINT``
    (Ctrl+C).
    (:issue:`2918`)

-   :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
    with ``unique=False`` no longer filters out links that have identical URL
    *and* text.
    (:issue:`3798`, :issue:`3799`, :issue:`4695`, :issue:`5458`)

-   :class:`~scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware` now
    ignores URL protocols that do not support ``robots.txt`` (``data://``,
    ``file://``).
    (:issue:`5807`)

-   Silenced the ``filelock`` debug log messages introduced in Scrapy 2.6.
    (:issue:`5753`, :issue:`5754`)

-   Fixed the output of ``scrapy -h`` showing an unintended ``**commands**``
    line.
    (:issue:`5709`, :issue:`5711`, :issue:`5712`)

-   Made the active project indication in the output of :ref:`commands
    <topics-commands>` clearer.
    (:issue:`5715`)

Documentation
~~~~~~~~~~~~~

-   Documented how to :ref:`debug spiders from Visual Studio Code
    <debug-vscode>`.
    (:issue:`5721`)

-   Documented how :setting:`DOWNLOAD_DELAY` affects per-domain concurrency.
    (:issue:`5083`, :issue:`5540`)

-   Improved consistency.
    (:issue:`5761`)

-   Fixed typos.
    (:issue:`5714`, :issue:`5744`, :issue:`5764`)

Quality assurance
~~~~~~~~~~~~~~~~~

-   Applied :ref:`black coding style <coding-style>`, sorted import statements,
    and introduced :ref:`pre-commit <scrapy-pre-commit>`.
    (:issue:`4654`, :issue:`4658`, :issue:`5734`, :issue:`5737`, :issue:`5806`,
    :issue:`5810`)

-   Switched from :mod:`os.path` to :mod:`pathlib`.
    (:issue:`4916`, :issue:`4497`, :issue:`5682`)

-   Addressed many issues reported by Pylint.
    (:issue:`5677`)

-   Improved code readability.
    (:issue:`5736`)

-   Improved package metadata.
    (:issue:`5768`)

-   Removed direct invocations of ``setup.py``.
    (:issue:`5774`, :issue:`5776`)

-   Removed unnecessary :class:`~collections.OrderedDict` usages.
    (:issue:`5795`)

-   Removed unnecessary ``__str__`` definitions.
    (:issue:`5150`)

-   Removed obsolete code and comments.
    (:issue:`5725`, :issue:`5729`, :issue:`5730`, :issue:`5732`)

-   Fixed test and CI issues.
    (:issue:`5749`, :issue:`5750`, :issue:`5756`, :issue:`5762`, :issue:`5765`,
    :issue:`5780`, :issue:`5781`, :issue:`5782`, :issue:`5783`, :issue:`5785`,
    :issue:`5786`)

.. _release-2.7.1:

Scrapy 2.7.1 (2022-11-02)
-------------------------

New features
~~~~~~~~~~~~

-   Relaxed the restriction introduced in 2.6.2 so that the
    ``Proxy-Authorization`` header can again be set explicitly, as long as the
    proxy URL in the :reqmeta:`proxy` metadata has no other credentials, and
    for as long as that proxy URL remains the same; this restores compatibility
    with scrapy-zyte-smartproxy 2.1.0 and older (:issue:`5626`).

Bug fixes
~~~~~~~~~

-   Using ``-O``/``--overwrite-output`` and ``-t``/``--output-format`` options
    together now produces an error instead of ignoring the former option
    (:issue:`5516`, :issue:`5605`).

-   Replaced deprecated :mod:`asyncio` APIs that implicitly use the current
    event loop with code that explicitly requests a loop from the event loop
    policy (:issue:`5685`, :issue:`5689`).

-   Fixed uses of deprecated Scrapy APIs in Scrapy itself (:issue:`5588`,
    :issue:`5589`).

-   Fixed uses of a deprecated Pillow API (:issue:`5684`, :issue:`5692`).

-   Improved code that checks if generators return values, so that it no longer
    fails on decorated methods and partial methods (:issue:`5323`,
    :issue:`5592`, :issue:`5599`, :issue:`5691`).

Documentation
~~~~~~~~~~~~~

-   Upgraded the Code of Conduct to Contributor Covenant v2.1 (:issue:`5698`).

-   Fixed typos (:issue:`5681`, :issue:`5694`).

Quality assurance
~~~~~~~~~~~~~~~~~

-   Re-enabled some erroneously disabled flake8 checks (:issue:`5688`).

-   Ignored harmless deprecation warnings from :mod:`typing` in tests
    (:issue:`5686`, :issue:`5697`).

-   Modernized our CI configuration (:issue:`5695`, :issue:`5696`).

.. _release-2.7.0:

Scrapy 2.7.0 (2022-10-17)
-------------------------

Highlights:

-   Added Python 3.11 support, dropped Python 3.6 support
-   Improved support for :ref:`asynchronous callbacks <topics-coroutines>`
-   :ref:`Asyncio support <using-asyncio>` is enabled by default on new
    projects
-   Output names of item fields can now be arbitrary strings
-   Centralized :ref:`request fingerprinting <request-fingerprints>`
    configuration is now possible

Modified requirements
~~~~~~~~~~~~~~~~~~~~~

Python 3.7 or greater is now required; support for Python 3.6 has been dropped.
Support for the upcoming Python 3.11 has been added.

The minimum required version of some dependencies has changed as well:

-   lxml_: 3.5.0 → 4.3.0

-   Pillow_ (:ref:`images pipeline <images-pipeline>`): 4.0.0 → 7.1.0

-   zope.interface_: 5.0.0 → 5.1.0

(:issue:`5512`, :issue:`5514`, :issue:`5524`, :issue:`5563`, :issue:`5664`,
:issue:`5670`, :issue:`5678`)

Deprecations
~~~~~~~~~~~~

-   :meth:`ImagesPipeline.thumb_path
    <scrapy.pipelines.images.ImagesPipeline.thumb_path>` must now accept an
    ``item`` parameter (:issue:`5504`, :issue:`5508`).

-   The ``scrapy.downloadermiddlewares.decompression`` module is now
    deprecated (:issue:`5546`, :issue:`5547`).

New features
~~~~~~~~~~~~

-   The
    :meth:`~scrapy.spidermiddlewares.SpiderMiddleware.process_spider_output`
    method of :ref:`spider middlewares <topics-spider-middleware>` can now be
    defined as an :term:`asynchronous generator` (:issue:`4978`).

-   The output of :class:`~scrapy.Request` callbacks defined as
    :ref:`coroutines <topics-coroutines>` is now processed asynchronously
    (:issue:`4978`).

-   :class:`~scrapy.spiders.crawl.CrawlSpider` now supports :ref:`asynchronous
    callbacks <topics-coroutines>` (:issue:`5657`).

-   New projects created with the :command:`startproject` command have
    :ref:`asyncio support <using-asyncio>` enabled by default (:issue:`5590`,
    :issue:`5679`).

-   The :setting:`FEED_EXPORT_FIELDS` setting can now be defined as a
    dictionary to customize the output name of item fields, lifting the
    restriction that required output names to be valid Python identifiers,
    e.g. preventing them from containing whitespace (:issue:`1008`,
    :issue:`3266`, :issue:`3696`).
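
    For example, in ``settings.py`` (a minimal sketch; the field and output
    names are illustrative):

    .. code-block:: python

        FEED_EXPORT_FIELDS = {
            "name": "Name",
            "price_usd": "Price (USD)",  # output names may contain whitespace
        }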

-   You can now customize :ref:`request fingerprinting <request-fingerprints>`
    through the new :setting:`REQUEST_FINGERPRINTER_CLASS` setting, instead of
    having to change it on every Scrapy component that relies on request
    fingerprinting (:issue:`900`, :issue:`3420`, :issue:`4113`, :issue:`4762`,
    :issue:`4524`).

-   ``jsonl`` is now supported and encouraged as a file extension for `JSON
    Lines`_ files (:issue:`4848`).

    .. _JSON Lines: https://jsonlines.org/

-   :meth:`ImagesPipeline.thumb_path
    <scrapy.pipelines.images.ImagesPipeline.thumb_path>` now receives the
    source :ref:`item <topics-items>` (:issue:`5504`, :issue:`5508`).

Bug fixes
~~~~~~~~~

-   When using Google Cloud Storage with a :ref:`media pipeline
    <topics-media-pipeline>`, :setting:`FILES_EXPIRES` now also works when
    :setting:`FILES_STORE` does not point at the root of your Google Cloud
    Storage bucket (:issue:`5317`, :issue:`5318`).

-   The :command:`parse` command now supports :ref:`asynchronous callbacks
    <topics-coroutines>` (:issue:`5424`, :issue:`5577`).

-   When using the :command:`parse` command with a URL for which there is no
    available spider, an exception is no longer raised (:issue:`3264`,
    :issue:`3265`, :issue:`5375`, :issue:`5376`, :issue:`5497`).

-   :class:`~scrapy.http.TextResponse` now gives higher priority to the `byte
    order mark`_ when determining the text encoding of the response body,
    following the `HTML living standard`_ (:issue:`5601`, :issue:`5611`).

    .. _byte order mark: https://en.wikipedia.org/wiki/Byte_order_mark
    .. _HTML living standard: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding

-   MIME sniffing now takes the response body into account for FTP and
    HTTP/1.0 requests, as well as for cached requests (:issue:`4873`).

-   MIME sniffing now detects valid HTML 5 documents even if the ``html`` tag
    is missing (:issue:`4873`).

-   An exception is now raised if :setting:`ASYNCIO_EVENT_LOOP` has a value
    that does not match the asyncio event loop actually installed
    (:issue:`5529`).

-   Fixed :meth:`Headers.getlist() <scrapy.http.headers.Headers.getlist>`
    returning only the last header (:issue:`5515`, :issue:`5526`).

-   Fixed :class:`LinkExtractor
    <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>` not ignoring the
    ``tar.gz`` file extension by default (:issue:`1837`, :issue:`2067`,
    :issue:`4066`).

Documentation
~~~~~~~~~~~~~

-   Clarified the return type of :meth:`Spider.parse <scrapy.Spider.parse>`
    (:issue:`5602`, :issue:`5608`).

-   To enable
    :class:`~scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware`
    to do `brotli compression`_, installing brotli_ is now recommended instead
    of installing brotlipy_, as the former provides a more recent version of
    brotli.

    .. _brotli: https://github.com/google/brotli
    .. _brotli compression: https://www.ietf.org/rfc/rfc7932.txt

-   :ref:`Signal documentation <topics-signals>` now mentions :ref:`coroutine
    support <topics-coroutines>` and uses it in code examples (:issue:`4852`,
    :issue:`5358`).

-   :ref:`bans` now recommends `Common Crawl`_ instead of `Google cache`_
    (:issue:`3582`, :issue:`5432`).

    .. _Common Crawl: https://commoncrawl.org/
    .. _Google cache: https://www.googleguide.com/cached_pages.html

-   The new :ref:`topics-components` topic covers enforcing requirements on
    Scrapy components, like :ref:`downloader middlewares
    <topics-downloader-middleware>`, :ref:`extensions <topics-extensions>`,
    :ref:`item pipelines <topics-item-pipeline>`, :ref:`spider middlewares
    <topics-spider-middleware>`, and more; :ref:`enforce-asyncio-requirement`
    has also been added (:issue:`4978`).

-   :ref:`topics-settings` now indicates that setting values must be
    :ref:`picklable <pickle-picklable>` (:issue:`5607`, :issue:`5629`).

-   Removed outdated documentation (:issue:`5446`, :issue:`5373`,
    :issue:`5369`, :issue:`5370`, :issue:`5554`).

-   Fixed typos (:issue:`5442`, :issue:`5455`, :issue:`5457`, :issue:`5461`,
    :issue:`5538`, :issue:`5553`, :issue:`5558`, :issue:`5624`, :issue:`5631`).

-   Fixed other issues (:issue:`5283`, :issue:`5284`, :issue:`5559`,
    :issue:`5567`, :issue:`5648`, :issue:`5659`, :issue:`5665`).

Quality assurance
~~~~~~~~~~~~~~~~~

-   Added a continuous integration job to run `twine check`_ (:issue:`5655`,
    :issue:`5656`).

    .. _twine check: https://twine.readthedocs.io/en/stable/#twine-check

-   Addressed test issues and warnings (:issue:`5560`, :issue:`5561`,
    :issue:`5612`, :issue:`5617`, :issue:`5639`, :issue:`5645`, :issue:`5662`,
    :issue:`5671`, :issue:`5675`).

-   Cleaned up code (:issue:`4991`, :issue:`4995`, :issue:`5451`,
    :issue:`5487`, :issue:`5542`, :issue:`5667`, :issue:`5668`, :issue:`5672`).

-   Applied minor code improvements (:issue:`5661`).

.. _release-2.6.3:

Scrapy 2.6.3 (2022-09-27)
-------------------------

-   Added support for pyOpenSSL_ 22.1.0, removing support for SSLv3
    (:issue:`5634`, :issue:`5635`, :issue:`5636`).

-   Upgraded the minimum versions of the following dependencies:

    -   cryptography_: 2.0 → 3.3

    -   pyOpenSSL_: 16.2.0 → 21.0.0

    -   service_identity_: 16.0.0 → 18.1.0

    -   Twisted_: 17.9.0 → 18.9.0

    -   zope.interface_: 4.1.3 → 5.0.0

    (:issue:`5621`, :issue:`5632`)

-   Fixed test and documentation issues (:issue:`5612`, :issue:`5617`,
    :issue:`5631`).

.. _release-2.6.2:

Scrapy 2.6.2 (2022-07-25)
-------------------------

**Security bug fix:**

-   When :class:`~scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware`
    processes a request with :reqmeta:`proxy` metadata, and that
    :reqmeta:`proxy` metadata includes proxy credentials,
    :class:`~scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware` sets
    the ``Proxy-Authorization`` header, but only if that header is not already
    set.

    There are third-party proxy-rotation downloader middlewares that set
    different :reqmeta:`proxy` metadata every time they process a request.

    Because of request retries and redirects, the same request can be processed
    by downloader middlewares more than once, including both
    :class:`~scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware` and
    any third-party proxy-rotation downloader middleware.

    These third-party proxy-rotation downloader middlewares could change the
    :reqmeta:`proxy` metadata of a request to a new value, but fail to remove
    the ``Proxy-Authorization`` header from the previous value of the
    :reqmeta:`proxy` metadata, causing the credentials of one proxy to be sent
    to a different proxy.

    To prevent the unintended leaking of proxy credentials, the behavior of
    :class:`~scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware` is now
    as follows when processing a request:

    -   If the request being processed defines :reqmeta:`proxy` metadata that
        includes credentials, the ``Proxy-Authorization`` header is always
        updated to feature those credentials.

    -   If the request being processed defines :reqmeta:`proxy` metadata
        without credentials, the ``Proxy-Authorization`` header is removed
        *unless* it was originally defined for the same proxy URL.

        To remove proxy credentials while keeping the same proxy URL, remove
        the ``Proxy-Authorization`` header.

    -   If the request has no :reqmeta:`proxy` metadata, or that metadata is a
        falsy value (e.g. ``None``), the ``Proxy-Authorization`` header is
        removed.

        It is no longer possible to set a proxy URL through the
        :reqmeta:`proxy` metadata but set the credentials through the
        ``Proxy-Authorization`` header. Set proxy credentials through the
        :reqmeta:`proxy` metadata instead.
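
    For example, a sketch of the supported way to set per-request proxy
    credentials, embedding them in the :reqmeta:`proxy` metadata (the proxy
    URL is hypothetical):

    .. code-block:: python

        import scrapy

        class MySpider(scrapy.Spider):
            name = "example"

            def start_requests(self):
                yield scrapy.Request(
                    "https://example.com",
                    meta={"proxy": "http://user:password@proxy.example.com:8080"},
                )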

Also fixes the following regressions introduced in 2.6.0:

-   :class:`~scrapy.crawler.CrawlerProcess` once again supports crawling
    multiple spiders (:issue:`5435`, :issue:`5436`)

-   Installing a Twisted reactor before Scrapy does (e.g. importing
    :mod:`twisted.internet.reactor` somewhere at the module level) no longer
    prevents Scrapy from starting, as long as a different reactor is not
    specified in :setting:`TWISTED_REACTOR` (:issue:`5525`, :issue:`5528`)

-   Fixed an exception that was being logged after the spider finished under
    certain conditions (:issue:`5437`, :issue:`5440`)

-   The ``--output``/``-o`` command-line parameter once again supports a value
    starting with a hyphen (:issue:`5444`, :issue:`5445`)

-   The ``scrapy parse -h`` command no longer throws an error (:issue:`5481`,
    :issue:`5482`)

.. _release-2.6.1:

Scrapy 2.6.1 (2022-03-01)
-------------------------

Fixes a regression introduced in 2.6.0 that would unset the request method when
following redirects.

.. _release-2.6.0:

Scrapy 2.6.0 (2022-03-01)
-------------------------

Highlights:

*   :ref:`Security fixes for cookie handling <2.6-security-fixes>`

*   Python 3.10 support

*   :ref:`asyncio support <using-asyncio>` is no longer considered
    experimental, and works out-of-the-box on Windows regardless of your Python
    version

*   Feed exports now support :class:`pathlib.Path` output paths and per-feed
    :ref:`item filtering <item-filter>` and
    :ref:`post-processing <post-processing>`

.. _2.6-security-fixes:

Security bug fixes
~~~~~~~~~~~~~~~~~~

-   When a :class:`~scrapy.Request` object with cookies defined gets a
    redirect response causing a new :class:`~scrapy.Request` object to be
    scheduled, the cookies defined in the original
    :class:`~scrapy.Request` object are no longer copied into the new
    :class:`~scrapy.Request` object.

    If you manually set the ``Cookie`` header on a
    :class:`~scrapy.Request` object and the domain name of the redirect
    URL is not an exact match for the domain of the URL of the original
    :class:`~scrapy.Request` object, your ``Cookie`` header is now dropped
    from the new :class:`~scrapy.Request` object.

    The old behavior could be exploited by an attacker to gain access to your
    cookies. Please, see the `cjvr-mfj7-j4j8 security advisory`_ for more
    information.

    .. _cjvr-mfj7-j4j8 security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-cjvr-mfj7-j4j8

    .. note:: It is still possible to enable the sharing of cookies between
              different domains with a shared domain suffix (e.g.
              ``example.com`` and any subdomain) by defining the shared domain
              suffix (e.g. ``example.com``) as the cookie domain when defining
              your cookies. See the documentation of the
              :class:`~scrapy.Request` class for more information.
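
    A sketch of that opt-in, defining the cookie ``domain`` explicitly so that
    the cookie is shared across ``example.com`` subdomains:

    .. code-block:: python

        import scrapy

        class MySpider(scrapy.Spider):
            name = "example"

            def start_requests(self):
                yield scrapy.Request(
                    "https://www.example.com",
                    cookies=[
                        {"name": "currency", "value": "USD", "domain": "example.com"},
                    ],
                )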

-   When the domain of a cookie, either received in the ``Set-Cookie`` header
    of a response or defined in a :class:`~scrapy.Request` object, is set
    to a `public suffix <https://publicsuffix.org/>`_, the cookie is now
    ignored unless the cookie domain is the same as the request domain.

    The old behavior could be exploited by an attacker to inject cookies from a
    controlled domain into your cookiejar that could be sent to other domains
    not controlled by the attacker. Please, see the `mfjm-vh54-3f96 security
    advisory`_ for more information.

    .. _mfjm-vh54-3f96 security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-mfjm-vh54-3f96

Modified requirements
~~~~~~~~~~~~~~~~~~~~~

-   The h2_ dependency is now optional, only needed to
    :ref:`enable HTTP/2 support <twisted-http2-handler>`. (:issue:`5113`)

    .. _h2: https://pypi.org/project/h2/

Backward-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-   The ``formdata`` parameter of :class:`~scrapy.FormRequest`, if specified
    for a non-POST request, now overrides the URL query string, instead of
    being appended to it. (:issue:`2919`, :issue:`3579`)

-   When a function is assigned to the :setting:`FEED_URI_PARAMS` setting, now
    the return value of that function, and not the ``params`` input parameter,
    will determine the feed URI parameters, unless that return value is
    ``None``. (:issue:`4962`, :issue:`4966`)

-   In :class:`scrapy.core.engine.ExecutionEngine`, methods
    :meth:`~scrapy.core.engine.ExecutionEngine.crawl`,
    :meth:`~scrapy.core.engine.ExecutionEngine.download`,
    :meth:`~scrapy.core.engine.ExecutionEngine.schedule`,
    and :meth:`~scrapy.core.engine.ExecutionEngine.spider_is_idle`
    now raise :exc:`RuntimeError` if called before
    :meth:`~scrapy.core.engine.ExecutionEngine.open_spider`. (:issue:`5090`)

    These methods used to assume that
    :attr:`ExecutionEngine.slot <scrapy.core.engine.ExecutionEngine.slot>` had
    been defined by a prior call to
    :meth:`~scrapy.core.engine.ExecutionEngine.open_spider`, so they were
    raising :exc:`AttributeError` instead.

-   If the API of the configured :ref:`scheduler <topics-scheduler>` does not
    meet expectations, :exc:`TypeError` is now raised at startup time. Before,
    other exceptions would be raised at run time. (:issue:`3559`)

-   The ``_encoding`` field of serialized :class:`~scrapy.Request` objects
    is now named ``encoding``, in line with all other fields. (:issue:`5130`)

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

-   ``scrapy.http.TextResponse.body_as_unicode``, deprecated in Scrapy 2.2, has
    now been removed. (:issue:`5393`)

-   ``scrapy.item.BaseItem``, deprecated in Scrapy 2.2, has now been removed.
    (:issue:`5398`)

-   ``scrapy.item.DictItem``, deprecated in Scrapy 1.8, has now been removed.
    (:issue:`5398`)

-   ``scrapy.Spider.make_requests_from_url``, deprecated in Scrapy 1.4, has now
    been removed. (:issue:`4178`, :issue:`4356`)

Deprecations
~~~~~~~~~~~~

-   When a function is assigned to the :setting:`FEED_URI_PARAMS` setting,
    returning ``None`` or modifying the ``params`` input parameter is now
    deprecated. Return a new dictionary instead. (:issue:`4962`, :issue:`4966`)
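
    A sketch of a compliant function (the module path and the extra parameter
    are hypothetical):

    .. code-block:: python

        # myproject/utils.py
        def uri_params(params, spider):
            # Return a new dictionary; do not modify ``params`` in place.
            return {**params, "spider_name": spider.name}

        # settings.py:
        # FEED_URI_PARAMS = "myproject.utils.uri_params"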

-   :mod:`scrapy.utils.reqser` is deprecated. (:issue:`5130`)

    -   Instead of :func:`~scrapy.utils.reqser.request_to_dict`, use the new
        :meth:`.Request.to_dict` method.

    -   Instead of :func:`~scrapy.utils.reqser.request_from_dict`, use the new
        :func:`scrapy.utils.request.request_from_dict` function.
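
    A migration sketch for a callback-less request, which needs no spider
    object:

    .. code-block:: python

        from scrapy import Request
        from scrapy.utils.request import request_from_dict

        request = Request("https://example.com")
        d = request.to_dict()           # was: reqser.request_to_dict(request)
        revived = request_from_dict(d)  # was: reqser.request_from_dict(d)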

-   In :mod:`scrapy.squeues`, the following queue classes are deprecated:
    :class:`~scrapy.squeues.PickleFifoDiskQueueNonRequest`,
    :class:`~scrapy.squeues.PickleLifoDiskQueueNonRequest`,
    :class:`~scrapy.squeues.MarshalFifoDiskQueueNonRequest`,
    and :class:`~scrapy.squeues.MarshalLifoDiskQueueNonRequest`. You should
    instead use:
    :class:`~scrapy.squeues.PickleFifoDiskQueue`,
    :class:`~scrapy.squeues.PickleLifoDiskQueue`,
    :class:`~scrapy.squeues.MarshalFifoDiskQueue`,
    and :class:`~scrapy.squeues.MarshalLifoDiskQueue`. (:issue:`5117`)

-   Many aspects of :class:`scrapy.core.engine.ExecutionEngine` that remain
    from a time when this class could handle multiple :class:`~scrapy.Spider`
    objects simultaneously have been deprecated. (:issue:`5090`)

    -   The :meth:`~scrapy.core.engine.ExecutionEngine.has_capacity` method
        is deprecated.

    -   The :meth:`~scrapy.core.engine.ExecutionEngine.schedule` method is
        deprecated, use :meth:`~scrapy.core.engine.ExecutionEngine.crawl` or
        :meth:`~scrapy.core.engine.ExecutionEngine.download` instead.

    -   The :attr:`~scrapy.core.engine.ExecutionEngine.open_spiders` attribute
        is deprecated, use :attr:`~scrapy.core.engine.ExecutionEngine.spider`
        instead.

    -   The ``spider`` parameter is deprecated for the following methods:

        -   :meth:`~scrapy.core.engine.ExecutionEngine.spider_is_idle`

        -   :meth:`~scrapy.core.engine.ExecutionEngine.crawl`

        -   :meth:`~scrapy.core.engine.ExecutionEngine.download`

        Instead, call :meth:`~scrapy.core.engine.ExecutionEngine.open_spider`
        first to set the :class:`~scrapy.Spider` object.

-   :func:`scrapy.utils.response.response_httprepr` is now deprecated.
    (:issue:`4972`)

New features
~~~~~~~~~~~~

-   You can now use :ref:`item filtering <item-filter>` to control which items
    are exported to each output feed. (:issue:`4575`, :issue:`5178`,
    :issue:`5161`, :issue:`5203`)

-   You can now apply :ref:`post-processing <post-processing>` to feeds, and
    :ref:`built-in post-processing plugins <builtin-plugins>` are provided for
    output file compression. (:issue:`2174`, :issue:`5168`, :issue:`5190`)
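
    A combined sketch of this feature and the item filtering above (the item
    class is hypothetical):

    .. code-block:: python

        # settings.py
        FEEDS = {
            "products.jsonl.gz": {
                "format": "jsonlines",
                # Item filtering: only export items of the given classes.
                "item_classes": ["myproject.items.ProductItem"],
                # Post-processing: gzip-compress the output file.
                "postprocessing": ["scrapy.extensions.postprocessing.GzipPlugin"],
            },
        }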

-   The :setting:`FEEDS` setting now supports :class:`pathlib.Path` objects as
    keys. (:issue:`5383`, :issue:`5384`)

-   Enabling :ref:`asyncio <using-asyncio>` while using Windows and Python 3.8
    or later will automatically switch the asyncio event loop to one that
    allows Scrapy to work. See :ref:`asyncio-windows`. (:issue:`4976`,
    :issue:`5315`)

-   The :command:`genspider` command now supports a start URL instead of a
    domain name. (:issue:`4439`)

-   :mod:`scrapy.utils.defer` gained 2 new functions,
    :func:`~scrapy.utils.defer.deferred_to_future` and
    :func:`~scrapy.utils.defer.maybe_deferred_to_future`, to help :ref:`await
    on Deferreds when using the asyncio reactor <asyncio-await-dfd>`.
    (:issue:`5288`)
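
    A minimal sketch, assuming the asyncio reactor is installed and that
    ``some_deferred_returning_api()`` is a stand-in for any API that returns a
    :class:`~twisted.internet.defer.Deferred`:

    .. code-block:: python

        from scrapy.utils.defer import maybe_deferred_to_future

        # Inside a Spider class:
        async def parse(self, response):
            deferred = some_deferred_returning_api()  # hypothetical API
            result = await maybe_deferred_to_future(deferred)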

-   :ref:`Amazon S3 feed export storage <topics-feed-storage-s3>` gained
    support for `temporary security credentials`_
    (:setting:`AWS_SESSION_TOKEN`) and endpoint customization
    (:setting:`AWS_ENDPOINT_URL`). (:issue:`4998`, :issue:`5210`)

    .. _temporary security credentials: https://docs.aws.amazon.com/IAM/latest/UserGuide/security-creds.html

-   New :setting:`LOG_FILE_APPEND` setting, which can be set to ``False`` to
    truncate the log file on each run instead of appending to it.
    (:issue:`5279`)

-   :attr:`Request.cookies <scrapy.Request.cookies>` values that are
    :class:`bool`, :class:`float` or :class:`int` are cast to :class:`str`.
    (:issue:`5252`, :issue:`5253`)

-   You may now raise :exc:`~scrapy.exceptions.CloseSpider` from a handler of
    the :signal:`spider_idle` signal to customize the reason why the spider is
    stopping. (:issue:`5191`)

-   When using
    :class:`~scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware`, the
    proxy URL for non-HTTPS HTTP/1.1 requests no longer needs to include a URL
    scheme. (:issue:`4505`, :issue:`4649`)

-   All built-in queues now expose a ``peek`` method that returns the next
    object in the queue (like ``pop``) without removing the returned object
    from the queue. (:issue:`5112`)

    If the underlying queue does not support peeking (e.g. because you are not
    using ``queuelib`` 1.6.1 or later), the ``peek`` method raises
    :exc:`NotImplementedError`.

-   :class:`~scrapy.Request` and :class:`~scrapy.http.Response` now have
    an ``attributes`` attribute that makes subclassing easier. For
    :class:`~scrapy.Request`, it also allows subclasses to work with
    :func:`scrapy.utils.request.request_from_dict`. (:issue:`1877`,
    :issue:`5130`, :issue:`5218`)

-   The :meth:`~scrapy.core.scheduler.BaseScheduler.open` and
    :meth:`~scrapy.core.scheduler.BaseScheduler.close` methods of the
    :ref:`scheduler <topics-scheduler>` are now optional. (:issue:`3559`)

-   HTTP/1.1 :exc:`~scrapy.core.downloader.handlers.http11.TunnelError`
    exceptions now only truncate response bodies longer than 1000 characters,
    instead of those longer than 32 characters, making it easier to debug such
    errors. (:issue:`4881`, :issue:`5007`)

-   :class:`~scrapy.loader.ItemLoader` now supports non-text responses.
    (:issue:`5145`, :issue:`5269`)

Bug fixes
~~~~~~~~~

-   The :setting:`TWISTED_REACTOR` and :setting:`ASYNCIO_EVENT_LOOP` settings
    are no longer ignored if defined in :attr:`~scrapy.Spider.custom_settings`.
    (:issue:`4485`, :issue:`5352`)

-   Removed a module-level Twisted reactor import that could prevent
    :ref:`using the asyncio reactor <using-asyncio>`. (:issue:`5357`)

-   The :command:`startproject` command works with existing folders again.
    (:issue:`4665`, :issue:`4676`)

-   The :setting:`FEED_URI_PARAMS` setting now behaves as documented.
    (:issue:`4962`, :issue:`4966`)

-   :attr:`Request.cb_kwargs <scrapy.Request.cb_kwargs>` once again allows the
    ``callback`` keyword. (:issue:`5237`, :issue:`5251`, :issue:`5264`)

-   Made :func:`scrapy.utils.response.open_in_browser` support more complex
    HTML. (:issue:`5319`, :issue:`5320`)

-   Fixed :attr:`CSVFeedSpider.quotechar
    <scrapy.spiders.CSVFeedSpider.quotechar>` being interpreted as the CSV file
    encoding. (:issue:`5391`, :issue:`5394`)

-   Added missing setuptools_ to the list of dependencies. (:issue:`5122`)

    .. _setuptools: https://pypi.org/project/setuptools/

-   :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
    now also works as expected with links that have comma-separated ``rel``
    attribute values including ``nofollow``. (:issue:`5225`)

-   Fixed a :exc:`TypeError` that could be raised during :ref:`feed export
    <topics-feed-exports>` parameter parsing. (:issue:`5359`)

Documentation
~~~~~~~~~~~~~

-   :ref:`asyncio support <using-asyncio>` is no longer considered
    experimental. (:issue:`5332`)

-   Included :ref:`Windows-specific help for asyncio usage <asyncio-windows>`.
    (:issue:`4976`, :issue:`5315`)

-   Rewrote :ref:`topics-headless-browsing` with up-to-date best practices.
    (:issue:`4484`, :issue:`4613`)

-   Documented :ref:`local file naming in media pipelines
    <topics-file-naming>`. (:issue:`5069`, :issue:`5152`)

-   :ref:`faq` now covers spider file name collision issues. (:issue:`2680`,
    :issue:`3669`)

-   Provided better context and instructions to disable the
    :setting:`URLLENGTH_LIMIT` setting. (:issue:`5135`, :issue:`5250`)

-   Documented that the Reppy parser does not support Python 3.9+.
    (:issue:`5226`, :issue:`5231`)

-   Documented :ref:`the scheduler component <topics-scheduler>`.
    (:issue:`3537`, :issue:`3559`)

-   Documented the method used by :ref:`media pipelines
    <topics-media-pipeline>` to :ref:`determine if a file has expired
    <file-expiration>`. (:issue:`5120`, :issue:`5254`)

-   :ref:`run-multiple-spiders` now features
    :func:`scrapy.utils.project.get_project_settings` usage. (:issue:`5070`)

-   :ref:`run-multiple-spiders` now covers what happens when you define
    different per-spider values for some settings that cannot differ at run
    time. (:issue:`4485`, :issue:`5352`)

-   Extended the documentation of the
    :class:`~scrapy.extensions.statsmailer.StatsMailer` extension.
    (:issue:`5199`, :issue:`5217`)

-   Added :setting:`JOBDIR` to :ref:`topics-settings`. (:issue:`5173`,
    :issue:`5224`)

-   Documented :attr:`Spider.attribute <scrapy.Spider.attribute>`.
    (:issue:`5174`, :issue:`5244`)

-   Documented :attr:`TextResponse.urljoin <scrapy.http.TextResponse.urljoin>`.
    (:issue:`1582`)

-   Added the ``body_length`` parameter to the documented signature of the
    :signal:`headers_received` signal. (:issue:`5270`)

-   Clarified :meth:`SelectorList.get <scrapy.selector.SelectorList.get>` usage
    in the :ref:`tutorial <intro-tutorial>`. (:issue:`5256`)

-   The documentation now features the shortest import path of classes with
    multiple import paths. (:issue:`2733`, :issue:`5099`)

-   ``quotes.toscrape.com`` references now use HTTPS instead of HTTP.
    (:issue:`5395`, :issue:`5396`)

-   Added a link to `our Discord server <https://discord.com/invite/mv3yErfpvq>`_
    to :ref:`getting-help`. (:issue:`5421`, :issue:`5422`)

-   The pronunciation of the project name is now :ref:`officially
    <intro-overview>` /ˈskreɪpaɪ/. (:issue:`5280`, :issue:`5281`)

-   Added the Scrapy logo to the README. (:issue:`5255`, :issue:`5258`)

-   Fixed issues and implemented minor improvements. (:issue:`3155`,
    :issue:`4335`, :issue:`5074`, :issue:`5098`, :issue:`5134`, :issue:`5180`,
    :issue:`5194`, :issue:`5239`, :issue:`5266`, :issue:`5271`, :issue:`5273`,
    :issue:`5274`, :issue:`5276`, :issue:`5347`, :issue:`5356`, :issue:`5414`,
    :issue:`5415`, :issue:`5416`, :issue:`5419`, :issue:`5420`)

Quality assurance
~~~~~~~~~~~~~~~~~

-   Added support for Python 3.10. (:issue:`5212`, :issue:`5221`,
    :issue:`5265`)

-   Significantly reduced memory usage by
    :func:`scrapy.utils.response.response_httprepr`, used by the
    :class:`~scrapy.downloadermiddlewares.stats.DownloaderStats` downloader
    middleware, which is enabled by default. (:issue:`4964`, :issue:`4972`)

-   Removed uses of the deprecated :mod:`optparse` module. (:issue:`5366`,
    :issue:`5374`)

-   Extended typing hints. (:issue:`5077`, :issue:`5090`, :issue:`5100`,
    :issue:`5108`, :issue:`5171`, :issue:`5215`, :issue:`5334`)

-   Improved tests, fixed CI issues, removed unused code. (:issue:`5094`,
    :issue:`5157`, :issue:`5162`, :issue:`5198`, :issue:`5207`, :issue:`5208`,
    :issue:`5229`, :issue:`5298`, :issue:`5299`, :issue:`5310`, :issue:`5316`,
    :issue:`5333`, :issue:`5388`, :issue:`5389`, :issue:`5400`, :issue:`5401`,
    :issue:`5404`, :issue:`5405`, :issue:`5407`, :issue:`5410`, :issue:`5412`,
    :issue:`5425`, :issue:`5427`)

-   Implemented improvements for contributors. (:issue:`5080`, :issue:`5082`,
    :issue:`5177`, :issue:`5200`)

-   Implemented cleanups. (:issue:`5095`, :issue:`5106`, :issue:`5209`,
    :issue:`5228`, :issue:`5235`, :issue:`5245`, :issue:`5246`, :issue:`5292`,
    :issue:`5314`, :issue:`5322`)

.. _release-2.5.1:

Scrapy 2.5.1 (2021-10-05)
-------------------------

*   **Security bug fix:**

    If you use
    :class:`~scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware`
    (i.e. the ``http_user`` and ``http_pass`` spider attributes) for HTTP
    authentication, any request exposes your credentials to the request target.

    To prevent the exposure of authentication credentials to unintended
    domains, you must now also set a new spider attribute,
    ``http_auth_domain``, and point it to the specific domain to which the
    authentication credentials must be sent.

    If the ``http_auth_domain`` spider attribute is not set, the domain of the
    first request will be considered the HTTP authentication target, and
    authentication credentials will only be sent in requests targeting that
    domain.

    If you need to send the same HTTP authentication credentials to multiple
    domains, you can use :func:`w3lib.http.basic_auth_header` instead to
    set the value of the ``Authorization`` header of your requests.

    If you *really* want your spider to send the same HTTP authentication
    credentials to any domain, set the ``http_auth_domain`` spider attribute
    to ``None``.

    Finally, if you are a user of `scrapy-splash`_, know that this version of
    Scrapy breaks compatibility with scrapy-splash 0.7.2 and earlier. You will
    need to upgrade scrapy-splash to a newer version for it to continue to
    work.
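
    A minimal sketch (the domain is hypothetical):

    .. code-block:: python

        import scrapy

        class MySpider(scrapy.Spider):
            name = "example"
            http_user = "user"
            http_pass = "secret"
            # Credentials are now only sent to requests targeting this domain:
            http_auth_domain = "intranet.example.com"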

.. _release-2.5.0:

Scrapy 2.5.0 (2021-04-06)
-------------------------

Highlights:

-   Official Python 3.9 support

-   Experimental :ref:`HTTP/2 support <twisted-http2-handler>`

-   New :func:`~scrapy.downloadermiddlewares.retry.get_retry_request` function
    to retry requests from spider callbacks

-   New :class:`~scrapy.signals.headers_received` signal that allows stopping
    downloads early

-   New :class:`Response.protocol <scrapy.http.Response.protocol>` attribute

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

-   Removed all code that :ref:`was deprecated in 1.7.0 <1.7-deprecations>` and
    had not :ref:`already been removed in 2.4.0 <2.4-deprecation-removals>`.
    (:issue:`4901`)

-   Removed support for the ``SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE`` environment
    variable, :ref:`deprecated in 1.8.0 <1.8-deprecations>`. (:issue:`4912`)

Deprecations
~~~~~~~~~~~~

-   The :mod:`scrapy.utils.py36` module is now deprecated in favor of
    :mod:`scrapy.utils.asyncgen`. (:issue:`4900`)

New features
~~~~~~~~~~~~

-   Experimental :ref:`HTTP/2 support <twisted-http2-handler>` through a new download handler
    that can be assigned to the ``https`` protocol in the
    :setting:`DOWNLOAD_HANDLERS` setting.
    (:issue:`1854`, :issue:`4769`, :issue:`5058`, :issue:`5059`, :issue:`5066`)
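
    A sketch of enabling the handler:

    .. code-block:: python

        # settings.py
        DOWNLOAD_HANDLERS = {
            "https": "scrapy.core.downloader.handlers.http2.H2DownloadHandler",
        }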

-   The new :func:`scrapy.downloadermiddlewares.retry.get_retry_request`
    function may be used from spider callbacks or middlewares to handle the
    retrying of a request beyond the scenarios that
    :class:`~scrapy.downloadermiddlewares.retry.RetryMiddleware` supports.
    (:issue:`3590`, :issue:`3685`, :issue:`4902`)
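
    A sketch of its use from a spider callback (the retry condition is
    hypothetical):

    .. code-block:: python

        from scrapy.downloadermiddlewares.retry import get_retry_request

        # Inside a Spider class:
        def parse(self, response):
            if response.status == 429:
                # Returns a retry request, or None if retries are exhausted.
                retry_request = get_retry_request(
                    response.request, spider=self, reason="throttled"
                )
                if retry_request:
                    yield retry_request
                return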

-   The new :class:`~scrapy.signals.headers_received` signal gives early access
    to response headers and allows :ref:`stopping downloads
    <topics-stop-response-download>`.
    (:issue:`1772`, :issue:`4897`)
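
    A sketch of a spider that cancels every download as soon as the response
    headers arrive:

    .. code-block:: python

        import scrapy
        from scrapy import signals
        from scrapy.exceptions import StopDownload

        class HeadersOnlySpider(scrapy.Spider):
            name = "headers_only"

            @classmethod
            def from_crawler(cls, crawler, *args, **kwargs):
                spider = super().from_crawler(crawler, *args, **kwargs)
                crawler.signals.connect(
                    spider.on_headers_received, signal=signals.headers_received
                )
                return spider

            def on_headers_received(self, headers, body_length, request, spider):
                # fail=False hands the (empty) response to the request callback.
                raise StopDownload(fail=False)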

-   The new :attr:`Response.protocol <scrapy.http.Response.protocol>`
    attribute gives access to the string that identifies the protocol used to
    download a response. (:issue:`4878`)

-   :ref:`Stats <topics-stats>` now include the following entries that indicate
    the number of successes and failures in storing
    :ref:`feeds <topics-feed-exports>`::

        feedexport/success_count/<storage type>
        feedexport/failed_count/<storage type>

    Where ``<storage type>`` is the feed storage backend class name, such as
    :class:`~scrapy.extensions.feedexport.FileFeedStorage` or
    :class:`~scrapy.extensions.feedexport.FTPFeedStorage`.

    (:issue:`3947`, :issue:`4850`)

-   The :class:`~scrapy.spidermiddlewares.urllength.UrlLengthMiddleware` spider
    middleware now logs ignored URLs with ``INFO`` :ref:`logging level
    <levels>` instead of ``DEBUG``, and it now includes the following entry
    into :ref:`stats <topics-stats>` to keep track of the number of ignored
    URLs::

        urllength/request_ignored_count

    (:issue:`5036`)

-   The
    :class:`~scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware`
    downloader middleware now logs the number of decompressed responses and the
    total count of resulting bytes::

        httpcompression/response_bytes
        httpcompression/response_count

    (:issue:`4797`, :issue:`4799`)

Bug fixes
~~~~~~~~~

-   Fixed installation on PyPy installing PyDispatcher in addition to
    PyPyDispatcher, which could prevent Scrapy from working depending on which
    package got imported. (:issue:`4710`, :issue:`4814`)

-   When inspecting a callback to check if it is a generator that also returns
    a value, an exception is no longer raised if the callback has a docstring
    with lower indentation than the following code.
    (:issue:`4477`, :issue:`4935`)

-   The `Content-Length <https://datatracker.ietf.org/doc/html/rfc2616#section-14.13>`_
    header is no longer omitted from responses when using the default, HTTP/1.1
    download handler (see :setting:`DOWNLOAD_HANDLERS`).
    (:issue:`5009`, :issue:`5034`, :issue:`5045`, :issue:`5057`, :issue:`5062`)

-   Setting the :reqmeta:`handle_httpstatus_all` request meta key to ``False``
    now has the same effect as not setting it at all, instead of having the
    same effect as setting it to ``True``.
    (:issue:`3851`, :issue:`4694`)

Documentation
~~~~~~~~~~~~~

-   Added instructions to :ref:`install Scrapy in Windows using pip
    <intro-install-windows>`.
    (:issue:`4715`, :issue:`4736`)

-   Logging documentation now includes :ref:`additional ways to filter logs
    <topics-logging-advanced-customization>`.
    (:issue:`4216`, :issue:`4257`, :issue:`4965`)

-   Covered how to deal with long lists of allowed domains in the :ref:`FAQ
    <faq>`. (:issue:`2263`, :issue:`3667`)

-   Covered scrapy-bench_ in :ref:`benchmarking`.
    (:issue:`4996`, :issue:`5016`)

-   Clarified that one :ref:`extension <topics-extensions>` instance is created
    per crawler.
    (:issue:`5014`)

-   Fixed some errors in examples.
    (:issue:`4829`, :issue:`4830`, :issue:`4907`, :issue:`4909`,
    :issue:`5008`)

-   Fixed some external links, typos, and so on.
    (:issue:`4892`, :issue:`4899`, :issue:`4936`, :issue:`4942`, :issue:`5005`,
    :issue:`5063`)

-   The :ref:`list of Request.meta keys <topics-request-meta>` is now sorted
    alphabetically.
    (:issue:`5061`, :issue:`5065`)

-   Updated references to Scrapinghub, which is now called Zyte.
    (:issue:`4973`, :issue:`5072`)

-   Added a mention to contributors in the README. (:issue:`4956`)

-   Reduced the top margin of lists. (:issue:`4974`)

Quality assurance
~~~~~~~~~~~~~~~~~

-   Made Python 3.9 support official (:issue:`4757`, :issue:`4759`)

-   Extended typing hints (:issue:`4895`)

-   Fixed deprecated uses of the Twisted API.
    (:issue:`4940`, :issue:`4950`, :issue:`5073`)

-   Made our tests run with the new pip resolver.
    (:issue:`4710`, :issue:`4814`)

-   Added tests to ensure that :ref:`coroutine support <coroutine-support>`
    is tested. (:issue:`4987`)

-   Migrated from Travis CI to GitHub Actions. (:issue:`4924`)

-   Fixed CI issues.
    (:issue:`4986`, :issue:`5020`, :issue:`5022`, :issue:`5027`, :issue:`5052`,
    :issue:`5053`)

-   Implemented code refactorings, style fixes and cleanups.
    (:issue:`4911`, :issue:`4982`, :issue:`5001`, :issue:`5002`, :issue:`5076`)

.. _release-2.4.1:

Scrapy 2.4.1 (2020-11-17)
-------------------------

-   Fixed :ref:`feed exports <topics-feed-exports>` overwrite support (:issue:`4845`, :issue:`4857`, :issue:`4859`)

-   Fixed the AsyncIO event loop handling, which could make code hang
    (:issue:`4855`, :issue:`4872`)

-   Fixed the IPv6-capable DNS resolver
    :class:`~scrapy.resolver.CachingHostnameResolver` for download handlers
    that call
    :meth:`reactor.resolve <twisted.internet.interfaces.IReactorCore.resolve>`
    (:issue:`4802`, :issue:`4803`)

-   Fixed the output of the :command:`genspider` command showing placeholders
    instead of the import path of the generated spider module (:issue:`4874`)

-   Migrated Windows CI from Azure Pipelines to GitHub Actions (:issue:`4869`,
    :issue:`4876`)

.. _release-2.4.0:

Scrapy 2.4.0 (2020-10-11)
-------------------------

Highlights:

*   Python 3.5 support has been dropped.

*   The ``file_path`` method of :ref:`media pipelines <topics-media-pipeline>`
    can now access the source :ref:`item <topics-items>`.

    This allows you to set a download file path based on item data.

*   The new ``item_export_kwargs`` key of the :setting:`FEEDS` setting allows
    defining keyword parameters to pass to :ref:`item exporter classes
    <topics-exporters>`.

*   You can now choose whether :ref:`feed exports <topics-feed-exports>`
    overwrite or append to the output file.

    For example, when using the :command:`crawl` or :command:`runspider`
    commands, you can use the ``-O`` option instead of ``-o`` to overwrite the
    output file.

*   Zstd-compressed responses are now supported if zstandard_ is installed.

*   In settings, where the import path of a class is required, it is now
    possible to pass a class object instead.

Modified requirements
~~~~~~~~~~~~~~~~~~~~~

*   Python 3.6 or greater is now required; support for Python 3.5 has been
    dropped

    As a result:

    -   When using PyPy, PyPy 7.2.0 or greater :ref:`is now required
        <faq-python-versions>`

    -   For Amazon S3 storage support in :ref:`feed exports
        <topics-feed-storage-s3>` or :ref:`media pipelines
        <media-pipelines-s3>`, botocore_ 1.4.87 or greater is now required

    -   To use the :ref:`images pipeline <images-pipeline>`, Pillow_ 4.0.0 or
        greater is now required

    (:issue:`4718`, :issue:`4732`, :issue:`4733`, :issue:`4742`, :issue:`4743`,
    :issue:`4764`)

Backward-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*   :class:`~scrapy.downloadermiddlewares.cookies.CookiesMiddleware` once again
    discards cookies defined in :attr:`.Request.headers`.

    We decided to revert this bug fix, introduced in Scrapy 2.2.0, because it
    was reported that the current implementation could break existing code.

    If you need to set cookies for a request, use the :class:`Request.cookies
    <scrapy.Request>` parameter.

    A future version of Scrapy will include a new, better implementation of the
    reverted bug fix.

    (:issue:`4717`, :issue:`4823`)

.. _2.4-deprecation-removals:

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

*   :class:`scrapy.extensions.feedexport.S3FeedStorage` no longer reads the
    values of ``access_key`` and ``secret_key`` from the running project
    settings when they are not passed to its ``__init__`` method; you must
    either pass those parameters to its ``__init__`` method or use
    :class:`S3FeedStorage.from_crawler
    <scrapy.extensions.feedexport.S3FeedStorage.from_crawler>`
    (:issue:`4356`, :issue:`4411`, :issue:`4688`)

*   :attr:`Rule.process_request <scrapy.spiders.crawl.Rule.process_request>`
    no longer admits callables which expect a single ``request`` parameter,
    rather than both ``request`` and ``response`` (:issue:`4818`)

Deprecations
~~~~~~~~~~~~

*   In custom :ref:`media pipelines <topics-media-pipeline>`, signatures that
    do not accept a keyword-only ``item`` parameter in any of the methods that
    :ref:`now support this parameter <media-pipeline-item-parameter>` are now
    deprecated (:issue:`4628`, :issue:`4686`)

*   In custom :ref:`feed storage backend classes <topics-feed-storage>`,
    ``__init__`` method signatures that do not accept a keyword-only
    ``feed_options`` parameter are now deprecated (:issue:`547`, :issue:`716`,
    :issue:`4512`)

*   The :class:`scrapy.utils.python.WeakKeyCache` class is now deprecated
    (:issue:`4684`, :issue:`4701`)

*   The :func:`scrapy.utils.boto.is_botocore` function is now deprecated, use
    :func:`scrapy.utils.boto.is_botocore_available` instead (:issue:`4734`,
    :issue:`4776`)

New features
~~~~~~~~~~~~

.. _media-pipeline-item-parameter:

*   The following methods of :ref:`media pipelines <topics-media-pipeline>` now
    accept an ``item`` keyword-only parameter containing the source
    :ref:`item <topics-items>`:

    -   In :class:`scrapy.pipelines.files.FilesPipeline`:

        -   :meth:`~scrapy.pipelines.files.FilesPipeline.file_downloaded`

        -   :meth:`~scrapy.pipelines.files.FilesPipeline.file_path`

        -   :meth:`~scrapy.pipelines.files.FilesPipeline.media_downloaded`

        -   :meth:`~scrapy.pipelines.files.FilesPipeline.media_to_download`

    -   In :class:`scrapy.pipelines.images.ImagesPipeline`:

        -   :meth:`~scrapy.pipelines.images.ImagesPipeline.file_downloaded`

        -   :meth:`~scrapy.pipelines.images.ImagesPipeline.file_path`

        -   :meth:`~scrapy.pipelines.images.ImagesPipeline.get_images`

        -   :meth:`~scrapy.pipelines.images.ImagesPipeline.image_downloaded`

        -   :meth:`~scrapy.pipelines.images.ImagesPipeline.media_downloaded`

        -   :meth:`~scrapy.pipelines.images.ImagesPipeline.media_to_download`

    (:issue:`4628`, :issue:`4686`)
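
    For example, a sketch that names downloaded files after an item field (the
    field name is hypothetical):

    .. code-block:: python

        from scrapy.pipelines.files import FilesPipeline

        class MyFilesPipeline(FilesPipeline):
            def file_path(self, request, response=None, info=None, *, item=None):
                # Build the download path from the source item.
                return f"files/{item['product_id']}.pdf"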

*   The new ``item_export_kwargs`` key of the :setting:`FEEDS` setting allows
    defining keyword parameters to pass to :ref:`item exporter classes
    <topics-exporters>` (:issue:`4606`, :issue:`4768`)
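
    A sketch, passing a keyword parameter of
    :class:`~scrapy.exporters.CsvItemExporter`:

    .. code-block:: python

        # settings.py
        FEEDS = {
            "items.csv": {
                "format": "csv",
                "item_export_kwargs": {
                    "include_headers_line": False,
                },
            },
        }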

*   :ref:`Feed exports <topics-feed-exports>` gained overwrite support:

    *   When using the :command:`crawl` or :command:`runspider` commands, you
        can use the ``-O`` option instead of ``-o`` to overwrite the output
        file

    *   You can use the ``overwrite`` key in the :setting:`FEEDS` setting to
        configure whether to overwrite the output file (``True``) or append to
        its content (``False``)

    *   The ``__init__`` and ``from_crawler`` methods of :ref:`feed storage
        backend classes <topics-feed-storage>` now receive a new keyword-only
        parameter, ``feed_options``, which is a dictionary of :ref:`feed
        options <feed-options>`

    (:issue:`547`, :issue:`716`, :issue:`4512`)
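
    A sketch of the :setting:`FEEDS` approach:

    .. code-block:: python

        # settings.py
        FEEDS = {
            "items.jsonl": {
                "format": "jsonlines",
                "overwrite": True,  # start from an empty file on each run
            },
        }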

*   Zstd-compressed responses are now supported if zstandard_ is installed
    (:issue:`4831`)

*   In settings, where the import path of a class is required, it is now
    possible to pass a class object instead (:issue:`3870`, :issue:`3873`).

    This also includes settings where only part of the value is an import
    path, such as :setting:`DOWNLOADER_MIDDLEWARES` or
    :setting:`DOWNLOAD_HANDLERS`.
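
    A sketch (the middleware is hypothetical):

    .. code-block:: python

        # settings.py
        from myproject.middlewares import MyDownloaderMiddleware

        DOWNLOADER_MIDDLEWARES = {
            MyDownloaderMiddleware: 543,  # class object instead of import path
        }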

*   :ref:`Downloader middlewares <topics-downloader-middleware>` can now
    override :class:`response.request <scrapy.http.Response.request>`.

    If a :ref:`downloader middleware <topics-downloader-middleware>` returns
    a :class:`~scrapy.http.Response` object from
    :meth:`~scrapy.downloadermiddlewares.DownloaderMiddleware.process_response`
    or
    :meth:`~scrapy.downloadermiddlewares.DownloaderMiddleware.process_exception`
    with a custom :class:`~scrapy.Request` object assigned to
    :class:`response.request <scrapy.http.Response.request>`:

    -   The response is handled by the callback of that custom
        :class:`~scrapy.Request` object, instead of being handled by the
        callback of the original :class:`~scrapy.Request` object

    -   That custom :class:`~scrapy.Request` object is now sent as the
        ``request`` argument to the :signal:`response_received` signal, instead
        of the original :class:`~scrapy.Request` object

    (:issue:`4529`, :issue:`4632`)

*   When using the :ref:`FTP feed storage backend <topics-feed-storage-ftp>`:

    -   It is now possible to set the new ``overwrite`` :ref:`feed option
        <feed-options>` to ``False`` to append to an existing file instead of
        overwriting it

    -   The FTP password can now be omitted if it is not necessary

    (:issue:`547`, :issue:`716`, :issue:`4512`)

*   The ``__init__`` method of :class:`~scrapy.exporters.CsvItemExporter` now
    supports an ``errors`` parameter to indicate how to handle encoding errors
    (:issue:`4755`)

*   When :ref:`using asyncio <using-asyncio>`, it is now possible to
    :ref:`set a custom asyncio loop <using-custom-loops>` (:issue:`4306`,
    :issue:`4414`)

*   Serialized requests (see :ref:`topics-jobs`) now support callbacks that are
    spider methods that delegate to other callables (:issue:`4756`)

*   When a response is larger than :setting:`DOWNLOAD_MAXSIZE`, the logged
    message is now a warning, instead of an error (:issue:`3874`,
    :issue:`3886`, :issue:`4752`)

Bug fixes
~~~~~~~~~

*   The :command:`genspider` command no longer overwrites existing files
    unless the ``--force`` option is used (:issue:`4561`, :issue:`4616`,
    :issue:`4623`)

*   Cookies with an empty value are no longer considered invalid cookies
    (:issue:`4772`)

*   The :command:`runspider` command now supports files with the ``.pyw`` file
    extension (:issue:`4643`, :issue:`4646`)

*   The :class:`~scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware`
    middleware now simply ignores unsupported proxy values (:issue:`3331`,
    :issue:`4778`)

*   Checks for generator callbacks with a ``return`` statement no longer warn
    about ``return`` statements in nested functions (:issue:`4720`,
    :issue:`4721`)

*   The system file mode creation mask no longer affects the permissions of
    files generated using the :command:`startproject` command (:issue:`4722`)

*   :func:`scrapy.utils.iterators.xmliter` now supports namespaced node names
    (:issue:`861`, :issue:`4746`)

*   :class:`~scrapy.Request` objects can now have ``about:`` URLs, which can
    work when using a headless browser (:issue:`4835`)

Documentation
~~~~~~~~~~~~~

*   The :setting:`FEED_URI_PARAMS` setting is now documented (:issue:`4671`,
    :issue:`4724`)

*   Improved the documentation of
    :ref:`link extractors <topics-link-extractors>` with a usage example from
    a spider callback and reference documentation for the
    :class:`~scrapy.link.Link` class (:issue:`4751`, :issue:`4775`)

*   Clarified the impact of :setting:`CONCURRENT_REQUESTS` when using the
    :class:`~scrapy.extensions.closespider.CloseSpider` extension
    (:issue:`4836`)

*   Removed references to Python 2’s ``unicode`` type (:issue:`4547`,
    :issue:`4703`)

*   We now have an :ref:`official deprecation policy <deprecation-policy>`
    (:issue:`4705`)

*   Our :ref:`documentation policies <documentation-policies>` now cover usage
    of Sphinx’s :rst:dir:`versionadded` and :rst:dir:`versionchanged`
    directives, and we have removed usages referencing Scrapy 1.4.0 and earlier
    versions (:issue:`3971`, :issue:`4310`)

*   Other documentation cleanups (:issue:`4090`, :issue:`4782`, :issue:`4800`,
    :issue:`4801`, :issue:`4809`, :issue:`4816`, :issue:`4825`)

Quality assurance
~~~~~~~~~~~~~~~~~

*   Extended typing hints (:issue:`4243`, :issue:`4691`)

*   Added tests for the :command:`check` command (:issue:`4663`)

*   Fixed test failures on Debian (:issue:`4726`, :issue:`4727`, :issue:`4735`)

*   Improved Windows test coverage (:issue:`4723`)

*   Switched to :ref:`formatted string literals <f-strings>` where possible
    (:issue:`4307`, :issue:`4324`, :issue:`4672`)

*   Modernized :func:`super` usage (:issue:`4707`)

*   Other code and test cleanups (:issue:`1790`, :issue:`3288`, :issue:`4165`,
    :issue:`4564`, :issue:`4651`, :issue:`4714`, :issue:`4738`, :issue:`4745`,
    :issue:`4747`, :issue:`4761`, :issue:`4765`, :issue:`4804`, :issue:`4817`,
    :issue:`4820`, :issue:`4822`, :issue:`4839`)

.. _release-2.3.0:

Scrapy 2.3.0 (2020-08-04)
-------------------------

Highlights:

*   :ref:`Feed exports <topics-feed-exports>` now support :ref:`Google Cloud
    Storage <topics-feed-storage-gcs>` as a storage backend

*   The new :setting:`FEED_EXPORT_BATCH_ITEM_COUNT` setting allows delivering
    output items in batches of up to the specified number of items.

    It also serves as a workaround for :ref:`delayed file delivery
    <delayed-file-delivery>`, which causes Scrapy to only start item delivery
    after the crawl has finished when using certain storage backends
    (:ref:`S3 <topics-feed-storage-s3>`, :ref:`FTP <topics-feed-storage-ftp>`,
    and now :ref:`GCS <topics-feed-storage-gcs>`).

*   The base implementation of :ref:`item loaders <topics-loaders>` has been
    moved into a separate library, :doc:`itemloaders <itemloaders:index>`,
    allowing usage from outside Scrapy and a separate release schedule

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

*   Removed the following classes and their parent modules from
    ``scrapy.linkextractors``:

    *   ``htmlparser.HtmlParserLinkExtractor``
    *   ``regex.RegexLinkExtractor``
    *   ``sgml.BaseSgmlLinkExtractor``
    *   ``sgml.SgmlLinkExtractor``

    Use
    :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
    instead (:issue:`4356`, :issue:`4679`)

Deprecations
~~~~~~~~~~~~

*   The ``scrapy.utils.python.retry_on_eintr`` function is now deprecated
    (:issue:`4683`)

New features
~~~~~~~~~~~~

*   :ref:`Feed exports <topics-feed-exports>` support :ref:`Google Cloud
    Storage <topics-feed-storage-gcs>` (:issue:`685`, :issue:`3608`)

*   New :setting:`FEED_EXPORT_BATCH_ITEM_COUNT` setting for batch deliveries
    (:issue:`4250`, :issue:`4434`)

*   The :command:`parse` command now allows specifying an output file
    (:issue:`4317`, :issue:`4377`)

*   :meth:`.Request.from_curl` and
    :func:`~scrapy.utils.curl.curl_to_request_kwargs` now also support
    ``--data-raw`` (:issue:`4612`)

*   A ``parse`` callback may now be used in built-in spider subclasses, such
    as :class:`~scrapy.spiders.CrawlSpider` (:issue:`712`, :issue:`732`,
    :issue:`781`, :issue:`4254`)

Bug fixes
~~~~~~~~~

*   Fixed the :ref:`CSV exporting <topics-feed-format-csv>` of
    :ref:`dataclass items <dataclass-items>` and :ref:`attr.s items
    <attrs-items>` (:issue:`4667`, :issue:`4668`)

*   :meth:`.Request.from_curl` and
    :func:`~scrapy.utils.curl.curl_to_request_kwargs` now set the request
    method to ``POST`` when a request body is specified and no request method
    is given (:issue:`4612`)

*   The processing of ANSI escape sequences is now enabled on Windows
    10.0.14393 and later, where it is required for colored output
    (:issue:`4393`, :issue:`4403`)

Documentation
~~~~~~~~~~~~~

*   Updated the `OpenSSL cipher list format`_ link in the documentation about
    the :setting:`DOWNLOADER_CLIENT_TLS_CIPHERS` setting (:issue:`4653`)

*   Simplified the code example in :ref:`topics-loaders-dataclass`
    (:issue:`4652`)

.. _OpenSSL cipher list format: https://docs.openssl.org/master/man1/openssl-ciphers/#cipher-list-format

Quality assurance
~~~~~~~~~~~~~~~~~

*   The base implementation of :ref:`item loaders <topics-loaders>` has been
    moved into :doc:`itemloaders <itemloaders:index>` (:issue:`4005`,
    :issue:`4516`)

*   Fixed a silenced error in some scheduler tests (:issue:`4644`,
    :issue:`4645`)

*   Renewed the localhost certificate used for SSL tests (:issue:`4650`)

*   Removed cookie-handling code specific to Python 2 (:issue:`4682`)

*   Stopped using Python 2 unicode literal syntax (:issue:`4704`)

*   Stopped using a backslash for line continuation (:issue:`4673`)

*   Removed unneeded entries from the MyPy exception list (:issue:`4690`)

*   Automated tests now pass on Windows as part of our continuous integration
    system (:issue:`4458`)

*   Automated tests now pass on the latest PyPy version for supported Python
    versions in our continuous integration system (:issue:`4504`)

.. _release-2.2.1:

Scrapy 2.2.1 (2020-07-17)
-------------------------

*   The :command:`startproject` command no longer makes unintended changes to
    the permissions of files in the destination folder, such as removing
    execution permissions (:issue:`4662`, :issue:`4666`)

.. _release-2.2.0:

Scrapy 2.2.0 (2020-06-24)
-------------------------

Highlights:

* Python 3.5.2+ is now required
* :ref:`dataclass objects <dataclass-items>` and
  :ref:`attrs objects <attrs-items>` are now valid :ref:`item types
  <item-types>`
* New :meth:`TextResponse.json <scrapy.http.TextResponse.json>` method
* New :signal:`bytes_received` signal that allows canceling response download
* :class:`~scrapy.downloadermiddlewares.cookies.CookiesMiddleware` fixes

Backward-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*   Support for Python 3.5.0 and 3.5.1 has been dropped; Scrapy now refuses to
    run with a Python version lower than 3.5.2, which introduced
    :class:`typing.Type` (:issue:`4615`)

Deprecations
~~~~~~~~~~~~

*   ``TextResponse.body_as_unicode()`` is now deprecated, use
    :attr:`TextResponse.text <scrapy.http.TextResponse.text>` instead
    (:issue:`4546`, :issue:`4555`, :issue:`4579`)

*   :class:`scrapy.item.BaseItem` is now deprecated, use
    :class:`scrapy.item.Item` instead (:issue:`4534`)

New features
~~~~~~~~~~~~

*   :ref:`dataclass objects <dataclass-items>` and
    :ref:`attrs objects <attrs-items>` are now valid :ref:`item types
    <item-types>`, and a new itemadapter_ library makes it easy to
    write code that :ref:`supports any item type <supporting-item-types>`
    (:issue:`2749`, :issue:`2807`, :issue:`3761`, :issue:`3881`, :issue:`4642`)

*   A new :meth:`TextResponse.json <scrapy.http.TextResponse.json>` method
    allows deserializing JSON responses (:issue:`2444`, :issue:`4460`,
    :issue:`4574`)
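
    A sketch of its use in a callback (the key is hypothetical):

    .. code-block:: python

        # Inside a Spider class:
        def parse(self, response):
            data = response.json()  # object parsed from the JSON response body
            yield {"title": data["title"]}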

*   A new :signal:`bytes_received` signal allows monitoring response download
    progress and :ref:`stopping downloads <topics-stop-response-download>`
    (:issue:`4205`, :issue:`4559`)

*   The dictionaries in the result list of a :ref:`media pipeline
    <topics-media-pipeline>` now include a new key, ``status``, which indicates
    if the file was downloaded or, if the file was not downloaded, why it was
    not downloaded; see :meth:`FilesPipeline.get_media_requests
    <scrapy.pipelines.files.FilesPipeline.get_media_requests>` for more
    information (:issue:`2893`, :issue:`4486`)

*   When using :ref:`Google Cloud Storage <media-pipeline-gcs>` for
    a :ref:`media pipeline <topics-media-pipeline>`, a warning is now logged if
    the configured credentials do not grant the required permissions
    (:issue:`4346`, :issue:`4508`)

*   :ref:`Link extractors <topics-link-extractors>` are now serializable,
    as long as you do not use :ref:`lambdas <lambda>` for parameters; for
    example, you can now pass link extractors in :attr:`.Request.cb_kwargs`
    or :attr:`.Request.meta` when :ref:`persisting
    scheduled requests <topics-jobs>` (:issue:`4554`)

*   Upgraded the :ref:`pickle protocol <pickle-protocols>` that Scrapy uses
    from protocol 2 to protocol 4, improving serialization capabilities and
    performance (:issue:`4135`, :issue:`4541`)

*   :func:`scrapy.utils.misc.create_instance` now raises a :exc:`TypeError`
    exception if the resulting instance is ``None`` (:issue:`4528`,
    :issue:`4532`)

.. _itemadapter: https://github.com/scrapy/itemadapter

Bug fixes
~~~~~~~~~

*   :class:`~scrapy.downloadermiddlewares.cookies.CookiesMiddleware` no longer
    discards cookies defined in :attr:`Request.headers
    <scrapy.Request.headers>` (:issue:`1992`, :issue:`2400`)

*   :class:`~scrapy.downloadermiddlewares.cookies.CookiesMiddleware` no longer
    re-encodes cookies defined as :class:`bytes` in the ``cookies`` parameter
    of the ``__init__`` method of :class:`~scrapy.Request`
    (:issue:`2400`, :issue:`3575`)

*   When :setting:`FEEDS` defines multiple URIs, :setting:`FEED_STORE_EMPTY` is
    ``False`` and the crawl yields no items, Scrapy no longer stops feed
    exports after the first URI (:issue:`4621`, :issue:`4626`)

*   :class:`~scrapy.spiders.Spider` callbacks defined using :doc:`coroutine
    syntax <topics/coroutines>` no longer need to return an iterable, and may
    instead return a :class:`~scrapy.Request` object, an
    :ref:`item <topics-items>`, or ``None`` (:issue:`4609`)

*   The :command:`startproject` command now ensures that the generated project
    folders and files have the right permissions (:issue:`4604`)

*   Fixed a :exc:`KeyError` exception that was sometimes raised from
    :class:`scrapy.utils.datatypes.LocalWeakReferencedCache` (:issue:`4597`,
    :issue:`4599`)

*   When :setting:`FEEDS` defines multiple URIs, log messages about items being
    stored now contain information from the corresponding feed, instead of
    always containing information about only one of the feeds (:issue:`4619`,
    :issue:`4629`)

Documentation
~~~~~~~~~~~~~

*   Added a new section about :ref:`accessing cb_kwargs from errbacks
    <errback-cb_kwargs>` (:issue:`4598`, :issue:`4634`)

*   Covered chompjs_ in :ref:`topics-parsing-javascript` (:issue:`4556`,
    :issue:`4562`)

*   Removed from :doc:`topics/coroutines` the warning about the API being
    experimental (:issue:`4511`, :issue:`4513`)

*   Removed references to unsupported versions of :doc:`Twisted
    <twisted:index>` (:issue:`4533`)

*   Updated the description of the :ref:`screenshot pipeline example
    <ScreenshotPipeline>`, which now uses :doc:`coroutine syntax
    <topics/coroutines>` instead of returning a
    :class:`~twisted.internet.defer.Deferred` (:issue:`4514`, :issue:`4593`)

*   Removed a misleading import line from the
    :func:`scrapy.utils.log.configure_logging` code example (:issue:`4510`,
    :issue:`4587`)

*   The display-on-hover behavior of internal documentation references now also
    covers links to :ref:`commands <topics-commands>`, :attr:`.Request.meta`
    keys, :ref:`settings <topics-settings>` and
    :ref:`signals <topics-signals>` (:issue:`4495`, :issue:`4563`)

*   It is again possible to download the documentation for offline reading
    (:issue:`4578`, :issue:`4585`)

*   Removed backslashes preceding ``*args`` and ``**kwargs`` in some function
    and method signatures (:issue:`4592`, :issue:`4596`)

.. _chompjs: https://github.com/Nykakin/chompjs

Quality assurance
~~~~~~~~~~~~~~~~~

*   Adjusted the code base further to our :ref:`style guidelines
    <coding-style>` (:issue:`4237`, :issue:`4525`, :issue:`4538`,
    :issue:`4539`, :issue:`4540`, :issue:`4542`, :issue:`4543`, :issue:`4544`,
    :issue:`4545`, :issue:`4557`, :issue:`4558`, :issue:`4566`, :issue:`4568`,
    :issue:`4572`)

*   Removed remnants of Python 2 support (:issue:`4550`, :issue:`4553`,
    :issue:`4568`)

*   Improved code sharing between the :command:`crawl` and :command:`runspider`
    commands (:issue:`4548`, :issue:`4552`)

*   Replaced ``chain(*iterable)`` with ``chain.from_iterable(iterable)``
    (:issue:`4635`)

*   You may now run the :mod:`asyncio` tests with Tox on any Python version
    (:issue:`4521`)

*   Updated test requirements to reflect an incompatibility with pytest 5.4 and
    5.4.1 (:issue:`4588`)

*   Improved :class:`~scrapy.spiderloader.SpiderLoader` test coverage for
    scenarios involving duplicate spider names (:issue:`4549`, :issue:`4560`)

*   Configured Travis CI to also run the tests with Python 3.5.2
    (:issue:`4518`, :issue:`4615`)

*   Added a `Pylint <https://www.pylint.org/>`_ job to Travis CI
    (:issue:`3727`)

*   Added a `Mypy <https://mypy-lang.org/>`_ job to Travis CI (:issue:`4637`)

*   Made use of set literals in tests (:issue:`4573`)

*   Cleaned up the Travis CI configuration (:issue:`4517`, :issue:`4519`,
    :issue:`4522`, :issue:`4537`)

.. _release-2.1.0:

Scrapy 2.1.0 (2020-04-24)
-------------------------

Highlights:

* New :setting:`FEEDS` setting to export to multiple feeds
* New :attr:`Response.ip_address <scrapy.http.Response.ip_address>` attribute

Backward-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*   :exc:`AssertionError` exceptions triggered by :ref:`assert <assert>`
    statements have been replaced by new exception types, to support running
    Python in optimized mode (see :option:`-O`) without changing Scrapy’s
    behavior in any unexpected ways.

    If you catch an :exc:`AssertionError` exception from Scrapy, update your
    code to catch the corresponding new exception.

    (:issue:`4440`)

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

*   The ``LOG_UNSERIALIZABLE_REQUESTS`` setting is no longer supported, use
    :setting:`SCHEDULER_DEBUG` instead (:issue:`4385`)

*   The ``REDIRECT_MAX_METAREFRESH_DELAY`` setting is no longer supported, use
    :setting:`METAREFRESH_MAXDELAY` instead (:issue:`4385`)

*   The ``scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware``
    middleware has been removed, including the entire
    ``scrapy.downloadermiddlewares.chunked`` module; chunked transfers
    work out of the box (:issue:`4431`)

*   The ``spiders`` property has been removed from
    :class:`~scrapy.crawler.Crawler`, use :class:`CrawlerRunner.spider_loader
    <scrapy.crawler.CrawlerRunner.spider_loader>` or instantiate
    :setting:`SPIDER_LOADER_CLASS` with your settings instead (:issue:`4398`)

*   The ``MultiValueDict``, ``MultiValueDictKeyError``, and ``SiteNode``
    classes have been removed from :mod:`scrapy.utils.datatypes`
    (:issue:`4400`)

Deprecations
~~~~~~~~~~~~

*   The ``FEED_FORMAT`` and ``FEED_URI`` settings have been deprecated in
    favor of the new :setting:`FEEDS` setting (:issue:`1336`, :issue:`3858`,
    :issue:`4507`)

New features
~~~~~~~~~~~~

*   A new setting, :setting:`FEEDS`, allows configuring multiple output feeds
    with different settings each (:issue:`1336`, :issue:`3858`, :issue:`4507`)
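
    For example, a minimal sketch (the file names and options are
    illustrative):

    .. code-block:: python

        # settings.py
        FEEDS = {
            "items.json": {"format": "json", "encoding": "utf8"},
            "items.csv": {"format": "csv"},
        }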

*   The :command:`crawl` and :command:`runspider` commands now support multiple
    ``-o`` parameters (:issue:`1336`, :issue:`3858`, :issue:`4507`)

*   The :command:`crawl` and :command:`runspider` commands now support
    specifying an output format by appending ``:<format>`` to the output file
    (:issue:`1336`, :issue:`3858`, :issue:`4507`)

*   The new :attr:`Response.ip_address <scrapy.http.Response.ip_address>`
    attribute gives access to the IP address that originated a response
    (:issue:`3903`, :issue:`3940`)
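
    A minimal sketch (the attribute is an :mod:`ipaddress` address object, or
    ``None`` when unavailable):

    .. code-block:: python

        def parse(self, response):
            # e.g. IPv4Address('93.184.216.34'); assumes a spider method
            self.logger.info("Server IP: %s", response.ip_address)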

*   A warning is now issued when a value in
    :attr:`~scrapy.spiders.Spider.allowed_domains` includes a port
    (:issue:`50`, :issue:`3198`, :issue:`4413`)

*   Zsh completion now excludes used option aliases from the completion list
    (:issue:`4438`)

Bug fixes
~~~~~~~~~

*   :ref:`Request serialization <request-serialization>` no longer breaks for
    callbacks that are spider attributes assigned a function with a different
    name (:issue:`4500`)

*   ``None`` values in :attr:`~scrapy.spiders.Spider.allowed_domains` no longer
    cause a :exc:`TypeError` exception (:issue:`4410`)

*   Zsh completion no longer allows options after arguments (:issue:`4438`)

*   zope.interface 5.0.0 and later versions are now supported
    (:issue:`4447`, :issue:`4448`)

*   ``Spider.make_requests_from_url``, deprecated in Scrapy 1.4.0, now issues a
    warning when used (:issue:`4412`)

Documentation
~~~~~~~~~~~~~

*   Improved the documentation about signals that allow their handlers to
    return a :class:`~twisted.internet.defer.Deferred` (:issue:`4295`,
    :issue:`4390`)

*   Our PyPI entry now includes links for our documentation, our source code
    repository and our issue tracker (:issue:`4456`)

*   Covered the `curl2scrapy <https://michael-shub.github.io/curl2scrapy/>`_
    service in the documentation (:issue:`4206`, :issue:`4455`)

*   Removed references to the Guppy library, which only works in Python 2
    (:issue:`4285`, :issue:`4343`)

*   Extended use of InterSphinx to link to Python 3 documentation
    (:issue:`4444`, :issue:`4445`)

*   Added support for Sphinx 3.0 and later (:issue:`4475`, :issue:`4480`,
    :issue:`4496`, :issue:`4503`)

Quality assurance
~~~~~~~~~~~~~~~~~

*   Removed warnings about using old, removed settings (:issue:`4404`)

*   Removed a warning about importing
    :class:`~twisted.internet.testing.StringTransport` from
    ``twisted.test.proto_helpers`` in Twisted 19.7.0 or newer (:issue:`4409`)

*   Removed outdated Debian package build files (:issue:`4384`)

*   Removed :class:`object` usage as a base class (:issue:`4430`)

*   Removed code that added support for old versions of Twisted that we no
    longer support (:issue:`4472`)

*   Fixed code style issues (:issue:`4468`, :issue:`4469`, :issue:`4471`,
    :issue:`4481`)

*   Removed :func:`twisted.internet.defer.returnValue` calls (:issue:`4443`,
    :issue:`4446`, :issue:`4489`)

.. _release-2.0.1:

Scrapy 2.0.1 (2020-03-18)
-------------------------

*   :meth:`Response.follow_all <scrapy.http.Response.follow_all>` now supports
    an empty URL iterable as input (:issue:`4408`, :issue:`4420`)

*   Removed top-level :mod:`~twisted.internet.reactor` imports to prevent
    errors about the wrong Twisted reactor being installed when setting a
    different Twisted reactor using :setting:`TWISTED_REACTOR` (:issue:`4401`,
    :issue:`4406`)

*   Fixed tests (:issue:`4422`)

.. _release-2.0.0:

Scrapy 2.0.0 (2020-03-03)
-------------------------

Highlights:

* Python 2 support has been removed
* :doc:`Partial <topics/coroutines>` :ref:`coroutine syntax <async>` support
  and :doc:`experimental <topics/asyncio>` :mod:`asyncio` support
* New :meth:`Response.follow_all <scrapy.http.Response.follow_all>` method
* :ref:`FTP support <media-pipeline-ftp>` for media pipelines
* New :attr:`Response.certificate <scrapy.http.Response.certificate>`
  attribute
* IPv6 support through ``DNS_RESOLVER``

Backward-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*   Python 2 support has been removed, following `Python 2 end-of-life on
    January 1, 2020`_ (:issue:`4091`, :issue:`4114`, :issue:`4115`,
    :issue:`4121`, :issue:`4138`, :issue:`4231`, :issue:`4242`, :issue:`4304`,
    :issue:`4309`, :issue:`4373`)

*   Requests whose retries have been exhausted (see :setting:`RETRY_TIMES`)
    are now logged as errors instead of as debug information (:issue:`3171`,
    :issue:`3566`)

*   File extensions that
    :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
    ignores by default now also include ``7z``, ``7zip``, ``apk``, ``bz2``,
    ``cdr``, ``dmg``, ``ico``, ``iso``, ``tar``, ``tar.gz``, ``webm``, and
    ``xz`` (:issue:`1837`, :issue:`2067`, :issue:`4066`)

*   The :setting:`METAREFRESH_IGNORE_TAGS` setting is now an empty list by
    default, following web browser behavior (:issue:`3844`, :issue:`4311`)

*   The
    :class:`~scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware`
    now includes spaces after commas in the value of the ``Accept-Encoding``
    header that it sets, following web browser behavior (:issue:`4293`)

*   The ``__init__`` method of custom download handlers (see
    :setting:`DOWNLOAD_HANDLERS`) or subclasses of the following downloader
    handlers no longer receives a ``settings`` parameter:

    *   :class:`scrapy.core.downloader.handlers.datauri.DataURIDownloadHandler`

    *   :class:`scrapy.core.downloader.handlers.file.FileDownloadHandler`

    Use the ``from_settings`` or ``from_crawler`` class methods to expose such
    a parameter to your custom download handlers.

    (:issue:`4126`)

*   We have refactored the :class:`scrapy.core.scheduler.Scheduler` class and
    related queue classes (see :setting:`SCHEDULER_PRIORITY_QUEUE`,
    :setting:`SCHEDULER_DISK_QUEUE` and :setting:`SCHEDULER_MEMORY_QUEUE`) to
    make it easier to implement custom scheduler queue classes. See
    :ref:`2-0-0-scheduler-queue-changes` below for details.

*   Overridden settings are now logged in a different format. This is more in
    line with similar information logged at startup (:issue:`4199`)

.. _Python 2 end-of-life on January 1, 2020: https://www.python.org/doc/sunset-python-2/

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

*   The :ref:`Scrapy shell <topics-shell>` no longer provides a ``sel`` proxy
    object; use :meth:`response.selector <scrapy.http.TextResponse.selector>`
    instead (:issue:`4347`)

*   LevelDB support has been removed (:issue:`4112`)

*   The following functions have been removed from :mod:`scrapy.utils.python`:
    ``isbinarytext``, ``is_writable``, ``setattr_default``, ``stringify_dict``
    (:issue:`4362`)

Deprecations
~~~~~~~~~~~~

*   Using environment variables prefixed with ``SCRAPY_`` to override settings
    is deprecated (:issue:`4300`, :issue:`4374`, :issue:`4375`)

*   :class:`scrapy.linkextractors.FilteringLinkExtractor` is deprecated, use
    :class:`scrapy.linkextractors.LinkExtractor
    <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>` instead (:issue:`4045`)

*   The ``noconnect`` query string argument of proxy URLs is deprecated and
    should be removed from proxy URLs (:issue:`4198`)

*   The :meth:`next <scrapy.utils.python.MutableChain.next>` method of
    :class:`scrapy.utils.python.MutableChain` is deprecated, use the global
    :func:`next` function or :meth:`MutableChain.__next__
    <scrapy.utils.python.MutableChain.__next__>` instead (:issue:`4153`)

New features
~~~~~~~~~~~~

*   Added :doc:`partial support <topics/coroutines>` for Python’s
    :ref:`coroutine syntax <async>` and :doc:`experimental support
    <topics/asyncio>` for :mod:`asyncio` and :mod:`asyncio`-powered libraries
    (:issue:`4010`, :issue:`4259`, :issue:`4269`, :issue:`4270`, :issue:`4271`,
    :issue:`4316`, :issue:`4318`)

*   The new :meth:`Response.follow_all <scrapy.http.Response.follow_all>`
    method offers the same functionality as
    :meth:`Response.follow <scrapy.http.Response.follow>` but supports an
    iterable of URLs as input and returns an iterable of requests
    (:issue:`2582`, :issue:`4057`, :issue:`4286`)
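
    For instance, a sketch that follows every pagination link (the CSS
    selector is made up):

    .. code-block:: python

        def parse(self, response):
            # Builds one request per matching link
            yield from response.follow_all(css="a.next-page", callback=self.parse)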

*   :ref:`Media pipelines <topics-media-pipeline>` now support :ref:`FTP
    storage <media-pipeline-ftp>` (:issue:`3928`, :issue:`3961`)

*   The new :attr:`Response.certificate <scrapy.http.Response.certificate>`
    attribute exposes the SSL certificate of the server as a
    :class:`twisted.internet.ssl.Certificate` object for HTTPS responses
    (:issue:`2726`, :issue:`4054`)
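
    A quick sketch (the ``getIssuer()`` accessor of
    :class:`twisted.internet.ssl.Certificate` is an assumption):

    .. code-block:: python

        def parse(self, response):
            if response.certificate:  # ``None`` for non-HTTPS responses
                self.logger.info("Issuer: %s", response.certificate.getIssuer())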

*   A new ``DNS_RESOLVER`` setting allows enabling IPv6 support
    (:issue:`1031`, :issue:`4227`)

*   A new :setting:`SCRAPER_SLOT_MAX_ACTIVE_SIZE` setting allows configuring
    the existing soft limit that pauses request downloads when the total
    response data being processed is too high (:issue:`1410`, :issue:`3551`)

*   A new :setting:`TWISTED_REACTOR` setting allows customizing the
    :mod:`~twisted.internet.reactor` that Scrapy uses, making it possible to
    :doc:`enable asyncio support <topics/asyncio>` or work around a
    :ref:`common macOS issue <faq-specific-reactor>` (:issue:`2905`,
    :issue:`4294`)
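
    For example, to enable the :mod:`asyncio` reactor:

    .. code-block:: python

        # settings.py
        TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"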

*   Scheduler disk and memory queues may now use the class methods
    ``from_crawler`` or ``from_settings`` (:issue:`3884`)

*   The new :attr:`Response.cb_kwargs <scrapy.http.Response.cb_kwargs>`
    attribute serves as a shortcut for :attr:`Response.request.cb_kwargs
    <scrapy.Request.cb_kwargs>` (:issue:`4331`)
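
    A minimal sketch (the ``page`` keyword argument is made up):

    .. code-block:: python

        def parse_page(self, response):
            # Shortcut for response.request.cb_kwargs["page"]
            page = response.cb_kwargs["page"]
            self.logger.debug("Parsing page %s", page)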

*   :meth:`Response.follow <scrapy.http.Response.follow>` now supports a
    ``flags`` parameter, for consistency with :class:`~scrapy.Request`
    (:issue:`4277`, :issue:`4279`)

*   :ref:`Item loader processors <topics-loaders-processors>` can now be
    regular functions; they no longer need to be methods (:issue:`3899`)

*   :class:`~scrapy.spiders.Rule` now accepts an ``errback`` parameter
    (:issue:`4000`)
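
    A sketch, assuming ``errback`` accepts a method name string like
    ``callback`` does (URL patterns are made up):

    .. code-block:: python

        from scrapy.linkextractors import LinkExtractor
        from scrapy.spiders import CrawlSpider, Rule


        class ItemSpider(CrawlSpider):
            name = "items-example"
            start_urls = ["https://example.com/"]
            rules = [
                Rule(LinkExtractor(allow=r"/items/"), callback="parse_item",
                     errback="handle_error"),
            ]

            def parse_item(self, response):
                ...

            def handle_error(self, failure):
                self.logger.warning("Rule request failed: %r", failure)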

*   :class:`~scrapy.Request` no longer requires a ``callback`` parameter
    when an ``errback`` parameter is specified (:issue:`3586`, :issue:`4008`)

*   :class:`~scrapy.logformatter.LogFormatter` now supports some additional
    methods:

    *   :class:`~scrapy.logformatter.LogFormatter.download_error` for
        download errors

    *   :class:`~scrapy.logformatter.LogFormatter.item_error` for exceptions
        raised during item processing by :ref:`item pipelines
        <topics-item-pipeline>`

    *   :class:`~scrapy.logformatter.LogFormatter.spider_error` for exceptions
        raised from :ref:`spider callbacks <topics-spiders>`

    (:issue:`374`, :issue:`3986`, :issue:`3989`, :issue:`4176`, :issue:`4188`)
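
    For example, a sketch that silences download error messages (the
    ``download_error`` signature shown is an assumption):

    .. code-block:: python

        from scrapy.logformatter import LogFormatter


        class QuietLogFormatter(LogFormatter):
            def download_error(self, failure, request, spider, errmsg=None):
                # Returning None drops the log entry
                return None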

*   The :setting:`FEED_URI` setting now supports :class:`pathlib.Path` values
    (:issue:`3731`, :issue:`4074`)

*   A new :signal:`request_left_downloader` signal is sent when a request
    leaves the downloader (:issue:`4303`)

*   Scrapy logs a warning when it detects a request callback or errback that
    uses ``yield`` but also returns a value, since the returned value would be
    lost (:issue:`3484`, :issue:`3869`)

*   :class:`~scrapy.spiders.Spider` objects now raise an :exc:`AttributeError`
    exception if they have a ``start_url`` attribute but neither a
    :class:`~scrapy.spiders.Spider.start_urls` attribute nor a reimplemented
    ``scrapy.spiders.Spider.start_requests()`` method (:issue:`4133`,
    :issue:`4170`)

*   :class:`~scrapy.exporters.BaseItemExporter` subclasses may now use
    ``super().__init__(**kwargs)`` instead of ``self._configure(kwargs)`` in
    their ``__init__`` method, passing ``dont_fail=True`` to the parent
    ``__init__`` method if needed, and accessing ``kwargs`` at ``self._kwargs``
    after calling their parent ``__init__`` method (:issue:`4193`,
    :issue:`4370`)

*   A new ``keep_fragments`` parameter of
    ``scrapy.utils.request.request_fingerprint`` makes it possible to generate
    different fingerprints for requests with different fragments in their URL
    (:issue:`4104`)
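
    A minimal sketch of the difference (the URLs are illustrative):

    .. code-block:: python

        from scrapy import Request
        from scrapy.utils.request import request_fingerprint

        r1 = Request("https://example.com/page#a")
        r2 = Request("https://example.com/page#b")
        # Fragments are ignored by default, so these match...
        assert request_fingerprint(r1) == request_fingerprint(r2)
        # ...but differ once fragments are kept
        assert (request_fingerprint(r1, keep_fragments=True)
                != request_fingerprint(r2, keep_fragments=True))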

*   Download handlers (see :setting:`DOWNLOAD_HANDLERS`) may now use the
    ``from_settings`` and ``from_crawler`` class methods that other Scrapy
    components already supported (:issue:`4126`)

*   :class:`scrapy.utils.python.MutableChain.__iter__` now returns ``self``,
    allowing it to be used as a sequence (:issue:`4153`)

Bug fixes
~~~~~~~~~

*   The :command:`crawl` command now also exits with exit code 1 when an
    exception happens before the crawling starts (:issue:`4175`, :issue:`4207`)

*   :class:`LinkExtractor.extract_links
    <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor.extract_links>` no longer
    re-encodes the query string or URLs from non-UTF-8 responses in UTF-8
    (:issue:`998`, :issue:`1403`, :issue:`1949`, :issue:`4321`)

*   The first spider middleware (see :setting:`SPIDER_MIDDLEWARES`) now also
    processes exceptions raised from callbacks that are generators
    (:issue:`4260`, :issue:`4272`)

*   Redirects to URLs starting with 3 slashes (``///``) are now supported
    (:issue:`4032`, :issue:`4042`)

*   :class:`~scrapy.Request` no longer accepts strings as ``url`` merely
    because they contain a colon (:issue:`2552`, :issue:`4094`)

*   The correct encoding is now used for attachment names in
    :class:`~scrapy.mail.MailSender` (:issue:`4229`, :issue:`4239`)

*   :class:`~scrapy.dupefilters.RFPDupeFilter`, the default
    :setting:`DUPEFILTER_CLASS`, no longer writes an extra ``\r`` character on
    each line on Windows, which made the size of the ``requests.seen`` file
    unnecessarily large on that platform (:issue:`4283`)

*   Z shell auto-completion now looks for ``.html`` files, not ``.http`` files,
    and covers the ``-h`` command-line switch (:issue:`4122`, :issue:`4291`)

*   Adding items to a :class:`scrapy.utils.datatypes.LocalCache` object
    without a ``limit`` defined no longer raises a :exc:`TypeError` exception
    (:issue:`4123`)

*   Fixed a typo in the message of the :exc:`ValueError` exception raised when
    :func:`scrapy.utils.misc.create_instance` gets both ``settings`` and
    ``crawler`` set to ``None`` (:issue:`4128`)

Documentation
~~~~~~~~~~~~~

*   API documentation now links to an online, syntax-highlighted view of the
    corresponding source code (:issue:`4148`)

*   Links to nonexistent documentation pages now allow access to the sidebar
    (:issue:`4152`, :issue:`4169`)

*   Cross-references within our documentation now display a tooltip when
    hovered (:issue:`4173`, :issue:`4183`)

*   Improved the documentation about :meth:`LinkExtractor.extract_links
    <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor.extract_links>` and
    simplified :ref:`topics-link-extractors` (:issue:`4045`)

*   Clarified how :class:`ItemLoader.item <scrapy.loader.ItemLoader.item>`
    works (:issue:`3574`, :issue:`4099`)

*   Clarified that :func:`logging.basicConfig` should not be used when also
    using :class:`~scrapy.crawler.CrawlerProcess` (:issue:`2149`,
    :issue:`2352`, :issue:`3146`, :issue:`3960`)

*   Clarified the requirements for :class:`~scrapy.Request` objects
    :ref:`when using persistence <request-serialization>` (:issue:`4124`,
    :issue:`4139`)

*   Clarified how to install a :ref:`custom image pipeline
    <media-pipeline-example>` (:issue:`4034`, :issue:`4252`)

*   Fixed the signatures of the ``file_path`` method in :ref:`media pipeline
    <topics-media-pipeline>` examples (:issue:`4290`)

*   Covered a backward-incompatible change in Scrapy 1.7.0 affecting custom
    :class:`scrapy.core.scheduler.Scheduler` subclasses (:issue:`4274`)

*   Improved the ``README.rst`` and ``CODE_OF_CONDUCT.md`` files
    (:issue:`4059`)

*   Documentation examples are now checked as part of our test suite and we
    have fixed some of the issues detected (:issue:`4142`, :issue:`4146`,
    :issue:`4171`, :issue:`4184`, :issue:`4190`)

*   Fixed logic issues, broken links and typos (:issue:`4247`, :issue:`4258`,
    :issue:`4282`, :issue:`4288`, :issue:`4305`, :issue:`4308`, :issue:`4323`,
    :issue:`4338`, :issue:`4359`, :issue:`4361`)

*   Improved consistency when referring to the ``__init__`` method of an object
    (:issue:`4086`, :issue:`4088`)

*   Fixed an inconsistency between code and output in :ref:`intro-overview`
    (:issue:`4213`)

*   Extended :mod:`~sphinx.ext.intersphinx` usage (:issue:`4147`,
    :issue:`4172`, :issue:`4185`, :issue:`4194`, :issue:`4197`)

*   We now use a recent version of Python to build the documentation
    (:issue:`4140`, :issue:`4249`)

*   Cleaned up documentation (:issue:`4143`, :issue:`4275`)

Quality assurance
~~~~~~~~~~~~~~~~~

*   Re-enabled proxy ``CONNECT`` tests (:issue:`2545`, :issue:`4114`)

*   Added Bandit_ security checks to our test suite (:issue:`4162`,
    :issue:`4181`)

*   Added Flake8_ style checks to our test suite and applied many of the
    corresponding changes (:issue:`3944`, :issue:`3945`, :issue:`4137`,
    :issue:`4157`, :issue:`4167`, :issue:`4174`, :issue:`4186`, :issue:`4195`,
    :issue:`4238`, :issue:`4246`, :issue:`4355`, :issue:`4360`, :issue:`4365`)

*   Improved test coverage (:issue:`4097`, :issue:`4218`, :issue:`4236`)

*   Started reporting slowest tests, and improved the performance of some of
    them (:issue:`4163`, :issue:`4164`)

*   Fixed broken tests and refactored some tests (:issue:`4014`, :issue:`4095`,
    :issue:`4244`, :issue:`4268`, :issue:`4372`)

*   Modified the :doc:`tox <tox:index>` configuration to allow running tests
    with any Python version, run Bandit_ and Flake8_ tests by default, and
    enforce a minimum tox version programmatically (:issue:`4179`)

*   Cleaned up code (:issue:`3937`, :issue:`4208`, :issue:`4209`,
    :issue:`4210`, :issue:`4212`, :issue:`4369`, :issue:`4376`, :issue:`4378`)

.. _Bandit: https://bandit.readthedocs.io/en/latest/
.. _Flake8: https://flake8.pycqa.org/en/latest/

.. _2-0-0-scheduler-queue-changes:

Changes to scheduler queue classes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following changes may impact custom queue classes of all types:

*   The ``push`` method no longer receives a second positional parameter
    containing ``request.priority * -1``. If you need that value, get it
    from the first positional parameter, ``request``, instead, or use
    the new :meth:`~scrapy.core.scheduler.ScrapyPriorityQueue.priority`
    method in :class:`scrapy.core.scheduler.ScrapyPriorityQueue`
    subclasses.

The following changes may impact custom priority queue classes:

*   In the ``__init__`` method or the ``from_crawler`` or ``from_settings``
    class methods:

    *   The parameter that used to contain a factory function,
        ``qfactory``, is now passed as a keyword parameter named
        ``downstream_queue_cls``.

    *   A new keyword parameter has been added: ``key``. It is a string that
        is always empty for memory queues and indicates the
        :setting:`JOB_DIR` value for disk queues.

    *   The parameter for disk queues that contains data from the previous
        crawl, ``startprios`` or ``slot_startprios``, is now passed as a
        keyword parameter named ``startprios``.

    *   The ``serialize`` parameter is no longer passed. The disk queue
        class must take care of request serialization on its own before
        writing to disk, using the
        :func:`~scrapy.utils.reqser.request_to_dict` and
        :func:`~scrapy.utils.reqser.request_from_dict` functions from the
        :mod:`scrapy.utils.reqser` module.

The following changes may impact custom disk and memory queue classes:

*   The signature of the ``__init__`` method is now
    ``__init__(self, crawler, key)``.
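
For illustration, a minimal sketch of a memory queue that follows the new
signature; the exact method set (``push``, ``pop``, ``close``, ``__len__``)
is an assumption modeled on Scrapy's built-in queues:

.. code-block:: python

    from collections import deque


    class SketchMemoryQueue:
        """A hypothetical memory queue following the new signature."""

        def __init__(self, crawler, key):
            self.crawler = crawler
            self.key = key  # always "" for memory queues
            self._requests = deque()

        def push(self, request):
            self._requests.append(request)

        def pop(self):
            return self._requests.popleft() if self._requests else None

        def close(self):
            pass

        def __len__(self):
            return len(self._requests)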

The following changes affect specifically the
:class:`~scrapy.core.scheduler.ScrapyPriorityQueue` and
:class:`~scrapy.core.scheduler.DownloaderAwarePriorityQueue` classes from
:mod:`scrapy.core.scheduler` and may affect subclasses:

*   In the ``__init__`` method, most of the changes described above apply.

    ``__init__`` may still receive all parameters as positional parameters,
    however:

    *   ``downstream_queue_cls``, which replaced ``qfactory``, must be
        instantiated differently.

        ``qfactory`` was instantiated with a priority value (integer).

        Instances of ``downstream_queue_cls`` should be created using
        the new
        :meth:`ScrapyPriorityQueue.qfactory <scrapy.core.scheduler.ScrapyPriorityQueue.qfactory>`
        or
        :meth:`DownloaderAwarePriorityQueue.pqfactory <scrapy.core.scheduler.DownloaderAwarePriorityQueue.pqfactory>`
        methods.

    *   The new ``key`` parameter displaced the ``startprios``
        parameter one position to the right.

*   The following class attributes have been added:

    *   :attr:`~scrapy.core.scheduler.ScrapyPriorityQueue.crawler`

    *   :attr:`~scrapy.core.scheduler.ScrapyPriorityQueue.downstream_queue_cls`
        (details above)

    *   :attr:`~scrapy.core.scheduler.ScrapyPriorityQueue.key` (details above)

*   The ``serialize`` attribute has been removed (details above)

The following changes affect specifically the
:class:`~scrapy.core.scheduler.ScrapyPriorityQueue` class and may affect
subclasses:

*   A new :meth:`~scrapy.core.scheduler.ScrapyPriorityQueue.priority`
    method has been added which, given a request, returns
    ``request.priority * -1``.

    It is used in :meth:`~scrapy.core.scheduler.ScrapyPriorityQueue.push`
    to make up for the removal of its ``priority`` parameter.

*   The ``spider`` attribute has been removed. Use
    :attr:`crawler.spider <scrapy.core.scheduler.ScrapyPriorityQueue.crawler>`
    instead.

The following changes affect specifically the
:class:`~scrapy.core.scheduler.DownloaderAwarePriorityQueue` class and may
affect subclasses:

*   A new :attr:`~scrapy.core.scheduler.DownloaderAwarePriorityQueue.pqueues`
    attribute offers a mapping of downloader slot names to the
    corresponding instances of
    :attr:`~scrapy.core.scheduler.DownloaderAwarePriorityQueue.downstream_queue_cls`.

(:issue:`3884`)

.. _release-1.8.4:

Scrapy 1.8.4 (2024-02-14)
-------------------------

**Security bug fixes:**

-   Due to its `ReDoS vulnerabilities`_, ``scrapy.utils.iterators.xmliter`` is
    now deprecated in favor of :func:`~scrapy.utils.iterators.xmliter_lxml`,
    which :class:`~scrapy.spiders.XMLFeedSpider` now uses.

    To minimize the impact of this change on existing code,
    :func:`~scrapy.utils.iterators.xmliter_lxml` now supports indicating
    the node namespace as a prefix in the node name, and handles big files
    with highly nested trees when using libxml2 2.7+.

    Please see the `cc65-xxvf-f7r9 security advisory`_ for more information.

-   :setting:`DOWNLOAD_MAXSIZE` and :setting:`DOWNLOAD_WARNSIZE` now also apply
    to the decompressed response body. Please see the `7j7m-v7m3-jqm7 security
    advisory`_ for more information.

-   Also in relation to the `7j7m-v7m3-jqm7 security advisory`_, use of the
    ``scrapy.downloadermiddlewares.decompression`` module is now discouraged
    and triggers a warning.

-   The ``Authorization`` header is now dropped on redirects to a different
    domain. Please see the `cw9j-q3vf-hrrv security advisory`_ for more
    information.

    .. _cw9j-q3vf-hrrv security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-cw9j-q3vf-hrrv

.. _release-1.8.3:

Scrapy 1.8.3 (2022-07-25)
-------------------------

**Security bug fix:**

-   When :class:`~scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware`
    processes a request with :reqmeta:`proxy` metadata, and that
    :reqmeta:`proxy` metadata includes proxy credentials,
    :class:`~scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware` sets
    the ``Proxy-Authorization`` header, but only if that header is not already
    set.

    There are third-party proxy-rotation downloader middlewares that set
    different :reqmeta:`proxy` metadata every time they process a request.

    Because of request retries and redirects, the same request can be processed
    by downloader middlewares more than once, including both
    :class:`~scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware` and
    any third-party proxy-rotation downloader middleware.

    These third-party proxy-rotation downloader middlewares could change the
    :reqmeta:`proxy` metadata of a request to a new value, but fail to remove
    the ``Proxy-Authorization`` header from the previous value of the
    :reqmeta:`proxy` metadata, causing the credentials of one proxy to be sent
    to a different proxy.

    To prevent the unintended leaking of proxy credentials, the behavior of
    :class:`~scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware` is now
    as follows when processing a request:

    -   If the request being processed defines :reqmeta:`proxy` metadata that
        includes credentials, the ``Proxy-Authorization`` header is always
        updated to feature those credentials.

    -   If the request being processed defines :reqmeta:`proxy` metadata
        without credentials, the ``Proxy-Authorization`` header is removed
        *unless* it was originally defined for the same proxy URL.

        To remove proxy credentials while keeping the same proxy URL, remove
        the ``Proxy-Authorization`` header.

    -   If the request has no :reqmeta:`proxy` metadata, or that metadata is a
        falsy value (e.g. ``None``), the ``Proxy-Authorization`` header is
        removed.

        It is no longer possible to set a proxy URL through the
        :reqmeta:`proxy` metadata but set the credentials through the
        ``Proxy-Authorization`` header. Set proxy credentials through the
        :reqmeta:`proxy` metadata instead.
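
    For third-party proxy-rotation middlewares, a sketch of the safe pattern
    (``Headers.pop`` availability is an assumption; the proxy URL is made up):

    .. code-block:: python

        # In a hypothetical proxy-rotation downloader middleware:
        def process_request(self, request, spider):
            request.meta["proxy"] = "https://proxy2.example.com:8080"
            # Drop credentials tied to the previous proxy, if any
            request.headers.pop(b"Proxy-Authorization", None)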

.. _release-1.8.2:

Scrapy 1.8.2 (2022-03-01)
-------------------------

**Security bug fixes:**

-   When a :class:`~scrapy.Request` object with cookies defined gets a
    redirect response causing a new :class:`~scrapy.Request` object to be
    scheduled, the cookies defined in the original
    :class:`~scrapy.Request` object are no longer copied into the new
    :class:`~scrapy.Request` object.

    If you manually set the ``Cookie`` header on a
    :class:`~scrapy.Request` object and the domain name of the redirect
    URL is not an exact match for the domain of the URL of the original
    :class:`~scrapy.Request` object, your ``Cookie`` header is now dropped
    from the new :class:`~scrapy.Request` object.

    The old behavior could be exploited by an attacker to gain access to your
    cookies. Please see the `cjvr-mfj7-j4j8 security advisory`_ for more
    information.

    .. _cjvr-mfj7-j4j8 security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-cjvr-mfj7-j4j8

    .. note:: It is still possible to enable the sharing of cookies between
              different domains with a shared domain suffix (e.g.
              ``example.com`` and any subdomain) by defining the shared domain
              suffix (e.g. ``example.com``) as the cookie domain when defining
              your cookies. See the documentation of the
              :class:`~scrapy.Request` class for more information.

-   When the domain of a cookie, either received in the ``Set-Cookie`` header
    of a response or defined in a :class:`~scrapy.Request` object, is set
    to a `public suffix <https://publicsuffix.org/>`_, the cookie is now
    ignored unless the cookie domain is the same as the request domain.

    The old behavior could be exploited by an attacker to inject cookies into
    your requests to some other domains. Please see the `mfjm-vh54-3f96
    security advisory`_ for more information.

    .. _mfjm-vh54-3f96 security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-mfjm-vh54-3f96

.. _release-1.8.1:

Scrapy 1.8.1 (2021-10-05)
-------------------------

*   **Security bug fix:**

    If you use
    :class:`~scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware`
    (i.e. the ``http_user`` and ``http_pass`` spider attributes) for HTTP
    authentication, any request exposes your credentials to the request target.

    To prevent the exposure of authentication credentials to unintended
    domains, you must now also set a new spider attribute,
    ``http_auth_domain``, and point it to the specific domain to which the
    authentication credentials must be sent.

    If the ``http_auth_domain`` spider attribute is not set, the domain of the
    first request will be considered the HTTP authentication target, and
    authentication credentials will only be sent in requests targeting that
    domain.

    If you need to send the same HTTP authentication credentials to multiple
    domains, you can use :func:`w3lib.http.basic_auth_header` instead to
    set the value of the ``Authorization`` header of your requests.

    If you *really* want your spider to send the same HTTP authentication
    credentials to any domain, set the ``http_auth_domain`` spider attribute
    to ``None``.
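
    For example, a minimal sketch (domain and credentials are made up):

    .. code-block:: python

        import scrapy


        class SingleDomainAuthSpider(scrapy.Spider):
            name = "auth-example"
            http_user = "user"
            http_pass = "secret"
            http_auth_domain = "api.example.com"  # credentials sent only here
            start_urls = ["https://api.example.com/"]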

    Finally, if you are a user of `scrapy-splash`_, know that this version of
    Scrapy breaks compatibility with scrapy-splash 0.7.2 and earlier. You will
    need to upgrade scrapy-splash to a newer version for it to continue
    working.

.. _scrapy-splash: https://github.com/scrapy-plugins/scrapy-splash

.. _release-1.8.0:

Scrapy 1.8.0 (2019-10-28)
-------------------------

Highlights:

* Dropped Python 3.4 support and updated minimum requirements; made Python 3.8
  support official
* New :meth:`.Request.from_curl` class method
* New :setting:`ROBOTSTXT_PARSER` and :setting:`ROBOTSTXT_USER_AGENT` settings
* New :setting:`DOWNLOADER_CLIENT_TLS_CIPHERS` and
  :setting:`DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING` settings

Backward-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. skip: start

*   Python 3.4 is no longer supported, and some of the minimum requirements of
    Scrapy have also changed:

    *   :doc:`cssselect <cssselect:index>` 0.9.1
    *   cryptography_ 2.0
    *   lxml_ 3.5.0
    *   pyOpenSSL_ 16.2.0
    *   queuelib_ 1.4.2
    *   service_identity_ 16.0.0
    *   six_ 1.10.0
    *   Twisted_ 17.9.0 (16.0.0 with Python 2)
    *   zope.interface_ 4.1.3

    (:issue:`3892`)

*   ``JSONRequest`` is now called :class:`~scrapy.http.JsonRequest` for
    consistency with similar classes (:issue:`3929`, :issue:`3982`)

*   If you are using a custom context factory
    (``DOWNLOADER_CLIENTCONTEXTFACTORY``), its ``__init__`` method must
    accept two new parameters: ``tls_verbose_logging`` and ``tls_ciphers``
    (:issue:`2111`, :issue:`3392`, :issue:`3442`, :issue:`3450`)

*   :class:`~scrapy.loader.ItemLoader` now turns the values of its input item
    into lists:

    .. code-block:: pycon

        >>> item = MyItem()
        >>> item["field"] = "value1"
        >>> loader = ItemLoader(item=item)
        >>> item["field"]
        ['value1']

    This is needed to allow adding values to existing fields
    (``loader.add_value('field', 'value2')``).

    (:issue:`3804`, :issue:`3819`, :issue:`3897`, :issue:`3976`, :issue:`3998`,
    :issue:`4036`)

.. skip: end

See also :ref:`1.8-deprecation-removals` below.

New features
~~~~~~~~~~~~

*   A new :meth:`Request.from_curl <scrapy.Request.from_curl>` class
    method allows :ref:`creating a request from a cURL command
    <requests-from-curl>` (:issue:`2985`, :issue:`3862`)
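
    For instance, a sketch (the cURL command is illustrative):

    .. code-block:: python

        import scrapy

        request = scrapy.Request.from_curl(
            "curl 'https://example.org/post' -X POST --data 'foo=bar'"
        )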

*   A new :setting:`ROBOTSTXT_PARSER` setting allows choosing which robots.txt_
    parser to use. It includes built-in support for
    :ref:`RobotFileParser <python-robotfileparser>`,
    :ref:`Protego <protego-parser>` (default), Reppy, and
    :ref:`Robotexclusionrulesparser <rerp-parser>`, and allows you to
    :ref:`implement support for additional parsers
    <support-for-new-robots-parser>` (:issue:`754`, :issue:`2669`,
    :issue:`3796`, :issue:`3935`, :issue:`3969`, :issue:`4006`)

*   A new :setting:`ROBOTSTXT_USER_AGENT` setting allows defining a separate
    user agent string to use for robots.txt_ parsing (:issue:`3931`,
    :issue:`3966`)

*   :class:`~scrapy.spiders.Rule` no longer requires a :class:`LinkExtractor
    <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>` parameter
    (:issue:`781`, :issue:`4016`)

*   Use the new :setting:`DOWNLOADER_CLIENT_TLS_CIPHERS` setting to customize
    the TLS/SSL ciphers used by the default HTTP/1.1 downloader (:issue:`3392`,
    :issue:`3442`)

*   Set the new :setting:`DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING` setting to
    ``True`` to enable debug-level messages about TLS connection parameters
    after establishing HTTPS connections (:issue:`2111`, :issue:`3450`)
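
    A sketch combining both settings (the cipher string is illustrative and
    follows the OpenSSL cipher list format):

    .. code-block:: python

        # settings.py
        DOWNLOADER_CLIENT_TLS_CIPHERS = "DEFAULT:!DH"
        DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING = True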

*   Callbacks that receive keyword arguments (see :attr:`.Request.cb_kwargs`)
    can now be tested using the new :class:`@cb_kwargs
    <scrapy.contracts.default.CallbackKeywordArgumentsContract>`
    :ref:`spider contract <topics-contracts>` (:issue:`3985`, :issue:`3988`)
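
    A sketch of such a contract in a callback docstring (URL and keyword
    arguments are made up):

    .. code-block:: python

        def parse(self, response, foo):
            """
            @url https://example.com
            @cb_kwargs {"foo": "bar"}
            """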

*   When a :class:`@scrapes <scrapy.contracts.default.ScrapesContract>` spider
    contract fails, all missing fields are now reported (:issue:`766`,
    :issue:`3939`)

*   :ref:`Custom log formats <custom-log-formats>` can now drop messages by
    having the corresponding methods of the configured :setting:`LOG_FORMATTER`
    return ``None`` (:issue:`3984`, :issue:`3987`)

*   A much improved completion definition is now available for Zsh_
    (:issue:`4069`)

Bug fixes
~~~~~~~~~

*   :meth:`ItemLoader.load_item() <scrapy.loader.ItemLoader.load_item>` no
    longer makes later calls to :meth:`ItemLoader.get_output_value()
    <scrapy.loader.ItemLoader.get_output_value>` or
    :meth:`ItemLoader.load_item() <scrapy.loader.ItemLoader.load_item>` return
    empty data (:issue:`3804`, :issue:`3819`, :issue:`3897`, :issue:`3976`,
    :issue:`3998`, :issue:`4036`)

*   Fixed :class:`~scrapy.statscollectors.DummyStatsCollector` raising a
    :exc:`TypeError` exception (:issue:`4007`, :issue:`4052`)

*   :meth:`FilesPipeline.file_path
    <scrapy.pipelines.files.FilesPipeline.file_path>` and
    :meth:`ImagesPipeline.file_path
    <scrapy.pipelines.images.ImagesPipeline.file_path>` no longer choose
    file extensions that are not `registered with IANA`_ (:issue:`1287`,
    :issue:`3953`, :issue:`3954`)

*   When using botocore_ to persist files in S3, all botocore-supported headers
    are now properly mapped (:issue:`3904`, :issue:`3905`)

*   FTP passwords in :setting:`FEED_URI` containing percent-escaped characters
    are now properly decoded (:issue:`3941`)

*   A memory-handling and error-handling issue in
    :func:`scrapy.utils.ssl.get_temp_key_info` has been fixed (:issue:`3920`)

Documentation
~~~~~~~~~~~~~

*   The documentation now covers how to define and configure a :ref:`custom log
    format <custom-log-formats>` (:issue:`3616`, :issue:`3660`)

*   API documentation added for :class:`~scrapy.exporters.MarshalItemExporter`
    and :class:`~scrapy.exporters.PythonItemExporter` (:issue:`3973`)

*   API documentation added for :class:`~scrapy.item.BaseItem` and
    :class:`~scrapy.item.ItemMeta` (:issue:`3999`)

*   Minor documentation fixes (:issue:`2998`, :issue:`3398`, :issue:`3597`,
    :issue:`3894`, :issue:`3934`, :issue:`3978`, :issue:`3993`, :issue:`4022`,
    :issue:`4028`, :issue:`4033`, :issue:`4046`, :issue:`4050`, :issue:`4055`,
    :issue:`4056`, :issue:`4061`, :issue:`4072`, :issue:`4071`, :issue:`4079`,
    :issue:`4081`, :issue:`4089`, :issue:`4093`)

.. _1.8-deprecation-removals:

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

*   ``scrapy.xlib`` has been removed (:issue:`4015`)

.. _1.8-deprecations:

Deprecations
~~~~~~~~~~~~

*   The LevelDB_ storage backend
    (``scrapy.extensions.httpcache.LeveldbCacheStorage``) of
    :class:`~scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware` is
    deprecated (:issue:`4085`, :issue:`4092`)

*   Use of the undocumented ``SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE`` environment
    variable is deprecated (:issue:`3910`)

*   ``scrapy.item.DictItem`` is deprecated, use :class:`~scrapy.item.Item`
    instead (:issue:`3999`)

Other changes
~~~~~~~~~~~~~

*   Minimum versions of optional Scrapy requirements that are covered by
    continuous integration tests have been updated:

    *   botocore_ 1.3.23
    *   Pillow_ 3.4.2

    Lower versions of these optional requirements may work, but it is not
    guaranteed (:issue:`3892`)

*   Added GitHub templates for bug reports and feature requests (:issue:`3126`,
    :issue:`3471`, :issue:`3749`, :issue:`3754`)

*   Continuous integration fixes (:issue:`3923`)

*   Code cleanup (:issue:`3391`, :issue:`3907`, :issue:`3946`, :issue:`3950`,
    :issue:`4023`, :issue:`4031`)

.. _release-1.7.4:

Scrapy 1.7.4 (2019-10-21)
-------------------------

Revert the fix for :issue:`3804` (:issue:`3819`), which has a few undesired
side effects (:issue:`3897`, :issue:`3976`).

As a result, when an item loader is initialized with an item,
:meth:`ItemLoader.load_item() <scrapy.loader.ItemLoader.load_item>` once again
makes later calls to :meth:`ItemLoader.get_output_value()
<scrapy.loader.ItemLoader.get_output_value>` or :meth:`ItemLoader.load_item()
<scrapy.loader.ItemLoader.load_item>` return empty data.

.. _release-1.7.3:

Scrapy 1.7.3 (2019-08-01)
-------------------------

Enforce lxml 4.3.5 or lower for Python 3.4 (:issue:`3912`, :issue:`3918`).

.. _release-1.7.2:

Scrapy 1.7.2 (2019-07-23)
-------------------------

Fix Python 2 support (:issue:`3889`, :issue:`3893`, :issue:`3896`).

.. _release-1.7.1:

Scrapy 1.7.1 (2019-07-18)
-------------------------

Re-packaging of Scrapy 1.7.0, whose PyPI package was missing some of the
intended changes.

.. _release-1.7.0:

Scrapy 1.7.0 (2019-07-18)
-------------------------

.. note:: Make sure you install Scrapy 1.7.1. The Scrapy 1.7.0 package in PyPI
          is the result of an erroneous commit tagging and does not include all
          the changes described below.

Highlights:

* Improvements for crawls targeting multiple domains
* A cleaner way to pass arguments to callbacks
* A new class for JSON requests
* Improvements for rule-based spiders
* New features for feed exports

Backward-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*   ``429`` is now part of the :setting:`RETRY_HTTP_CODES` setting by default

    This change is **backward incompatible**. If you don’t want to retry
    ``429``, you must override :setting:`RETRY_HTTP_CODES` accordingly.

*   :class:`~scrapy.crawler.Crawler`,
    :class:`CrawlerRunner.crawl <scrapy.crawler.CrawlerRunner.crawl>` and
    :class:`CrawlerRunner.create_crawler <scrapy.crawler.CrawlerRunner.create_crawler>`
    no longer accept a :class:`~scrapy.spiders.Spider` subclass instance, they
    only accept a :class:`~scrapy.spiders.Spider` subclass now.

    :class:`~scrapy.spiders.Spider` subclass instances were never meant to
    work, and they were not working as one would expect: instead of using the
    passed :class:`~scrapy.spiders.Spider` subclass instance, their
    :class:`~scrapy.spiders.Spider.from_crawler` method was called to generate
    a new instance.

*   Non-default values for the :setting:`SCHEDULER_PRIORITY_QUEUE` setting
    may stop working. Scheduler priority queue classes now need to handle
    :class:`~scrapy.Request` objects instead of arbitrary Python data
    structures.

*   An additional ``crawler`` parameter has been added to the ``__init__``
    method of the :class:`~scrapy.core.scheduler.Scheduler` class. Custom
    scheduler subclasses which don't accept arbitrary parameters in their
    ``__init__`` method might break because of this change.

    For more information, see :setting:`SCHEDULER`.

See also :ref:`1.7-deprecation-removals` below.

New features
~~~~~~~~~~~~

*   A new scheduler priority queue,
    ``scrapy.pqueues.DownloaderAwarePriorityQueue``, may be
    :ref:`enabled <broad-crawls-scheduler-priority-queue>` for a significant
    scheduling improvement on crawls targeting multiple web domains, at the
    cost of no :setting:`CONCURRENT_REQUESTS_PER_IP` support (:issue:`3520`)

*   A new :attr:`.Request.cb_kwargs` attribute
    provides a cleaner way to pass keyword arguments to callback methods
    (:issue:`1138`, :issue:`3563`)
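
    A minimal sketch (the URL and the ``item_id`` argument are made up; the
    methods belong to a spider):

    .. code-block:: python

        def parse(self, response):
            yield scrapy.Request(
                "https://example.com/details/42",
                callback=self.parse_details,
                cb_kwargs={"item_id": 42},
            )

        def parse_details(self, response, item_id):
            yield {"id": item_id, "title": response.css("h1::text").get()}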

*   A new :class:`JSONRequest <scrapy.http.JsonRequest>` class offers a more
    convenient way to build JSON requests (:issue:`3504`, :issue:`3505`)
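
    For example, a sketch (URL and payload are made up):

    .. code-block:: python

        import scrapy
        from scrapy.http import JsonRequest


        class ApiSpider(scrapy.Spider):
            name = "api-example"

            def start_requests(self):
                # ``data`` is serialized as the JSON body of the request
                yield JsonRequest("https://example.com/api", data={"query": "shoes"})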

*   A ``process_request`` callback passed to the :class:`~scrapy.spiders.Rule`
    ``__init__`` method now receives the :class:`~scrapy.http.Response` object that
    originated the request as its second argument (:issue:`3682`)

*   A new ``restrict_text`` parameter for the
    :attr:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
    ``__init__`` method allows filtering links by their link text
    (:issue:`3622`, :issue:`3635`)

*   A new :setting:`FEED_STORAGE_S3_ACL` setting allows defining a custom ACL
    for feeds exported to Amazon S3 (:issue:`3607`)

*   A new :setting:`FEED_STORAGE_FTP_ACTIVE` setting allows using FTP’s active
    connection mode for feeds exported to FTP servers (:issue:`3829`)

*   A new :setting:`METAREFRESH_IGNORE_TAGS` setting allows overriding which
    HTML tags are ignored when searching a response for HTML meta tags that
    trigger a redirect (:issue:`1422`, :issue:`3768`)

*   A new :reqmeta:`redirect_reasons` request meta key exposes the reason
    (status code, meta refresh) behind every followed redirect (:issue:`3581`,
    :issue:`3687`)

*   The ``SCRAPY_CHECK`` variable is now set to the ``true`` string during runs
    of the :command:`check` command, which allows :ref:`detecting contract
    check runs from code <detecting-contract-check-runs>` (:issue:`3704`,
    :issue:`3739`)

*   A new :meth:`Item.deepcopy() <scrapy.item.Item.deepcopy>` method makes it
    easier to :ref:`deep-copy items <copying-items>` (:issue:`1493`,
    :issue:`3671`)

*   :class:`~scrapy.extensions.corestats.CoreStats` also logs
    ``elapsed_time_seconds`` now (:issue:`3638`)

*   Exceptions from :class:`~scrapy.loader.ItemLoader` :ref:`input and output
    processors <topics-loaders-processors>` are now more verbose
    (:issue:`3836`, :issue:`3840`)

*   :class:`~scrapy.crawler.Crawler`,
    :class:`CrawlerRunner.crawl <scrapy.crawler.CrawlerRunner.crawl>` and
    :class:`CrawlerRunner.create_crawler <scrapy.crawler.CrawlerRunner.create_crawler>`
    now fail gracefully if they receive a :class:`~scrapy.spiders.Spider`
    subclass instance instead of the subclass itself (:issue:`2283`,
    :issue:`3610`, :issue:`3872`)

Bug fixes
~~~~~~~~~

*   :meth:`~scrapy.spidermiddlewares.SpiderMiddleware.process_spider_exception`
    is now also invoked for generators (:issue:`220`, :issue:`2061`)

*   System exceptions like KeyboardInterrupt_ are no longer caught
    (:issue:`3726`)

*   :meth:`ItemLoader.load_item() <scrapy.loader.ItemLoader.load_item>` no
    longer makes later calls to :meth:`ItemLoader.get_output_value()
    <scrapy.loader.ItemLoader.get_output_value>` or
    :meth:`ItemLoader.load_item() <scrapy.loader.ItemLoader.load_item>` return
    empty data (:issue:`3804`, :issue:`3819`)

*   The images pipeline (:class:`~scrapy.pipelines.images.ImagesPipeline`) no
    longer ignores these Amazon S3 settings: :setting:`AWS_ENDPOINT_URL`,
    :setting:`AWS_REGION_NAME`, :setting:`AWS_USE_SSL`, :setting:`AWS_VERIFY`
    (:issue:`3625`)

*   Fixed a memory leak in ``scrapy.pipelines.media.MediaPipeline`` affecting,
    for example, non-200 responses and exceptions from custom middlewares
    (:issue:`3813`)

*   Requests with private callbacks are now correctly deserialized from disk
    (:issue:`3790`)

*   :meth:`.FormRequest.from_response`
    now handles invalid methods the way major web browsers do (:issue:`3777`,
    :issue:`3794`)

Documentation
~~~~~~~~~~~~~

*   A new topic, :ref:`topics-dynamic-content`, covers recommended approaches
    to read dynamically-loaded data (:issue:`3703`)

*   :ref:`topics-broad-crawls` now features information about memory usage
    (:issue:`1264`, :issue:`3866`)

*   The documentation of :class:`~scrapy.spiders.Rule` now covers how to access
    the text of a link when using :class:`~scrapy.spiders.CrawlSpider`
    (:issue:`3711`, :issue:`3712`)

*   A new section, :ref:`httpcache-storage-custom`, covers writing a custom
    cache storage backend for
    :class:`~scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware`
    (:issue:`3683`, :issue:`3692`)

*   A new :ref:`FAQ <faq>` entry, :ref:`faq-split-item`, explains what to do
    when you want to split an item into multiple items from an item pipeline
    (:issue:`2240`, :issue:`3672`)

*   Updated the :ref:`FAQ entry about crawl order <faq-bfo-dfo>` to explain why
    the first few requests rarely follow the desired order (:issue:`1739`,
    :issue:`3621`)

*   The :setting:`LOGSTATS_INTERVAL` setting (:issue:`3730`), the
    :meth:`FilesPipeline.file_path <scrapy.pipelines.files.FilesPipeline.file_path>`
    and
    :meth:`ImagesPipeline.file_path <scrapy.pipelines.images.ImagesPipeline.file_path>`
    methods (:issue:`2253`, :issue:`3609`) and the
    :meth:`Crawler.stop() <scrapy.crawler.Crawler.stop>` method (:issue:`3842`)
    are now documented

*   Some parts of the documentation that were confusing or misleading are now
    clearer (:issue:`1347`, :issue:`1789`, :issue:`2289`, :issue:`3069`,
    :issue:`3615`, :issue:`3626`, :issue:`3668`, :issue:`3670`, :issue:`3673`,
    :issue:`3728`, :issue:`3762`, :issue:`3861`, :issue:`3882`)

*   Minor documentation fixes (:issue:`3648`, :issue:`3649`, :issue:`3662`,
    :issue:`3674`, :issue:`3676`, :issue:`3694`, :issue:`3724`, :issue:`3764`,
    :issue:`3767`, :issue:`3791`, :issue:`3797`, :issue:`3806`, :issue:`3812`)

.. _1.7-deprecation-removals:

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

The following deprecated APIs have been removed (:issue:`3578`):

*   ``scrapy.conf`` (use :attr:`Crawler.settings
    <scrapy.crawler.Crawler.settings>`)

*   From ``scrapy.core.downloader.handlers``:

    *   ``http.HttpDownloadHandler`` (use ``http10.HTTP10DownloadHandler``)

*   ``scrapy.loader.ItemLoader._get_values`` (use ``_get_xpathvalues``)

*   ``scrapy.loader.XPathItemLoader`` (use :class:`~scrapy.loader.ItemLoader`)

*   ``scrapy.log`` (see :ref:`topics-logging`)

*   From ``scrapy.pipelines``:

    *   ``files.FilesPipeline.file_key`` (use ``file_path``)

    *   ``images.ImagesPipeline.file_key`` (use ``file_path``)

    *   ``images.ImagesPipeline.image_key`` (use ``file_path``)

    *   ``images.ImagesPipeline.thumb_key`` (use ``thumb_path``)

*   From both ``scrapy.selector`` and ``scrapy.selector.lxmlsel``:

    *   ``HtmlXPathSelector`` (use :class:`~scrapy.Selector`)

    *   ``XmlXPathSelector`` (use :class:`~scrapy.Selector`)

    *   ``XPathSelector`` (use :class:`~scrapy.Selector`)

    *   ``XPathSelectorList`` (use :class:`~scrapy.Selector`)

*   From ``scrapy.selector.csstranslator``:

    *   ``ScrapyGenericTranslator`` (use parsel.csstranslator.GenericTranslator_)

    *   ``ScrapyHTMLTranslator`` (use parsel.csstranslator.HTMLTranslator_)

    *   ``ScrapyXPathExpr`` (use parsel.csstranslator.XPathExpr_)

*   From :class:`~scrapy.Selector`:

    *   ``_root`` (both the ``__init__`` method argument and the object property, use
        ``root``)

    *   ``extract_unquoted`` (use ``getall``)

    *   ``select`` (use ``xpath``)

*   From :class:`~scrapy.selector.SelectorList`:

    *   ``extract_unquoted`` (use ``getall``)

    *   ``select`` (use ``xpath``)

    *   ``x`` (use ``xpath``)

*   ``scrapy.spiders.BaseSpider`` (use :class:`~scrapy.spiders.Spider`)

*   From :class:`~scrapy.spiders.Spider` (and subclasses):

    *   ``DOWNLOAD_DELAY`` (use :ref:`download_delay
        <spider-download_delay-attribute>`)

    *   ``set_crawler`` (use :meth:`~scrapy.spiders.Spider.from_crawler`)

*   ``scrapy.spiders.spiders`` (use :class:`~scrapy.spiderloader.SpiderLoader`)

*   ``scrapy.telnet`` (use :mod:`scrapy.extensions.telnet`)

*   From ``scrapy.utils.python``:

    *   ``str_to_unicode`` (use ``to_unicode``)

    *   ``unicode_to_str`` (use ``to_bytes``)

*   ``scrapy.utils.response.body_or_str``

The following deprecated settings have also been removed (:issue:`3578`):

*   ``SPIDER_MANAGER_CLASS`` (use :setting:`SPIDER_LOADER_CLASS`)

.. _1.7-deprecations:

Deprecations
~~~~~~~~~~~~

*   The ``queuelib.PriorityQueue`` value for the
    :setting:`SCHEDULER_PRIORITY_QUEUE` setting is deprecated. Use
    ``scrapy.pqueues.ScrapyPriorityQueue`` instead.
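
    For example:

    .. code-block:: python

        # settings.py
        SCHEDULER_PRIORITY_QUEUE = "scrapy.pqueues.ScrapyPriorityQueue"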

*   ``process_request`` callbacks passed to :class:`~scrapy.spiders.Rule` that
    do not accept two arguments are deprecated.

*   The following modules are deprecated:

    *   ``scrapy.utils.http`` (use `w3lib.http`_)

    *   ``scrapy.utils.markup`` (use `w3lib.html`_)

    *   ``scrapy.utils.multipart`` (use `urllib3`_)

*   The ``scrapy.utils.datatypes.MergeDict`` class is deprecated for Python 3
    code bases. Use :class:`~collections.ChainMap` instead. (:issue:`3878`)

*   The ``scrapy.utils.gz.is_gzipped`` function is deprecated. Use
    ``scrapy.utils.gz.gzip_magic_number`` instead.

.. _urllib3: https://urllib3.readthedocs.io/en/latest/index.html
.. _w3lib.html: https://w3lib.readthedocs.io/en/latest/w3lib.html#module-w3lib.html
.. _w3lib.http: https://w3lib.readthedocs.io/en/latest/w3lib.html#module-w3lib.http

Other changes
~~~~~~~~~~~~~

*   It is now possible to run all tests from the same tox_ environment in
    parallel; the documentation now covers :ref:`this and other ways to run
    tests <running-tests>` (:issue:`3707`)

*   It is now possible to generate an API documentation coverage report
    (:issue:`3806`, :issue:`3810`, :issue:`3860`)

*   The :ref:`documentation policies <documentation-policies>` now require
    docstrings_ (:issue:`3701`) that follow `PEP 257`_ (:issue:`3748`)

*   Internal fixes and cleanup (:issue:`3629`, :issue:`3643`, :issue:`3684`,
    :issue:`3698`, :issue:`3734`, :issue:`3735`, :issue:`3736`, :issue:`3737`,
    :issue:`3809`, :issue:`3821`, :issue:`3825`, :issue:`3827`, :issue:`3833`,
    :issue:`3857`, :issue:`3877`)

.. _release-1.6.0:

Scrapy 1.6.0 (2019-01-30)
-------------------------

Highlights:

* better Windows support;
* Python 3.7 compatibility;
* big documentation improvements, including a switch
  from ``.extract_first()`` + ``.extract()`` API to ``.get()`` + ``.getall()``
  API;
* feed exports, FilePipeline and MediaPipeline improvements;
* better extensibility: :signal:`item_error` and
  :signal:`request_reached_downloader` signals; ``from_crawler`` support
  for feed exporters, feed storages and dupefilters.
* ``scrapy.contracts`` fixes and new features;
* telnet console security improvements, first released as a
  backport in :ref:`release-1.5.2`;
* clean-up of the deprecated code;
* various bug fixes, small new features and usability improvements across
  the codebase.

Selector API changes
~~~~~~~~~~~~~~~~~~~~

While these are not changes in Scrapy itself, but rather in the parsel_
library that Scrapy uses for XPath/CSS selectors, they are worth
mentioning here. Scrapy now depends on parsel >= 1.5, and the
Scrapy documentation has been updated to follow recent ``parsel`` API
conventions.

The most visible change is that the ``.get()`` and ``.getall()`` selector
methods are now preferred over ``.extract_first()`` and ``.extract()``.
We feel that these new methods result in more concise and readable code.
See :ref:`old-extraction-api` for more details.

.. note::
    There are currently **no plans** to deprecate ``.extract()``
    and ``.extract_first()`` methods.

Another useful new feature is the introduction of ``Selector.attrib`` and
``SelectorList.attrib`` properties, which make it easier to get
attributes of HTML elements. See :ref:`selecting-attributes`.

CSS selectors are cached in parsel >= 1.5, which makes them faster
when the same CSS path is used many times. This is very common in the
case of Scrapy spiders: callbacks are usually called several times,
on different pages.

If you're using custom ``Selector`` or ``SelectorList`` subclasses,
a **backward incompatible** change in parsel may affect your code.
See `parsel changelog`_ for a detailed description, as well as for the
full list of improvements.

.. _parsel changelog: https://parsel.readthedocs.io/en/latest/history.html

Telnet console
~~~~~~~~~~~~~~

**Backward incompatible**: Scrapy's telnet console now requires username
and password. See :ref:`topics-telnetconsole` for more details. This change
fixes a **security issue**; see :ref:`release-1.5.2` release notes for details.

New extensibility features
~~~~~~~~~~~~~~~~~~~~~~~~~~

* ``from_crawler`` support is added to feed exporters and feed storages. This,
  among other things, makes it possible to access Scrapy settings from custom
  feed storages and exporters (:issue:`1605`, :issue:`3348`).
* ``from_crawler`` support is added to dupefilters (:issue:`2956`); this makes
  it possible to access, for example, settings or the spider from a dupefilter.
* :signal:`item_error` is fired when an error happens in a pipeline
  (:issue:`3256`);
* :signal:`request_reached_downloader` is fired when Downloader gets
  a new Request; this signal can be useful e.g. for custom Schedulers
  (:issue:`3393`).
* new SitemapSpider :meth:`~.SitemapSpider.sitemap_filter` method, which
  allows SitemapSpider subclasses to select sitemap entries based on their
  attributes (:issue:`3512`).
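
  For instance, a sketch that keeps only entries modified in 2019 (entry
  fields follow the sitemap protocol; the URL and date check are
  illustrative):

  .. code-block:: python

      from scrapy.spiders import SitemapSpider


      class FilteredSitemapSpider(SitemapSpider):
          name = "sitemap-example"
          sitemap_urls = ["https://example.com/sitemap.xml"]

          def sitemap_filter(self, entries):
              for entry in entries:
                  if entry.get("lastmod", "").startswith("2019"):
                      yield entry
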
* Lazy loading of Downloader Handlers is now optional; this enables better
  initialization error handling in custom Downloader Handlers (:issue:`3394`).

New FilePipeline and MediaPipeline features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Expose more options for S3FilesStore: :setting:`AWS_ENDPOINT_URL`,
  :setting:`AWS_USE_SSL`, :setting:`AWS_VERIFY`, :setting:`AWS_REGION_NAME`.
  For example, this makes it possible to use alternative or self-hosted
  AWS-compatible providers (:issue:`2609`, :issue:`3548`).
* ACL support for Google Cloud Storage: :setting:`FILES_STORE_GCS_ACL` and
  :setting:`IMAGES_STORE_GCS_ACL` (:issue:`3199`).

``scrapy.contracts`` improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Exceptions in contracts code are handled better (:issue:`3377`);
* ``dont_filter=True`` is used for contract requests, which makes it possible
  to test different callbacks with the same URL (:issue:`3381`);
* the ``request_cls`` attribute in Contract subclasses makes it possible to
  use different Request classes in contracts, for example FormRequest
  (:issue:`3383`).
* Fixed errback handling in contracts, e.g. for cases where a contract
  is executed for a URL that returns a non-200 response (:issue:`3371`).

Usability improvements
~~~~~~~~~~~~~~~~~~~~~~

* more stats for RobotsTxtMiddleware (:issue:`3100`)
* INFO log level is used to show telnet host/port (:issue:`3115`)
* a message is added to IgnoreRequest in RobotsTxtMiddleware (:issue:`3113`)
* better validation of ``url`` argument in ``Response.follow`` (:issue:`3131`)
* non-zero exit code is returned from Scrapy commands when an error happens
  during spider initialization (:issue:`3226`)
* Link extraction improvements: "ftp" is added to scheme list (:issue:`3152`);
  "flv" is added to common video extensions (:issue:`3165`)
* better error message when an exporter is disabled (:issue:`3358`);
* ``scrapy shell --help`` mentions syntax required for local files
  (``./file.html``) - :issue:`3496`.
* Referer header value is added to RFPDupeFilter log messages (:issue:`3588`)

Bug fixes
~~~~~~~~~

* fixed issue with extra blank lines in .csv exports under Windows
  (:issue:`3039`);
* proper handling of pickling errors in Python 3 when serializing objects
  for disk queues (:issue:`3082`)
* flags are now preserved when copying Requests (:issue:`3342`);
* FormRequest.from_response clickdata no longer ignores elements with
  ``input[type=image]`` (:issue:`3153`).
* FormRequest.from_response now preserves duplicate keys (:issue:`3247`)

Documentation improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~

* Docs are re-written to suggest .get/.getall API instead of
  .extract/.extract_first. Also, :ref:`topics-selectors` docs are updated
  and re-structured to match latest parsel docs; they now contain more topics,
  such as :ref:`selecting-attributes` or :ref:`topics-selectors-css-extensions`
  (:issue:`3390`).
* :ref:`topics-developer-tools` is a new tutorial which replaces
  old Firefox and Firebug tutorials (:issue:`3400`).
* SCRAPY_PROJECT environment variable is documented (:issue:`3518`);
* troubleshooting section is added to install instructions (:issue:`3517`);
* improved links to beginner resources in the tutorial
  (:issue:`3367`, :issue:`3468`);
* fixed :setting:`RETRY_HTTP_CODES` default values in docs (:issue:`3335`);
* remove unused ``DEPTH_STATS`` option from docs (:issue:`3245`);
* other cleanups (:issue:`3347`, :issue:`3350`, :issue:`3445`, :issue:`3544`,
  :issue:`3605`).

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

Compatibility shims for pre-1.0 Scrapy module names are removed
(:issue:`3318`):

* ``scrapy.command``
* ``scrapy.contrib`` (with all submodules)
* ``scrapy.contrib_exp`` (with all submodules)
* ``scrapy.dupefilter``
* ``scrapy.linkextractor``
* ``scrapy.project``
* ``scrapy.spider``
* ``scrapy.spidermanager``
* ``scrapy.squeue``
* ``scrapy.stats``
* ``scrapy.statscol``
* ``scrapy.utils.decorator``

See :ref:`module-relocations` for more information, or use suggestions
from Scrapy 1.5.x deprecation warnings to update your code.

Other deprecation removals:

* Deprecated ``scrapy.interfaces.ISpiderManager`` is removed; please use
  ``scrapy.interfaces.ISpiderLoader`` instead.
* Deprecated ``CrawlerSettings`` class is removed (:issue:`3327`).
* Deprecated ``Settings.overrides`` and ``Settings.defaults`` attributes
  are removed (:issue:`3327`, :issue:`3359`).

Other improvements, cleanups
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* All Scrapy tests now pass on Windows; Scrapy testing suite is executed
  in a Windows environment on CI (:issue:`3315`).
* Python 3.7 support (:issue:`3326`, :issue:`3150`, :issue:`3547`).
* Testing and CI fixes (:issue:`3526`, :issue:`3538`, :issue:`3308`,
  :issue:`3311`, :issue:`3309`, :issue:`3305`, :issue:`3210`, :issue:`3299`)
* ``scrapy.http.cookies.CookieJar.clear`` accepts "domain", "path" and "name"
  optional arguments (:issue:`3231`).
* additional files are included to sdist (:issue:`3495`);
* code style fixes (:issue:`3405`, :issue:`3304`);
* unneeded .strip() call is removed (:issue:`3519`);
* collections.deque is used to store MiddlewareManager methods instead
  of a list (:issue:`3476`)

.. _release-1.5.2:

Scrapy 1.5.2 (2019-01-22)
-------------------------

* *Security bugfix*: the Telnet console extension could be easily exploited by
  rogue websites POSTing content to http://localhost:6023. We haven't found a
  way to exploit it from Scrapy itself, but it is very easy to trick a browser
  into doing so, which elevates the risk for local development environments.

  *The fix is backward incompatible*: it enables telnet user-password
  authentication by default, with a randomly generated password. If you can't
  upgrade right away, please consider changing :setting:`TELNETCONSOLE_PORT`
  from its default value.

  See the :ref:`telnet console <topics-telnetconsole>` documentation for more
  information.

* Backported a fix for a CI build failure under the GCE environment caused by
  a boto import error.

.. _release-1.5.1:

Scrapy 1.5.1 (2018-07-12)
-------------------------

This is a maintenance release with important bug fixes, but no new features:

* ``O(N^2)`` gzip decompression issue which affected Python 3 and PyPy
  is fixed (:issue:`3281`);
* skipping of TLS validation errors is improved (:issue:`3166`);
* Ctrl-C handling is fixed in Python 3.5+ (:issue:`3096`);
* testing fixes (:issue:`3092`, :issue:`3263`);
* documentation improvements (:issue:`3058`, :issue:`3059`, :issue:`3089`,
  :issue:`3123`, :issue:`3127`, :issue:`3189`, :issue:`3224`, :issue:`3280`,
  :issue:`3279`, :issue:`3201`, :issue:`3260`, :issue:`3284`, :issue:`3298`,
  :issue:`3294`).

.. _release-1.5.0:

Scrapy 1.5.0 (2017-12-29)
-------------------------

This release brings small new features and improvements across the codebase.
Some highlights:

* Google Cloud Storage is supported in FilesPipeline and ImagesPipeline.
* Crawling with proxy servers becomes more efficient, as connections
  to proxies can now be reused.
* Warning, exception and logging messages are improved to make debugging
  easier.
* The ``scrapy parse`` command now allows setting custom request meta via
  the ``--meta`` argument.
* Compatibility with Python 3.6, PyPy and PyPy3 is improved;
  PyPy and PyPy3 are now supported officially, by running tests on CI.
* Better default handling of HTTP 308, 522 and 524 status codes.
* Documentation is improved, as usual.

Backward Incompatible Changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Scrapy 1.5 drops support for Python 3.3.
* The default Scrapy User-Agent now uses an https link to scrapy.org
  (:issue:`2983`). **This is technically backward-incompatible**; override
  :setting:`USER_AGENT` if you relied on the old value.
* Logging of settings overridden by ``custom_settings`` is fixed;
  **this is technically backward-incompatible** because the logger
  changes from ``[scrapy.utils.log]`` to ``[scrapy.crawler]``. If you're
  parsing Scrapy logs, please update your log parsers (:issue:`1343`).
* LinkExtractor now ignores the ``m4v`` extension by default; this is a
  change in behavior.
* 522 and 524 status codes are added to ``RETRY_HTTP_CODES`` (:issue:`2851`)

New features
~~~~~~~~~~~~

- Support ``<link>`` tags in ``Response.follow`` (:issue:`2785`)
- Support for ``ptpython`` REPL (:issue:`2654`)
- Google Cloud Storage support for FilesPipeline and ImagesPipeline
  (:issue:`2923`).
- New ``--meta`` option of the "scrapy parse" command allows passing additional
  request.meta (:issue:`2883`)
- Populate spider variable when using ``shell.inspect_response`` (:issue:`2812`)
- Handle HTTP 308 Permanent Redirect (:issue:`2844`)
- Add 522 and 524 to ``RETRY_HTTP_CODES`` (:issue:`2851`)
- Log versions information at startup (:issue:`2857`)
- ``scrapy.mail.MailSender`` now works in Python 3 (it requires Twisted 17.9.0)
- Connections to proxy servers are reused (:issue:`2743`)
- Add template for a downloader middleware (:issue:`2755`)
- Explicit message for NotImplementedError when parse callback not defined
  (:issue:`2831`)
- CrawlerProcess got an option to disable installation of the root log handler
  (:issue:`2921`)
- LinkExtractor now ignores ``m4v`` extension by default
- Better log messages for responses over :setting:`DOWNLOAD_WARNSIZE` and
  :setting:`DOWNLOAD_MAXSIZE` limits (:issue:`2927`)
- Show warning when a URL is put to ``Spider.allowed_domains`` instead of
  a domain (:issue:`2250`).

Bug fixes
~~~~~~~~~

- Fix logging of settings overridden by ``custom_settings``;
  **this is technically backward-incompatible** because the logger
  changes from ``[scrapy.utils.log]`` to ``[scrapy.crawler]``, so please
  update your log parsers if needed (:issue:`1343`)
- Default Scrapy User-Agent now uses https link to scrapy.org (:issue:`2983`).
  **This is technically backward-incompatible**; override
  :setting:`USER_AGENT` if you relied on old value.
- Fix PyPy and PyPy3 test failures, support them officially
  (:issue:`2793`, :issue:`2935`, :issue:`2990`, :issue:`3050`, :issue:`2213`,
  :issue:`3048`)
- Fix DNS resolver when ``DNSCACHE_ENABLED=False`` (:issue:`2811`)
- Add ``cryptography`` for Debian Jessie tox test env (:issue:`2848`)
- Add verification to check if Request callback is callable (:issue:`2766`)
- Port ``extras/qpsclient.py`` to Python 3 (:issue:`2849`)
- Use getfullargspec under the hood for Python 3 to stop DeprecationWarning
  (:issue:`2862`)
- Update deprecated test aliases (:issue:`2876`)
- Fix ``SitemapSpider`` support for alternate links (:issue:`2853`)

Docs
~~~~

- Added missing bullet point for the ``AUTOTHROTTLE_TARGET_CONCURRENCY``
  setting. (:issue:`2756`)
- Update Contributing docs, document new support channels
  (:issue:`2762`, :issue:`3038`)
- Include references to Scrapy subreddit in the docs
- Fix broken links; use ``https://`` for external links
  (:issue:`2978`, :issue:`2982`, :issue:`2958`)
- Document CloseSpider extension better (:issue:`2759`)
- Use ``pymongo.collection.Collection.insert_one()`` in MongoDB example
  (:issue:`2781`)
- Spelling mistakes and typo fixes
  (:issue:`2828`, :issue:`2837`, :issue:`2884`, :issue:`2924`)
- Clarify ``CSVFeedSpider.headers`` documentation (:issue:`2826`)
- Document ``DontCloseSpider`` exception and clarify ``spider_idle``
  (:issue:`2791`)
- Update "Releases" section in README (:issue:`2764`)
- Fix rst syntax in ``DOWNLOAD_FAIL_ON_DATALOSS`` docs (:issue:`2763`)
- Small fix in description of startproject arguments (:issue:`2866`)
- Clarify data types in Response.body docs (:issue:`2922`)
- Add a note about ``request.meta['depth']`` to DepthMiddleware docs (:issue:`2374`)
- Add a note about ``request.meta['dont_merge_cookies']`` to CookiesMiddleware
  docs (:issue:`2999`)
- Up-to-date example of project structure (:issue:`2964`, :issue:`2976`)
- A better example of ItemExporters usage (:issue:`2989`)
- Document ``from_crawler`` methods for spider and downloader middlewares
  (:issue:`3019`)

.. _release-1.4.0:

Scrapy 1.4.0 (2017-05-18)
-------------------------

Scrapy 1.4 does not bring that many breathtaking new features
but quite a few handy improvements nonetheless.

Scrapy now supports anonymous FTP sessions with customizable user and
password via the new :setting:`FTP_USER` and :setting:`FTP_PASSWORD` settings.
And if you're using Twisted version 17.1.0 or above, FTP is now available
with Python 3.

There's a new :meth:`response.follow <scrapy.http.TextResponse.follow>` method
for creating requests; **it is now the recommended way to create Requests
in Scrapy spiders**. This method makes it easier to write correct
spiders; ``response.follow`` has several advantages over creating
``scrapy.Request`` objects directly:

* it handles relative URLs;
* it works properly with non-ASCII URLs on non-UTF-8 pages;
* in addition to absolute and relative URLs it supports Selectors;
  for ``<a>`` elements it can also extract their href values.

For example, instead of this::

    for href in response.css('li.page a::attr(href)').extract():
        url = response.urljoin(href)
        yield scrapy.Request(url, self.parse, encoding=response.encoding)

One can now write this::

    for a in response.css('li.page a'):
        yield response.follow(a, self.parse)

Link extractors are also improved. They work similarly to what a regular
modern browser would do: leading and trailing whitespace is removed
from attributes (think ``href="   http://example.com"``) when building
``Link`` objects. This whitespace-stripping also happens for ``action``
attributes with ``FormRequest``.

**Please also note that link extractors do not canonicalize URLs by default
anymore.** This was puzzling users every now and then, and it's not what
browsers actually do, so we removed that extra transformation on extracted
links.

For those of you wanting more control over the ``Referer:`` header that Scrapy
sends when following links, you can set your own ``Referrer Policy``.
Prior to Scrapy 1.4, the default ``RefererMiddleware`` would simply and
blindly set it to the URL of the response that generated the HTTP request
(which could leak information on your URL seeds).
By default, Scrapy now behaves much like your regular browser does.
And this policy is fully customizable with W3C standard values
(or with something really custom of your own if you wish).
See :setting:`REFERRER_POLICY` for details.
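
For instance, a one-line ``settings.py`` sketch using one of the W3C standard
values (``same-origin`` is just an illustrative choice)::

    REFERRER_POLICY = "same-origin"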

To make Scrapy spiders easier to debug, Scrapy logs more stats by default
in 1.4: memory usage stats, detailed retry stats and detailed HTTP error code
stats. Similarly, the HTTP cache path is now also visible in logs.

Last but not least, Scrapy now has the option to make JSON and XML items
more human-readable, with newlines between items and even custom indenting
offset, using the new :setting:`FEED_EXPORT_INDENT` setting.
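
As a minimal ``settings.py`` sketch (the indent width is an arbitrary choice)::

    FEED_EXPORT_INDENT = 4  # newlines between items, 4-space indentation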

Enjoy! (Or read on for the rest of the changes in this release.)

Deprecations and Backward Incompatible Changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Default to ``canonicalize=False`` in
  :class:`scrapy.linkextractors.LinkExtractor
  <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
  (:issue:`2537`, fixes :issue:`1941` and :issue:`1982`):
  **warning, this is technically backward-incompatible**
- Enable memusage extension by default (:issue:`2539`, fixes :issue:`2187`);
  **this is technically backward-incompatible** so please check if you have
  any non-default ``MEMUSAGE_***`` options set.
- ``EDITOR`` environment variable now takes precedence over ``EDITOR``
  option defined in settings.py (:issue:`1829`); Scrapy default settings
  no longer depend on environment variables. **This is technically a backward
  incompatible change**.
- ``Spider.make_requests_from_url`` is deprecated
  (:issue:`1728`, fixes :issue:`1495`).

New Features
~~~~~~~~~~~~

- Accept proxy credentials in :reqmeta:`proxy` request meta key (:issue:`2526`)
- Support `brotli-compressed`_ content; requires optional `brotlipy`_
  (:issue:`2535`)
- New :ref:`response.follow <response-follow-example>` shortcut
  for creating requests (:issue:`1940`)
- Added ``flags`` argument and attribute to :class:`~scrapy.Request`
  objects (:issue:`2047`)
- Support Anonymous FTP (:issue:`2342`)
- Added ``retry/count``, ``retry/max_reached`` and ``retry/reason_count/<reason>``
  stats to :class:`RetryMiddleware <scrapy.downloadermiddlewares.retry.RetryMiddleware>`
  (:issue:`2543`)
- Added ``httperror/response_ignored_count`` and ``httperror/response_ignored_status_count/<status>``
  stats to :class:`HttpErrorMiddleware <scrapy.spidermiddlewares.httperror.HttpErrorMiddleware>`
  (:issue:`2566`)
- Customizable :setting:`Referrer policy <REFERRER_POLICY>` in
  :class:`RefererMiddleware <scrapy.spidermiddlewares.referer.RefererMiddleware>`
  (:issue:`2306`)
- New ``data:`` URI download handler (:issue:`2334`, fixes :issue:`2156`)
- Log cache directory when HTTP Cache is used (:issue:`2611`, fixes :issue:`2604`)
- Warn users when project contains duplicate spider names (fixes :issue:`2181`)
- ``scrapy.utils.datatypes.CaselessDict`` now accepts ``Mapping`` instances and
  not only dicts (:issue:`2646`)
- :ref:`Media downloads <topics-media-pipeline>`, with
  :class:`~scrapy.pipelines.files.FilesPipeline` or
  :class:`~scrapy.pipelines.images.ImagesPipeline`, can now optionally handle
  HTTP redirects using the new :setting:`MEDIA_ALLOW_REDIRECTS` setting
  (:issue:`2616`, fixes :issue:`2004`)
- Accept non-complete responses from websites using a new
  :setting:`DOWNLOAD_FAIL_ON_DATALOSS` setting (:issue:`2590`, fixes :issue:`2586`)
- Optional pretty-printing of JSON and XML items via
  :setting:`FEED_EXPORT_INDENT` setting (:issue:`2456`, fixes :issue:`1327`)
- Allow dropping fields in ``FormRequest.from_response`` formdata when
  a ``None`` value is passed (:issue:`667`)
- Per-request retry times with the new :reqmeta:`max_retry_times` meta key
  (:issue:`2642`); see the sketch after this list
- ``python -m scrapy`` as a more explicit alternative to ``scrapy`` command
  (:issue:`2740`)
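
A minimal sketch of the new meta key, as used inside a spider callback
(the URL and the retry limit are hypothetical, and the usual
``import scrapy`` at module level is assumed)::

    yield scrapy.Request(
        "http://example.com/flaky-page",  # hypothetical URL
        self.parse,
        meta={"max_retry_times": 5},  # per-request override of retry attempts
    )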

.. _brotli-compressed: https://www.ietf.org/rfc/rfc7932.txt
.. _brotlipy: https://github.com/python-hyper/brotlipy/

Bug fixes
~~~~~~~~~

- LinkExtractor now strips leading and trailing whitespaces from attributes
  (:issue:`2547`, fixes :issue:`1614`)
- Properly handle whitespaces in action attribute in
  :class:`~scrapy.FormRequest` (:issue:`2548`)
- Buffer CONNECT response bytes from proxy until all HTTP headers are received
  (:issue:`2495`, fixes :issue:`2491`)
- FTP downloader now works on Python 3, provided you use Twisted>=17.1
  (:issue:`2599`)
- Use body to choose response type after decompressing content (:issue:`2393`,
  fixes :issue:`2145`)
- Always decompress ``Content-Encoding: gzip`` at :class:`HttpCompressionMiddleware
  <scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware>` stage (:issue:`2391`)
- Respect custom log level in ``Spider.custom_settings`` (:issue:`2581`,
  fixes :issue:`1612`)
- 'make htmlview' fix for macOS (:issue:`2661`)
- Remove "commands" from the command list  (:issue:`2695`)
- Fix duplicate Content-Length header for POST requests with empty body (:issue:`2677`)
- Properly cancel large downloads, i.e. above :setting:`DOWNLOAD_MAXSIZE` (:issue:`1616`)
- ImagesPipeline: fixed processing of transparent PNG images with palette
  (:issue:`2675`)

Cleanups & Refactoring
~~~~~~~~~~~~~~~~~~~~~~

- Tests: remove temp files and folders (:issue:`2570`),
  fixed ProjectUtilsTest on macOS (:issue:`2569`),
  use portable pypy for Linux on Travis CI (:issue:`2710`)
- Separate building request from ``_requests_to_follow`` in CrawlSpider (:issue:`2562`)
- Remove “Python 3 progress” badge (:issue:`2567`)
- Add a couple more lines to ``.gitignore`` (:issue:`2557`)
- Remove bumpversion prerelease configuration (:issue:`2159`)
- Add codecov.yml file (:issue:`2750`)
- Set context factory implementation based on Twisted version (:issue:`2577`,
  fixes :issue:`2560`)
- Add omitted ``self`` arguments in default project middleware template (:issue:`2595`)
- Remove redundant ``slot.add_request()`` call in ExecutionEngine (:issue:`2617`)
- Catch more specific ``os.error`` exception in
  ``scrapy.pipelines.files.FSFilesStore`` (:issue:`2644`)
- Change "localhost" test server certificate (:issue:`2720`)
- Remove unused ``MEMUSAGE_REPORT`` setting (:issue:`2576`)

Documentation
~~~~~~~~~~~~~

- Binary mode is required for exporters (:issue:`2564`, fixes :issue:`2553`)
- Mention issue with :meth:`.FormRequest.from_response` due to bug in lxml (:issue:`2572`)
- Use single quotes uniformly in templates (:issue:`2596`)
- Document :reqmeta:`ftp_user` and :reqmeta:`ftp_password` meta keys (:issue:`2587`)
- Removed section on deprecated ``contrib/`` (:issue:`2636`)
- Recommend Anaconda when installing Scrapy on Windows
  (:issue:`2477`, fixes :issue:`2475`)
- FAQ: rewrite note on Python 3 support on Windows (:issue:`2690`)
- Rearrange selector sections (:issue:`2705`)
- Remove ``__nonzero__`` from :class:`~scrapy.selector.SelectorList`
  docs (:issue:`2683`)
- Mention how to disable request filtering in documentation of
  :setting:`DUPEFILTER_CLASS` setting (:issue:`2714`)
- Add sphinx_rtd_theme to docs setup readme (:issue:`2668`)
- Open file in text mode in JSON item writer example (:issue:`2729`)
- Clarify ``allowed_domains`` example (:issue:`2670`)

.. _release-1.3.3:

Scrapy 1.3.3 (2017-03-10)
-------------------------

Bug fixes
~~~~~~~~~

- Make ``SpiderLoader`` raise ``ImportError`` again by default for missing
  dependencies and wrong :setting:`SPIDER_MODULES`.
  These exceptions were silenced as warnings since 1.3.0.
  A new setting is introduced to toggle between warning and exception if
  needed; see :setting:`SPIDER_LOADER_WARN_ONLY` for details.

.. _release-1.3.2:

Scrapy 1.3.2 (2017-02-13)
-------------------------

Bug fixes
~~~~~~~~~

- Preserve request class when converting to/from dicts (utils.reqser) (:issue:`2510`).
- Use consistent selectors for author field in tutorial (:issue:`2551`).
- Fix TLS compatibility in Twisted 17+ (:issue:`2558`)

.. _release-1.3.1:

Scrapy 1.3.1 (2017-02-08)
-------------------------

New features
~~~~~~~~~~~~

- Support ``'True'`` and ``'False'`` string values for boolean settings (:issue:`2519`);
  you can now do something like ``scrapy crawl myspider -s REDIRECT_ENABLED=False``.
- Support kwargs with ``response.xpath()`` to use :ref:`XPath variables <topics-selectors-xpath-variables>`
  and ad-hoc namespace declarations;
  this requires at least Parsel v1.1 (:issue:`2457`); see the sketch after this list.
- Add support for Python 3.6 (:issue:`2485`).
- Run tests on PyPy (warning: some tests still fail, so PyPy is not supported yet).
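
A minimal sketch of XPath variables passed as kwargs (the element id is
hypothetical)::

    >>> response.xpath("//div[@id=$val]/a/text()", val="images").extract_first()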

Bug fixes
~~~~~~~~~

- Enforce ``DNS_TIMEOUT`` setting (:issue:`2496`).
- Fix :command:`view` command; it was a regression in v1.3.0 (:issue:`2503`).
- Fix tests regarding ``*_EXPIRES`` settings with Files/Images pipelines (:issue:`2460`).
- Fix name of generated pipeline class when using basic project template (:issue:`2466`).
- Fix compatibility with Twisted 17+ (:issue:`2496`, :issue:`2528`).
- Fix ``scrapy.Item`` inheritance on Python 3.6 (:issue:`2511`).
- Enforce numeric values for components order in ``SPIDER_MIDDLEWARES``,
  ``DOWNLOADER_MIDDLEWARES``, ``EXTENSIONS`` and ``SPIDER_CONTRACTS`` (:issue:`2420`).

Documentation
~~~~~~~~~~~~~

- Reword Code of Conduct section and upgrade to Contributor Covenant v1.4
  (:issue:`2469`).
- Clarify that passing spider arguments converts them to spider attributes
  (:issue:`2483`).
- Document ``formid`` argument on ``FormRequest.from_response()`` (:issue:`2497`).
- Add .rst extension to README files (:issue:`2507`).
- Mention LevelDB cache storage backend (:issue:`2525`).
- Use ``yield`` in sample callback code (:issue:`2533`).
- Add note about HTML entities decoding with ``.re()/.re_first()`` (:issue:`1704`).
- Typos (:issue:`2512`, :issue:`2534`, :issue:`2531`).

Cleanups
~~~~~~~~

- Remove redundant check in ``MetaRefreshMiddleware`` (:issue:`2542`).
- Faster checks in ``LinkExtractor`` for allow/deny patterns (:issue:`2538`).
- Remove dead code supporting old Twisted versions (:issue:`2544`).

.. _release-1.3.0:

Scrapy 1.3.0 (2016-12-21)
-------------------------

This release comes rather soon after 1.2.2 for one main reason:
it was discovered that releases from 0.18 up to 1.2.2 (inclusive) used
some code backported from Twisted (``scrapy.xlib.tx.*``),
even when newer Twisted modules were available.
Scrapy now uses ``twisted.web.client`` and ``twisted.internet.endpoints`` directly.
(See also cleanups below.)

As it is a major change, we wanted to get the bug fix out quickly
while not breaking any projects using the 1.2 series.

New Features
~~~~~~~~~~~~

- ``MailSender`` now accepts single strings as values for ``to`` and ``cc``
  arguments (:issue:`2272`)
- ``scrapy fetch url``, ``scrapy shell url`` and ``fetch(url)`` inside
  Scrapy shell now follow HTTP redirections by default (:issue:`2290`);
  See :command:`fetch` and :command:`shell` for details.
- ``HttpErrorMiddleware`` now logs errors with ``INFO`` level instead of ``DEBUG``;
  this is technically **backward incompatible** so please check your log parsers.
- By default, logger names now use a long-form path, e.g. ``[scrapy.extensions.logstats]``,
  instead of the shorter "top-level" variant of prior releases (e.g. ``[scrapy]``);
  this is **backward incompatible** if you have log parsers expecting the short
  logger name part. You can switch back to short logger names by setting
  :setting:`LOG_SHORT_NAMES` to ``True``; see the sketch after this list.
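
A minimal ``settings.py`` sketch to restore the old behavior::

    LOG_SHORT_NAMES = True  # log [scrapy] instead of e.g. [scrapy.extensions.logstats]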

Dependencies & Cleanups
~~~~~~~~~~~~~~~~~~~~~~~

- Scrapy now requires Twisted >= 13.1, which is already the case for many Linux
  distributions.
- As a consequence, we got rid of the ``scrapy.xlib.tx.*`` modules, which
  copied some Twisted code for users stuck with an "old" Twisted version.
- ``ChunkedTransferMiddleware`` is deprecated and removed from the default
  downloader middlewares.

.. _release-1.2.3:

Scrapy 1.2.3 (2017-03-03)
-------------------------

- Packaging fix: disallow unsupported Twisted versions in setup.py

.. _release-1.2.2:

Scrapy 1.2.2 (2016-12-06)
-------------------------

Bug fixes
~~~~~~~~~

- Fix a cryptic traceback when a pipeline fails on ``open_spider()`` (:issue:`2011`)
- Fix embedded IPython shell variables (fixing :issue:`396` that re-appeared
  in 1.2.0, fixed in :issue:`2418`)
- A couple of patches when dealing with robots.txt:

  - handle (non-standard) relative sitemap URLs (:issue:`2390`)
  - handle non-ASCII URLs and User-Agents in Python 2 (:issue:`2373`)

Documentation
~~~~~~~~~~~~~

- Document ``"download_latency"`` key in ``Request``'s ``meta`` dict (:issue:`2033`)
- Remove page on (deprecated & unsupported) Ubuntu packages from ToC (:issue:`2335`)
- A few fixed typos (:issue:`2346`, :issue:`2369`, :issue:`2369`, :issue:`2380`)
  and clarifications (:issue:`2354`, :issue:`2325`, :issue:`2414`)

Other changes
~~~~~~~~~~~~~

- Advertise `conda-forge`_ as Scrapy's official conda channel (:issue:`2387`)
- More helpful error messages when trying to use ``.css()`` or ``.xpath()``
  on non-Text Responses (:issue:`2264`)
- ``startproject`` command now generates a sample ``middlewares.py`` file (:issue:`2335`)
- Add more dependencies' version info in ``scrapy version`` verbose output (:issue:`2404`)
- Remove all ``*.pyc`` files from source distribution (:issue:`2386`)

.. _conda-forge: https://anaconda.org/conda-forge/scrapy

.. _release-1.2.1:

Scrapy 1.2.1 (2016-10-21)
-------------------------

Bug fixes
~~~~~~~~~

- Include OpenSSL's more permissive default ciphers when establishing
  TLS/SSL connections (:issue:`2314`).
- Fix "Location" HTTP header decoding on non-ASCII URL redirects (:issue:`2321`).

Documentation
~~~~~~~~~~~~~

- Fix JsonWriterPipeline example (:issue:`2302`).
- Various notes: :issue:`2330` on spider names,
  :issue:`2329` on middleware methods processing order,
  :issue:`2327` on getting multi-valued HTTP headers as lists.

Other changes
~~~~~~~~~~~~~

- Removed ``www.`` from ``start_urls`` in built-in spider templates (:issue:`2299`).

.. _release-1.2.0:

Scrapy 1.2.0 (2016-10-03)
-------------------------

New Features
~~~~~~~~~~~~

- New :setting:`FEED_EXPORT_ENCODING` setting to customize the encoding
  used when writing items to a file.
  This can be used to turn off ``\uXXXX`` escapes in JSON output.
  This is also useful for those wanting something other than UTF-8
  for XML or CSV output (:issue:`2034`); see the sketch after this list.
- ``startproject`` command now supports an optional destination directory
  to override the default one based on the project name (:issue:`2005`).
- New :setting:`SCHEDULER_DEBUG` setting to log requests serialization
  failures (:issue:`1610`).
- JSON encoder now supports serialization of ``set`` instances (:issue:`2058`).
- Interpret ``application/json-amazonui-streaming`` as ``TextResponse`` (:issue:`1503`).
- ``scrapy`` is imported by default when using shell tools (:command:`shell`,
  :ref:`inspect_response <topics-shell-inspect-response>`) (:issue:`2248`).
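
For instance, a minimal ``settings.py`` sketch that turns off ``\uXXXX``
escapes in JSON feeds::

    FEED_EXPORT_ENCODING = "utf-8"  # write raw UTF-8 instead of \uXXXX escapes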

Bug fixes
~~~~~~~~~

- DefaultRequestHeaders middleware now runs before UserAgent middleware
  (:issue:`2088`). **Warning: this is technically backward incompatible**,
  though we consider this a bug fix.
- HTTP cache extension and plugins that use the ``.scrapy`` data directory now
  work outside projects (:issue:`1581`).  **Warning: this is technically
  backward incompatible**, though we consider this a bug fix.
- ``Selector`` does not allow passing both ``response`` and ``text`` anymore
  (:issue:`2153`).
- Fixed logging of wrong callback name with ``scrapy parse`` (:issue:`2169`).
- Fix for an odd gzip decompression bug (:issue:`1606`).
- Fix for selected callbacks when using ``CrawlSpider`` with :command:`scrapy parse <parse>`
  (:issue:`2225`).
- Fix for invalid JSON and XML files when spider yields no items (:issue:`872`).
- Implement ``flush()`` for ``StreamLogger`` avoiding a warning in logs (:issue:`2125`).

Refactoring
~~~~~~~~~~~

- ``canonicalize_url`` has been moved to `w3lib.url`_ (:issue:`2168`).

.. _w3lib.url: https://w3lib.readthedocs.io/en/latest/w3lib.html#w3lib.url.canonicalize_url

Tests & Requirements
~~~~~~~~~~~~~~~~~~~~

Scrapy's new requirements baseline is Debian 8 "Jessie". It was previously
Ubuntu 12.04 Precise.
What this means in practice is that we run continuous integration tests
with these (main) package versions at a minimum:
Twisted 14.0, pyOpenSSL 0.14, lxml 3.4.

Scrapy may very well work with older versions of these packages
(for example, the code base still has switches for older Twisted versions)
but it is not guaranteed (because it is no longer tested).

Documentation
~~~~~~~~~~~~~

- Grammar fixes: :issue:`2128`, :issue:`1566`.
- Download stats badge removed from README (:issue:`2160`).
- New Scrapy :ref:`architecture diagram <topics-architecture>` (:issue:`2165`).
- Updated ``Response`` parameters documentation (:issue:`2197`).
- Reworded misleading :setting:`RANDOMIZE_DOWNLOAD_DELAY` description (:issue:`2190`).
- Add StackOverflow as a support channel (:issue:`2257`).

.. _release-1.1.4:

Scrapy 1.1.4 (2017-03-03)
-------------------------

- Packaging fix: disallow unsupported Twisted versions in setup.py

.. _release-1.1.3:

Scrapy 1.1.3 (2016-09-22)
-------------------------

Bug fixes
~~~~~~~~~

- Class attributes for subclasses of ``ImagesPipeline`` and ``FilesPipeline``
  work as they did before 1.1.1 (:issue:`2243`, fixes :issue:`2198`)

Documentation
~~~~~~~~~~~~~

- :ref:`Overview <intro-overview>` and :ref:`tutorial <intro-tutorial>`
  rewritten to use http://toscrape.com websites
  (:issue:`2236`, :issue:`2249`, :issue:`2252`).

.. _release-1.1.2:

Scrapy 1.1.2 (2016-08-18)
-------------------------

Bug fixes
~~~~~~~~~

- Introduce a missing :setting:`IMAGES_STORE_S3_ACL` setting to override
  the default ACL policy in ``ImagesPipeline`` when uploading images to S3
  (note that the default ACL policy is "private" -- instead of "public-read" --
  since Scrapy 1.1.0)
- :setting:`IMAGES_EXPIRES` default value set back to 90
  (the regression was introduced in 1.1.1)

.. _release-1.1.1:

Scrapy 1.1.1 (2016-07-13)
-------------------------

Bug fixes
~~~~~~~~~

- Add "Host" header in CONNECT requests to HTTPS proxies (:issue:`2069`)
- Use response ``body`` when choosing response class
  (:issue:`2001`, fixes :issue:`2000`)
- Do not fail on canonicalizing URLs with wrong netlocs
  (:issue:`2038`, fixes :issue:`2010`)
- A few fixes for ``HttpCompressionMiddleware`` (and ``SitemapSpider``):

  - Do not decode HEAD responses (:issue:`2008`, fixes :issue:`1899`)
  - Handle charset parameter in gzip Content-Type header
    (:issue:`2050`, fixes :issue:`2049`)
  - Do not decompress gzip octet-stream responses
    (:issue:`2065`, fixes :issue:`2063`)

- Catch (and ignore with a warning) exception when verifying certificate
  against IP-address hosts (:issue:`2094`, fixes :issue:`2092`)
- Make ``FilesPipeline`` and ``ImagesPipeline`` backward compatible again
  regarding the use of legacy class attributes for customization
  (:issue:`1989`, fixes :issue:`1985`)

New features
~~~~~~~~~~~~

- Enable genspider command outside project folder (:issue:`2052`)
- Retry HTTPS CONNECT ``TunnelError`` by default (:issue:`1974`)

Documentation
~~~~~~~~~~~~~

- ``FEED_TEMPDIR`` setting at lexicographical position (:commit:`9b3c72c`)
- Use idiomatic ``.extract_first()`` in overview (:issue:`1994`)
- Update years in copyright notice (:commit:`c2c8036`)
- Add information and example on errbacks (:issue:`1995`)
- Use "url" variable in downloader middleware example (:issue:`2015`)
- Grammar fixes (:issue:`2054`, :issue:`2120`)
- New FAQ entry on using BeautifulSoup in spider callbacks (:issue:`2048`)
- Add notes about Scrapy not working on Windows with Python 3 (:issue:`2060`)
- Encourage complete titles in pull requests (:issue:`2026`)

Tests
~~~~~

- Upgrade py.test requirement on Travis CI and Pin pytest-cov to 2.2.1 (:issue:`2095`)

.. _release-1.1.0:

Scrapy 1.1.0 (2016-05-11)
-------------------------

This 1.1 release brings a lot of interesting features and bug fixes:

- Scrapy 1.1 has beta Python 3 support (requires Twisted >= 15.5). See
  :ref:`news_betapy3` for more details and some limitations.
- Hot new features:

  - Item loaders now support nested loaders (:issue:`1467`).
  - ``FormRequest.from_response`` improvements (:issue:`1382`, :issue:`1137`).
  - Added setting :setting:`AUTOTHROTTLE_TARGET_CONCURRENCY` and improved
    AutoThrottle docs (:issue:`1324`).
  - Added ``response.text`` to get body as unicode (:issue:`1730`).
  - Anonymous S3 connections (:issue:`1358`).
  - Deferreds in downloader middlewares (:issue:`1473`). This enables better
    robots.txt handling (:issue:`1471`).
  - HTTP caching now follows RFC2616 more closely, added settings
    :setting:`HTTPCACHE_ALWAYS_STORE` and
    :setting:`HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS` (:issue:`1151`).
  - Selectors were extracted to the parsel_ library (:issue:`1409`). This means
    you can use Scrapy Selectors without Scrapy and also upgrade the
    selectors engine without needing to upgrade Scrapy.
  - HTTPS downloader now does TLS protocol negotiation by default,
    instead of forcing TLS 1.0. You can also set the SSL/TLS method
    using the new :setting:`DOWNLOADER_CLIENT_TLS_METHOD`.

- These bug fixes may require your attention:

  - Don't retry bad requests (HTTP 400) by default (:issue:`1289`).
    If you need the old behavior, add ``400`` to :setting:`RETRY_HTTP_CODES`.
  - Fix shell files argument handling (:issue:`1710`, :issue:`1550`).
    If you try ``scrapy shell index.html`` it will try to load the URL
    ``http://index.html``; use ``scrapy shell ./index.html`` to load a local
    file.
  - Robots.txt compliance is now enabled by default for newly-created projects
    (:issue:`1724`). Scrapy will also wait for robots.txt to be downloaded
    before proceeding with the crawl (:issue:`1735`). If you want to disable
    this behavior, update :setting:`ROBOTSTXT_OBEY` in ``settings.py`` file
    after creating a new project.
  - Exporters now work on unicode, instead of bytes by default (:issue:`1080`).
    If you use :class:`~scrapy.exporters.PythonItemExporter`, you may want to
    update your code to disable binary mode which is now deprecated.
  - Accept XML node names containing dots as valid (:issue:`1533`).
  - When uploading files or images to S3 (with ``FilesPipeline`` or
    ``ImagesPipeline``), the default ACL policy is now "private" instead
    of "public". **Warning: backward incompatible!**
    You can use :setting:`FILES_STORE_S3_ACL` to change it.
  - We've reimplemented ``canonicalize_url()`` for more correct output,
    especially for URLs with non-ASCII characters (:issue:`1947`).
    This could change link extractors' output compared to previous Scrapy
    versions. It may also invalidate some cache entries you could still
    have from pre-1.1 runs. **Warning: backward incompatible!**

Keep reading for more details on other improvements and bug fixes.

.. _news_betapy3:

Beta Python 3 Support
~~~~~~~~~~~~~~~~~~~~~

We have been hard at work to make Scrapy run on Python 3. As a result, now
you can run spiders on Python 3.3, 3.4 and 3.5 (Twisted >= 15.5 required). Some
features are still missing (and some may never be ported).

Almost all builtin extensions/middlewares are expected to work.
However, we are aware of some limitations in Python 3:

- Scrapy does not work on Windows with Python 3
- Sending emails is not supported
- FTP download handler is not supported
- Telnet console is not supported

Additional New Features and Enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Scrapy now has a `Code of Conduct`_ (:issue:`1681`).
- Command line tool now has completion for zsh (:issue:`934`).
- Improvements to ``scrapy shell``:

  - Support for bpython and configure preferred Python shell via
    ``SCRAPY_PYTHON_SHELL`` (:issue:`1100`, :issue:`1444`).
  - Support URLs without scheme (:issue:`1498`)
    **Warning: backward incompatible!**
  - Bring back support for relative file path (:issue:`1710`, :issue:`1550`).

- Added :setting:`MEMUSAGE_CHECK_INTERVAL_SECONDS` setting to change default check
  interval (:issue:`1282`).
- Download handlers are now lazy-loaded on first request using their
  scheme (:issue:`1390`, :issue:`1421`).
- HTTPS download handlers do not force TLS 1.0 anymore; instead,
  OpenSSL's ``SSLv23_method()/TLS_method()`` is used, allowing it to try to
  negotiate the highest TLS protocol version that the remote host supports
  (:issue:`1794`, :issue:`1629`).
- ``RedirectMiddleware`` now skips the status codes from
  ``handle_httpstatus_list``, whether defined as a spider attribute
  or in a ``Request``'s ``meta`` key (:issue:`1334`, :issue:`1364`,
  :issue:`1447`).
- Form submission:

  - now works with ``<button>`` elements too (:issue:`1469`).
  - an empty string is now used for submit buttons without a value
    (:issue:`1472`)

- Dict-like settings now have per-key priorities
  (:issue:`1135`, :issue:`1149` and :issue:`1586`).
- Sending non-ASCII emails (:issue:`1662`)
- ``CloseSpider`` and ``SpiderState`` extensions now get disabled if no relevant
  setting is set (:issue:`1723`, :issue:`1725`).
- Added method ``ExecutionEngine.close`` (:issue:`1423`).
- Added method ``CrawlerRunner.create_crawler`` (:issue:`1528`).
- Scheduler priority queue can now be customized via
  :setting:`SCHEDULER_PRIORITY_QUEUE` (:issue:`1822`).
- ``.pps`` links are now ignored by default in link extractors (:issue:`1835`).
- temporary data folder for FTP and S3 feed storages can be customized
  using a new :setting:`FEED_TEMPDIR` setting (:issue:`1847`).
- ``FilesPipeline`` and ``ImagesPipeline`` settings are now instance attributes
  instead of class attributes, enabling spider-specific behaviors (:issue:`1891`).
- ``JsonItemExporter`` now formats opening and closing square brackets
  on their own line (first and last lines of output file) (:issue:`1950`).
- If available, ``botocore`` is used for ``S3FeedStorage``, ``S3DownloadHandler``
  and ``S3FilesStore`` (:issue:`1761`, :issue:`1883`).
- Tons of documentation updates and related fixes (:issue:`1291`, :issue:`1302`,
  :issue:`1335`, :issue:`1683`, :issue:`1660`, :issue:`1642`, :issue:`1721`,
  :issue:`1727`, :issue:`1879`).
- Other refactoring, optimizations and cleanup (:issue:`1476`, :issue:`1481`,
  :issue:`1477`, :issue:`1315`, :issue:`1290`, :issue:`1750`, :issue:`1881`).

.. _Code of Conduct: https://github.com/scrapy/scrapy/blob/master/CODE_OF_CONDUCT.md

Deprecations and Removals
~~~~~~~~~~~~~~~~~~~~~~~~~

- Added ``to_bytes`` and ``to_unicode``, deprecated ``str_to_unicode`` and
  ``unicode_to_str`` functions (:issue:`778`).
- ``binary_is_text`` is introduced, to replace use of ``isbinarytext``
  (but with inverse return value) (:issue:`1851`)
- The ``optional_features`` set has been removed (:issue:`1359`).
- The ``--lsprof`` command line option has been removed (:issue:`1689`).
  **Warning: backward incompatible**, but doesn't break user code.
- The following datatypes were deprecated (:issue:`1720`):

  + ``scrapy.utils.datatypes.MultiValueDictKeyError``
  + ``scrapy.utils.datatypes.MultiValueDict``
  + ``scrapy.utils.datatypes.SiteNode``

- The previously bundled ``scrapy.xlib.pydispatch`` library was deprecated and
  replaced by `pydispatcher <https://pypi.org/project/PyDispatcher/>`_.

Relocations
~~~~~~~~~~~

- ``telnetconsole`` was relocated to ``extensions/`` (:issue:`1524`).

  + Note: telnet is not enabled on Python 3
    (https://github.com/scrapy/scrapy/pull/1524#issuecomment-146985595)

Bugfixes
~~~~~~~~

- Scrapy does not retry requests that got a ``HTTP 400 Bad Request``
  response anymore (:issue:`1289`). **Warning: backward incompatible!**
- Support empty password for http_proxy config (:issue:`1274`).
- Interpret ``application/x-json`` as ``TextResponse`` (:issue:`1333`).
- Support link rel attribute with multiple values (:issue:`1201`).
- Fixed ``scrapy.FormRequest.from_response`` when there is a ``<base>``
  tag (:issue:`1564`).
- Fixed :setting:`TEMPLATES_DIR` handling (:issue:`1575`).
- Various ``FormRequest`` fixes (:issue:`1595`, :issue:`1596`, :issue:`1597`).
- Makes ``_monkeypatches`` more robust (:issue:`1634`).
- Fixed bug on ``XMLItemExporter`` with non-string fields in
  items (:issue:`1738`).
- Fixed startproject command in macOS (:issue:`1635`).
- Fixed :class:`~scrapy.exporters.PythonItemExporter` and CSVExporter for
  non-string item types (:issue:`1737`).
- Various logging related fixes (:issue:`1294`, :issue:`1419`, :issue:`1263`,
  :issue:`1624`, :issue:`1654`, :issue:`1722`, :issue:`1726` and :issue:`1303`).
- Fixed bug in ``utils.template.render_templatefile()`` (:issue:`1212`).
- Sitemap extraction from ``robots.txt`` is now case-insensitive (:issue:`1902`).
- Fixed HTTPS+CONNECT tunnels getting mixed up when using multiple proxies
  to the same remote host (:issue:`1912`).

.. _release-1.0.7:

Scrapy 1.0.7 (2017-03-03)
-------------------------

- Packaging fix: disallow unsupported Twisted versions in setup.py

.. _release-1.0.6:

Scrapy 1.0.6 (2016-05-04)
-------------------------

- FIX: RetryMiddleware is now robust to non-standard HTTP status codes (:issue:`1857`)
- FIX: Filestorage HTTP cache was checking wrong modified time (:issue:`1875`)
- DOC: Support for Sphinx 1.4+ (:issue:`1893`)
- DOC: Consistency in selectors examples (:issue:`1869`)

.. _release-1.0.5:

Scrapy 1.0.5 (2016-02-04)
-------------------------

- FIX: [Backport] Ignore bogus links in LinkExtractors (fixes :issue:`907`, :commit:`108195e`)
- TST: Changed buildbot makefile to use 'pytest' (:commit:`1f3d90a`)
- DOC: Fixed typos in tutorial and media-pipeline (:commit:`808a9ea` and :commit:`803bd87`)
- DOC: Add AjaxCrawlMiddleware to DOWNLOADER_MIDDLEWARES_BASE in settings docs (:commit:`aa94121`)

.. _release-1.0.4:

Scrapy 1.0.4 (2015-12-30)
-------------------------

- Ignoring xlib/tx folder, depending on Twisted version. (:commit:`7dfa979`)
- Run on new travis-ci infra (:commit:`6e42f0b`)
- Spelling fixes (:commit:`823a1cc`)
- escape nodename in xmliter regex (:commit:`da3c155`)
- test xml nodename with dots (:commit:`4418fc3`)
- TST don't use broken Pillow version in tests (:commit:`a55078c`)
- disable log on version command. closes #1426 (:commit:`86fc330`)
- disable log on startproject command (:commit:`db4c9fe`)
- Add PyPI download stats badge (:commit:`df2b944`)
- don't run tests twice on Travis if a PR is made from a scrapy/scrapy branch (:commit:`a83ab41`)
- Add Python 3 porting status badge to the README (:commit:`73ac80d`)
- fixed RFPDupeFilter persistence (:commit:`97d080e`)
- TST a test to show that dupefilter persistence is not working (:commit:`97f2fb3`)
- explicit close file on file:// scheme handler (:commit:`d9b4850`)
- Disable dupefilter in shell (:commit:`c0d0734`)
- DOC: Add captions to toctrees which appear in sidebar (:commit:`aa239ad`)
- DOC Removed pywin32 from install instructions as it's already declared as dependency. (:commit:`10eb400`)
- Added installation notes about using Conda for Windows and other OSes. (:commit:`1c3600a`)
- Fixed minor grammar issues. (:commit:`7f4ddd5`)
- fixed a typo in the documentation. (:commit:`b71f677`)
- Version 1 now exists (:commit:`5456c0e`)
- fix another invalid xpath error (:commit:`0a1366e`)
- fix ValueError: Invalid XPath: //div/[id="not-exists"]/text() on selectors.rst (:commit:`ca8d60f`)
- Typos corrections (:commit:`7067117`)
- fix typos in downloader-middleware.rst and exceptions.rst, middlware -> middleware (:commit:`32f115c`)
- Add note to Ubuntu install section about Debian compatibility (:commit:`23fda69`)
- Replace alternative macOS install workaround with virtualenv (:commit:`98b63ee`)
- Reference Homebrew's homepage for installation instructions (:commit:`1925db1`)
- Add oldest supported tox version to contributing docs (:commit:`5d10d6d`)
- Note in install docs about pip being already included in python>=2.7.9 (:commit:`85c980e`)
- Add non-python dependencies to Ubuntu install section in the docs (:commit:`fbd010d`)
- Add macOS installation section to docs (:commit:`d8f4cba`)
- DOC(ENH): specify path to rtd theme explicitly (:commit:`de73b1a`)
- minor: scrapy.Spider docs grammar (:commit:`1ddcc7b`)
- Make common practices sample code match the comments (:commit:`1b85bcf`)
- nextcall repetitive calls (heartbeats). (:commit:`55f7104`)
- Backport fix compatibility with Twisted 15.4.0 (:commit:`b262411`)
- pin pytest to 2.7.3 (:commit:`a6535c2`)
- Merge pull request #1512 from mgedmin/patch-1 (:commit:`8876111`)
- Merge pull request #1513 from mgedmin/patch-2 (:commit:`5d4daf8`)
- Typo (:commit:`f8d0682`)
- Fix list formatting (:commit:`5f83a93`)
- fix Scrapy squeue tests after recent changes to queuelib (:commit:`3365c01`)
- Merge pull request #1475 from rweindl/patch-1 (:commit:`2d688cd`)
- Update tutorial.rst (:commit:`fbc1f25`)
- Merge pull request #1449 from rhoekman/patch-1 (:commit:`7d6538c`)
- Small grammatical change (:commit:`8752294`)
- Add openssl version to version command (:commit:`13c45ac`)

.. _release-1.0.3:

Scrapy 1.0.3 (2015-08-11)
-------------------------

- add service_identity to Scrapy install_requires (:commit:`cbc2501`)
- Workaround for travis#296 (:commit:`66af9cd`)

.. _release-1.0.2:

Scrapy 1.0.2 (2015-08-06)
-------------------------

- Twisted 15.3.0 does not raise PicklingError when serializing lambda functions (:commit:`b04dd7d`)
- Minor method name fix (:commit:`6f85c7f`)
- minor: scrapy.Spider grammar and clarity (:commit:`9c9d2e0`)
- Put a blurb about support channels in CONTRIBUTING (:commit:`c63882b`)
- Fixed typos (:commit:`a9ae7b0`)
- Fix doc reference. (:commit:`7c8a4fe`)

.. _release-1.0.1:

Scrapy 1.0.1 (2015-07-01)
-------------------------

- Unquote request path before passing to FTPClient; it already escapes paths (:commit:`cc00ad2`)
- include tests/ to source distribution in MANIFEST.in (:commit:`eca227e`)
- DOC Fix SelectJmes documentation (:commit:`b8567bc`)
- DOC Bring Ubuntu and Archlinux outside of Windows subsection (:commit:`392233f`)
- DOC remove version suffix from Ubuntu package (:commit:`5303c66`)
- DOC Update release date for 1.0 (:commit:`c89fa29`)

.. _release-1.0.0:

Scrapy 1.0.0 (2015-06-19)
-------------------------

You will find a lot of new features and bugfixes in this major release. Make
sure to check our updated :ref:`overview <intro-overview>` to get a glimpse of
some of the changes, along with our brushed-up :ref:`tutorial <intro-tutorial>`.

Support for returning dictionaries in spiders
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Declaring and returning Scrapy Items is no longer necessary to collect the
scraped data from your spider; you can now return explicit dictionaries
instead.

*Classic version*

::

    class MyItem(scrapy.Item):
        url = scrapy.Field()

    class MySpider(scrapy.Spider):
        def parse(self, response):
            return MyItem(url=response.url)

*New version*

::

    class MySpider(scrapy.Spider):
        def parse(self, response):
            return {'url': response.url}

Per-spider settings (GSoC 2014)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The last Google Summer of Code project accomplished an important redesign of
the mechanism used for populating settings, introducing explicit priorities to
override any given setting. As an extension of that goal, we included a new
level of priority for settings that act exclusively for a single spider,
allowing them to redefine project settings.

Start using it by defining a :attr:`~scrapy.spiders.Spider.custom_settings`
class variable in your spider::

    class MySpider(scrapy.Spider):
        custom_settings = {
            "DOWNLOAD_DELAY": 5.0,
            "RETRY_ENABLED": False,
        }

Read more about settings population: :ref:`topics-settings`

Python Logging
~~~~~~~~~~~~~~

Scrapy 1.0 has moved away from Twisted logging and now uses Python's built-in
``logging`` module as the default logging system. We're maintaining backward
compatibility for most of the old custom interface for calling logging
functions, but you'll get warnings to switch to the Python logging API
entirely.

*Old version*

::

    from scrapy import log
    log.msg('MESSAGE', log.INFO)

*New version*

::

    import logging
    logging.info('MESSAGE')

Logging with spiders remains the same, but on top of the
:meth:`~scrapy.spiders.Spider.log` method you’ll have access to a custom
:attr:`~scrapy.spiders.Spider.logger` created for the spider to issue log
events:

::

    class MySpider(scrapy.Spider):
        def parse(self, response):
            self.logger.info('Response received')

Read more in the logging documentation: :ref:`topics-logging`

Crawler API refactoring (GSoC 2014)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Another milestone for the last Google Summer of Code was a refactoring of the
internal API, seeking simpler and easier usage. Check the new core interface
in: :ref:`topics-api`

A common situation where you will face these changes is while running Scrapy
from scripts. Here’s a quick example of how to run a Spider manually with the
new API:

::

    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(MySpider)
    process.start()

Bear in mind this feature is still under development and its API may change
until it reaches a stable status.

See more examples for scripts running Scrapy: :ref:`topics-practices`

.. _module-relocations:

Module Relocations
~~~~~~~~~~~~~~~~~~

There’s been a large rearrangement of modules trying to improve the general
structure of Scrapy. Main changes were separating various subpackages into
new projects and dissolving both ``scrapy.contrib`` and ``scrapy.contrib_exp``
into top level packages. Backward compatibility was kept among internal
relocations, while importing deprecated modules expect warnings indicating
their new place.

Full list of relocations
************************

Outsourced packages

.. note::
    These extensions went through some minor changes, e.g. some setting names
    were changed. Please check the documentation in each new repository to
    get familiar with the new usage.

+-------------------------------------+-------------------------------------+
| Old location                        | New location                        |
+=====================================+=====================================+
| scrapy.commands.deploy              | `scrapyd-client <https://github.com |
|                                     | /scrapy/scrapyd-client>`_           |
|                                     | (See other alternatives here:       |
|                                     | :ref:`topics-deploy`)               |
+-------------------------------------+-------------------------------------+
| scrapy.contrib.djangoitem           | `scrapy-djangoitem <https://github. |
|                                     | com/scrapy-plugins/scrapy-djangoite |
|                                     | m>`_                                |
+-------------------------------------+-------------------------------------+
| scrapy.webservice                   | `scrapy-jsonrpc <https://github.com |
|                                     | /scrapy-plugins/scrapy-jsonrpc>`_   |
+-------------------------------------+-------------------------------------+

``scrapy.contrib_exp`` and ``scrapy.contrib`` dissolutions

+-------------------------------------+-------------------------------------+
| Old location                        | New location                        |
+=====================================+=====================================+
| scrapy.contrib\_exp.downloadermidd\ | scrapy.downloadermiddlewares.decom\ |
| leware.decompression                | pression                            |
+-------------------------------------+-------------------------------------+
| scrapy.contrib\_exp.iterators       | scrapy.utils.iterators              |
+-------------------------------------+-------------------------------------+
| scrapy.contrib.downloadermiddleware | scrapy.downloadermiddlewares        |
+-------------------------------------+-------------------------------------+
| scrapy.contrib.exporter             | scrapy.exporters                    |
+-------------------------------------+-------------------------------------+
| scrapy.contrib.linkextractors       | scrapy.linkextractors               |
+-------------------------------------+-------------------------------------+
| scrapy.contrib.loader               | scrapy.loader                       |
+-------------------------------------+-------------------------------------+
| scrapy.contrib.loader.processor     | scrapy.loader.processors            |
+-------------------------------------+-------------------------------------+
| scrapy.contrib.pipeline             | scrapy.pipelines                    |
+-------------------------------------+-------------------------------------+
| scrapy.contrib.spidermiddleware     | scrapy.spidermiddlewares            |
+-------------------------------------+-------------------------------------+
| scrapy.contrib.spiders              | scrapy.spiders                      |
+-------------------------------------+-------------------------------------+
| * scrapy.contrib.closespider        | scrapy.extensions.\*                |
| * scrapy.contrib.corestats          |                                     |
| * scrapy.contrib.debug              |                                     |
| * scrapy.contrib.feedexport         |                                     |
| * scrapy.contrib.httpcache          |                                     |
| * scrapy.contrib.logstats           |                                     |
| * scrapy.contrib.memdebug           |                                     |
| * scrapy.contrib.memusage           |                                     |
| * scrapy.contrib.spiderstate        |                                     |
| * scrapy.contrib.statsmailer        |                                     |
| * scrapy.contrib.throttle           |                                     |
+-------------------------------------+-------------------------------------+

Plural renames and Modules unification

+-------------------------------------+-------------------------------------+
| Old location                        | New location                        |
+=====================================+=====================================+
| scrapy.command                      | scrapy.commands                     |
+-------------------------------------+-------------------------------------+
| scrapy.dupefilter                   | scrapy.dupefilters                  |
+-------------------------------------+-------------------------------------+
| scrapy.linkextractor                | scrapy.linkextractors               |
+-------------------------------------+-------------------------------------+
| scrapy.spider                       | scrapy.spiders                      |
+-------------------------------------+-------------------------------------+
| scrapy.squeue                       | scrapy.squeues                      |
+-------------------------------------+-------------------------------------+
| scrapy.statscol                     | scrapy.statscollectors              |
+-------------------------------------+-------------------------------------+
| scrapy.utils.decorator              | scrapy.utils.decorators             |
+-------------------------------------+-------------------------------------+

Class renames

+-------------------------------------+-------------------------------------+
| Old location                        | New location                        |
+=====================================+=====================================+
| scrapy.spidermanager.SpiderManager  | scrapy.spiderloader.SpiderLoader    |
+-------------------------------------+-------------------------------------+

Settings renames

+-------------------------------------+-------------------------------------+
| Old location                        | New location                        |
+=====================================+=====================================+
| SPIDER\_MANAGER\_CLASS              | SPIDER\_LOADER\_CLASS               |
+-------------------------------------+-------------------------------------+

Changelog
~~~~~~~~~

New Features and Enhancements

- Python logging (:issue:`1060`, :issue:`1235`, :issue:`1236`, :issue:`1240`,
  :issue:`1259`, :issue:`1278`, :issue:`1286`)
- FEED_EXPORT_FIELDS option (:issue:`1159`, :issue:`1224`)
- DNS cache size and timeout options (:issue:`1132`)
- Support namespace prefix in xmliter_lxml (:issue:`963`)
- Reactor threadpool max size setting (:issue:`1123`)
- Allow spiders to return dicts (:issue:`1081`); see the sketch after this
  list
- Add Response.urljoin() helper (:issue:`1086`)
- Look in ~/.config/scrapy.cfg for user config (:issue:`1098`)
- Handle TLS SNI (:issue:`1101`)
- Add ``SelectorList.extract_first()`` (:issue:`624`, :issue:`1145`)
- Added JmesSelect (:issue:`1016`)
- Add gzip compression to filesystem HTTP cache backend (:issue:`1020`)
- CSS support in link extractors (:issue:`983`)
- Support ``dont_cache`` request meta key in httpcache (#19, #689) (:issue:`821`)
- Add signal to be sent when request is dropped by the scheduler
  (:issue:`961`)
- Avoid downloading large responses (:issue:`946`)
- Allow specifying the quotechar in CSVFeedSpider (:issue:`882`)
- Add referer to "Spider error processing" log message (:issue:`795`)
- Process robots.txt once (:issue:`896`)
- GSoC Per-spider settings (:issue:`854`)
- Add project name validation (:issue:`817`)
- GSoC API cleanup (:issue:`816`, :issue:`1128`, :issue:`1147`,
  :issue:`1148`, :issue:`1156`, :issue:`1185`, :issue:`1187`, :issue:`1258`,
  :issue:`1268`, :issue:`1276`, :issue:`1285`, :issue:`1284`)
- Be more responsive with IO operations (:issue:`1074` and :issue:`1075`)
- Do leveldb compaction for httpcache on closing (:issue:`1297`)
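
A minimal sketch combining three of the features above (dict items,
``Response.urljoin()`` and ``extract_first()``); the URL and the XPath
expressions are placeholders:

.. code-block:: python

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        start_urls = ["http://example.com/"]

        def parse(self, response):
            # Spiders may now yield plain dicts instead of Item instances.
            yield {"title": response.xpath("//title/text()").extract_first()}
            href = response.xpath("//a/@href").extract_first()
            if href:
                # Response.urljoin() resolves relative links against response.url.
                yield scrapy.Request(response.urljoin(href))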

Deprecations and Removals

- Deprecate htmlparser link extractor (:issue:`1205`)
- Remove deprecated code from FeedExporter (:issue:`1155`)
- Remove a leftover of 0.15 compatibility (:issue:`925`)
- Drop support for CONCURRENT_REQUESTS_PER_SPIDER (:issue:`895`)
- Drop old engine code (:issue:`911`)
- Deprecate SgmlLinkExtractor (:issue:`777`)

Relocations

- Move exporters/__init__.py to exporters.py (:issue:`1242`)
- Move base classes to their packages (:issue:`1218`, :issue:`1233`)
- Module relocation (:issue:`1181`, :issue:`1210`)
- Rename SpiderManager to SpiderLoader (:issue:`1166`)
- Remove djangoitem (:issue:`1177`)
- Remove scrapy deploy command (:issue:`1102`)
- Dissolve contrib_exp (:issue:`1134`)
- Deleted bin folder from root, fixes #913 (:issue:`914`)
- Remove jsonrpc based webservice (:issue:`859`)
- Move Test cases under project root dir (:issue:`827`, :issue:`841`)
- Fix backward incompatibility for relocated paths in settings
  (:issue:`1267`)

Documentation

- CrawlerProcess documentation (:issue:`1190`)
- Favoring web scraping over screen scraping in the descriptions
  (:issue:`1188`)
- Some improvements for Scrapy tutorial (:issue:`1180`)
- Documenting Files Pipeline together with Images Pipeline (:issue:`1150`)
- Deployment docs tweaks (:issue:`1164`)
- Added deployment section covering scrapyd-deploy and shub (:issue:`1124`)
- Adding more settings to project template (:issue:`1073`)
- Some improvements to overview page (:issue:`1106`)
- Updated link in docs/topics/architecture.rst (:issue:`647`)
- DOC reorder topics (:issue:`1022`)
- Update list of Request.meta special keys (:issue:`1071`)
- DOC document download_timeout (:issue:`898`)
- DOC simplify extension docs (:issue:`893`)
- Leaks docs (:issue:`894`)
- DOC document from_crawler method for item pipelines (:issue:`904`)
- Document that ``spider_error`` doesn't support deferreds (:issue:`1292`)
- Corrections & Sphinx related fixes (:issue:`1220`, :issue:`1219`,
  :issue:`1196`, :issue:`1172`, :issue:`1171`, :issue:`1169`, :issue:`1160`,
  :issue:`1154`, :issue:`1127`, :issue:`1112`, :issue:`1105`, :issue:`1041`,
  :issue:`1082`, :issue:`1033`, :issue:`944`, :issue:`866`, :issue:`864`,
  :issue:`796`, :issue:`1260`, :issue:`1271`, :issue:`1293`, :issue:`1298`)

Bugfixes

- Item multi inheritance fix (:issue:`353`, :issue:`1228`)
- ItemLoader.load_item: iterate over copy of fields (:issue:`722`)
- Fix Unhandled error in Deferred (RobotsTxtMiddleware) (:issue:`1131`,
  :issue:`1197`)
- Force to read DOWNLOAD_TIMEOUT as int (:issue:`954`)
- scrapy.utils.misc.load_object should print full traceback (:issue:`902`)
- Fix bug for ".local" host name (:issue:`878`)
- Fix for Enabled extensions, middlewares, pipelines info not printed
  anymore (:issue:`879`)
- Fix ``dont_merge_cookies`` bad behaviour when set to false in meta
  (:issue:`846`)

Python 3 In Progress Support

- Disable scrapy.telnet if twisted.conch is not available (:issue:`1161`)
- Fix Python 3 syntax errors in ajaxcrawl.py (:issue:`1162`)
- More Python 3 compatibility changes for urllib (:issue:`1121`)
- assertItemsEqual was renamed to assertCountEqual in Python 3.
  (:issue:`1070`)
- Import unittest.mock if available. (:issue:`1066`)
- Update deprecated cgi.parse_qsl to use six's parse_qsl (:issue:`909`)
- Prevent Python 3 port regressions (:issue:`830`)
- PY3: use MutableMapping for python 3 (:issue:`810`)
- PY3: use six.BytesIO and six.moves.cStringIO (:issue:`803`)
- PY3: fix xmlrpclib and email imports (:issue:`801`)
- PY3: use six for robotparser and urlparse (:issue:`800`)
- PY3: use six.iterkeys, six.iteritems, and tempfile (:issue:`799`)
- PY3: fix has_key and use six.moves.configparser (:issue:`798`)
- PY3: use six.moves.cPickle (:issue:`797`)
- PY3 make it possible to run some tests in Python3 (:issue:`776`)

Tests

- Remove unnecessary lines from py3-ignores (:issue:`1243`)
- Fix remaining warnings from pytest while collecting tests (:issue:`1206`)
- Add docs build to travis (:issue:`1234`)
- TST don't collect tests from deprecated modules. (:issue:`1165`)
- Install service_identity package in tests to prevent warnings
  (:issue:`1168`)
- Fix deprecated settings API in tests (:issue:`1152`)
- Add test for webclient with POST method and no body given (:issue:`1089`)
- py3-ignores.txt supports comments (:issue:`1044`)
- Modernize some of the asserts (:issue:`835`)
- Add selector.__repr__ test (:issue:`779`)

Code refactoring

- CSVFeedSpider cleanup: use iterate_spider_output (:issue:`1079`)
- Remove unnecessary check from scrapy.utils.spider.iter_spider_output
  (:issue:`1078`)
- Pydispatch pep8 (:issue:`992`)
- Removed unused 'load=False' parameter from walk_modules() (:issue:`871`)
- For consistency, use ``job_dir`` helper in ``SpiderState`` extension.
  (:issue:`805`)
- rename "sflo" local variables to less cryptic "log_observer" (:issue:`775`)

Scrapy 0.24.6 (2015-04-20)
--------------------------

- Encode invalid xpath with unicode_escape under PY2 (:commit:`07cb3e5`)
- Fix IPython shell scope issue and load IPython user config (:commit:`2c8e573`)
- Fix small typo in the docs (:commit:`d694019`)
- Fix small typo (:commit:`f92fa83`)
- Converted sel.xpath() calls to response.xpath() in Extracting the data (:commit:`c2c6d15`)

Scrapy 0.24.5 (2015-02-25)
--------------------------

- Support new _getEndpoint Agent signatures on Twisted 15.0.0 (:commit:`540b9bc`)
- DOC a couple more references are fixed (:commit:`b4c454b`)
- DOC fix a reference (:commit:`e3c1260`)
- t.i.b.ThreadedResolver is now a new-style class (:commit:`9e13f42`)
- S3DownloadHandler: fix auth for requests with quoted paths/query params (:commit:`cdb9a0b`)
- Fixed the variable types in mailsender documentation (:commit:`bb3a848`)
- Reset items_scraped instead of item_count (:commit:`edb07a4`)
- Tentative attention message about what document to read for contributions (:commit:`7ee6f7a`)
- mitmproxy 0.10.1 needs netlib 0.10.1 too (:commit:`874fcdd`)
- Pin mitmproxy 0.10.1 as >0.11 does not work with tests (:commit:`c6b21f0`)
- Test the parse command locally instead of against an external url (:commit:`c3a6628`)
- Patches Twisted issue while closing the connection pool on HTTPDownloadHandler (:commit:`d0bf957`)
- Updates documentation on dynamic item classes. (:commit:`eeb589a`)
- Merge pull request #943 from Lazar-T/patch-3 (:commit:`5fdab02`)
- Fix typo (:commit:`b0ae199`)
- pywin32 is required by Twisted. closes #937 (:commit:`5cb0cfb`)
- Update install.rst (:commit:`781286b`)
- Merge pull request #928 from Lazar-T/patch-1 (:commit:`b415d04`)
- Use comma instead of full stop (:commit:`627b9ba`)
- Merge pull request #885 from jsma/patch-1 (:commit:`de909ad`)
- Update request-response.rst (:commit:`3f3263d`)
- SgmlLinkExtractor - fix for parsing <area> tag with Unicode present (:commit:`49b40f0`)

Scrapy 0.24.4 (2014-08-09)
--------------------------

- PEM file is used by mockserver and required by scrapy bench (:commit:`5eddc68b63`)
- scrapy bench needs scrapy.tests* (:commit:`d6cb999`)

Scrapy 0.24.3 (2014-08-09)
--------------------------

- No need to waste travis-ci time on py3 for 0.24 (:commit:`8e080c1`)
- Update installation docs (:commit:`1d0c096`)
- There is a trove classifier for Scrapy framework! (:commit:`4c701d7`)
- Update other places where w3lib version is mentioned (:commit:`d109c13`)
- Update w3lib requirement to 1.8.0 (:commit:`39d2ce5`)
- Use w3lib.html.replace_entities() (remove_entities() is deprecated) (:commit:`180d3ad`)
- Set zip_safe=False (:commit:`a51ee8b`)
- Do not ship tests package (:commit:`ee3b371`)
- scrapy.bat is not needed anymore (:commit:`c3861cf`)
- Modernize setup.py (:commit:`362e322`)
- Headers cannot handle non-string values (:commit:`94a5c65`)
- Fix FTP test cases (:commit:`a274a7f`)
- Travis-CI builds now take around 50 minutes in total to complete (:commit:`ae1e2cc`)
- Fix shell.rst typo (:commit:`e49c96a`)
- Remove weird indentation in the shell results (:commit:`1ca489d`)
- Improved explanations, clarified blog post as source, added link for XPath string functions in the spec (:commit:`65c8f05`)
- Renamed UserTimeoutError and ServerTimeoutError (#583) (:commit:`037f6ab`)
- Add some XPath tips to selectors docs (:commit:`2d103e0`)
- Fix tests to account for https://github.com/scrapy/w3lib/pull/23 (:commit:`f8d366a`)
- get_func_args maximum recursion fix #728 (:commit:`81344ea`)
- Updated input/output processor example according to #560. (:commit:`f7c4ea8`)
- Fixed Python syntax in tutorial. (:commit:`db59ed9`)
- Add test case for tunneling proxy (:commit:`f090260`)
- Bugfix for leaking Proxy-Authorization header to remote host when using tunneling (:commit:`d8793af`)
- Extract links from XHTML documents with MIME-Type "application/xml" (:commit:`ed1f376`)
- Merge pull request #793 from roysc/patch-1 (:commit:`91a1106`)
- Fix typo in commands.rst (:commit:`743e1e2`)
- Better testcase for settings.overrides.setdefault (:commit:`e22daaf`)
- Using CRLF as line marker according to http 1.1 definition (:commit:`5ec430b`)

Scrapy 0.24.2 (2014-07-08)
--------------------------

- Use a mutable mapping to proxy deprecated settings.overrides and settings.defaults attribute (:commit:`e5e8133`)
- There is no support for Python 3 yet (:commit:`3cd6146`)
- Update the Python-compatible version set in Debian packages (:commit:`fa5d76b`)
- DOC fix formatting in release notes (:commit:`c6a9e20`)

Scrapy 0.24.1 (2014-06-27)
--------------------------

- Fix deprecated CrawlerSettings and increase backward compatibility with
  .defaults attribute (:commit:`8e3f20a`)

Scrapy 0.24.0 (2014-06-26)
--------------------------

Enhancements
~~~~~~~~~~~~

- Improve Scrapy top-level namespace (:issue:`494`, :issue:`684`)
- Add selector shortcuts to responses (:issue:`554`, :issue:`690`); see the
  sketch after this list
- Add new lxml based LinkExtractor to replace unmaintained SgmlLinkExtractor
  (:issue:`559`, :issue:`761`, :issue:`763`)
- Cleanup settings API - part of per-spider settings **GSoC project** (:issue:`737`)
- Add UTF8 encoding header to templates (:issue:`688`, :issue:`762`)
- Telnet console now binds to 127.0.0.1 by default (:issue:`699`)
- Update Debian/Ubuntu install instructions (:issue:`509`, :issue:`549`)
- Disable smart strings in lxml XPath evaluations (:issue:`535`)
- Restore filesystem based cache as default for http
  cache middleware (:issue:`541`, :issue:`500`, :issue:`571`)
- Expose current crawler in Scrapy shell (:issue:`557`)
- Improve testsuite comparing CSV and XML exporters (:issue:`570`)
- New ``offsite/filtered`` and ``offsite/domains`` stats (:issue:`566`)
- Support process_links as generator in CrawlSpider (:issue:`555`)
- Verbose logging and new stats counters for DupeFilter (:issue:`553`)
- Add a mimetype parameter to ``MailSender.send()`` (:issue:`602`)
- Generalize file pipeline log messages (:issue:`622`)
- Replace unencodeable codepoints with html entities in SGMLLinkExtractor (:issue:`565`)
- Converted SEP documents to rst format (:issue:`629`, :issue:`630`,
  :issue:`638`, :issue:`632`, :issue:`636`, :issue:`640`, :issue:`635`,
  :issue:`634`, :issue:`639`, :issue:`637`, :issue:`631`, :issue:`633`,
  :issue:`641`, :issue:`642`)
- Tests and docs for clickdata's nr index in FormRequest (:issue:`646`, :issue:`645`)
- Allow disabling a downloader handler just like any other component (:issue:`650`)
- Log when a request is discarded after too many redirections (:issue:`654`)
- Log error responses if they are not handled by spider callbacks
  (:issue:`612`, :issue:`656`)
- Add content-type check to http compression mw (:issue:`193`, :issue:`660`)
- Run PyPy tests using latest PyPy from PPA (:issue:`674`)
- Run test suite using pytest instead of trial (:issue:`679`)
- Build docs and check for dead links in tox environment (:issue:`687`)
- Make scrapy.version_info a tuple of integers (:issue:`681`, :issue:`692`)
- Infer exporter's output format from filename extensions
  (:issue:`546`, :issue:`659`, :issue:`760`)
- Support case-insensitive domains in ``url_is_from_any_domain()`` (:issue:`693`)
- Remove pep8 warnings in project and spider templates (:issue:`698`)
- Tests and docs for ``request_fingerprint`` function (:issue:`597`)
- Update SEP-19 for GSoC project ``per-spider settings`` (:issue:`705`)
- Set exit code to non-zero when contracts fail (:issue:`727`)
- Add a setting to control what class is instantiated as Downloader component
  (:issue:`738`)
- Pass response in ``item_dropped`` signal (:issue:`724`)
- Improve ``scrapy check`` contracts command (:issue:`733`, :issue:`752`)
- Document ``spider.closed()`` shortcut (:issue:`719`)
- Document ``request_scheduled`` signal (:issue:`746`)
- Add a note about reporting security issues (:issue:`697`)
- Add LevelDB http cache storage backend (:issue:`626`, :issue:`500`)
- Sort spider list output of ``scrapy list`` command (:issue:`742`)
- Multiple documentation enhancements and fixes
  (:issue:`575`, :issue:`587`, :issue:`590`, :issue:`596`, :issue:`610`,
  :issue:`617`, :issue:`618`, :issue:`627`, :issue:`613`, :issue:`643`,
  :issue:`654`, :issue:`675`, :issue:`663`, :issue:`711`, :issue:`714`)
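
A quick sketch of the new selector shortcuts on responses (the spider name,
URL and selectors are placeholders):

.. code-block:: python

    from scrapy.spider import Spider

    class ShortcutSpider(Spider):
        name = "shortcuts"
        start_urls = ["http://example.com/"]

        def parse(self, response):
            # response.xpath(...) and response.css(...) are shortcuts for
            # response.selector.xpath(...) and response.selector.css(...).
            links = response.css("a::attr(href)").extract()
            self.log("found %d links on %s" % (len(links), response.url))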

Bugfixes
~~~~~~~~

- Encode unicode URL value when creating Links in RegexLinkExtractor (:issue:`561`)
- Ignore None values in ItemLoader processors (:issue:`556`)
- Fix link text when there is an inner tag in SGMLLinkExtractor and
  HtmlParserLinkExtractor (:issue:`485`, :issue:`574`)
- Fix wrong checks on subclassing of deprecated classes
  (:issue:`581`, :issue:`584`)
- Handle errors caused by inspect.stack() failures (:issue:`582`)
- Fix a reference to nonexistent engine attribute (:issue:`593`, :issue:`594`)
- Fix dynamic itemclass example usage of type() (:issue:`603`)
- Use lucasdemarchi/codespell to fix typos (:issue:`628`)
- Fix default value of attrs argument in SgmlLinkExtractor to be tuple (:issue:`661`)
- Fix XXE flaw in sitemap reader (:issue:`676`)
- Fix engine to support filtered start requests (:issue:`707`)
- Fix offsite middleware case on urls with no hostnames (:issue:`745`)
- Testsuite doesn't require PIL anymore (:issue:`585`)

Scrapy 0.22.2 (released 2014-02-14)
-----------------------------------

- Fix a reference to nonexistent engine.slots. closes #593 (:commit:`13c099a`)
- Fix downloader MW doc typo (spider MW doc copy remnant) (:commit:`8ae11bf`)
- Correct typos (:commit:`1346037`)

Scrapy 0.22.1 (released 2014-02-08)
-----------------------------------

- localhost666 can resolve under certain circumstances (:commit:`2ec2279`)
- Test inspect.stack() failure (:commit:`cc3eda3`)
- Handle cases when inspect.stack() fails (:commit:`8cb44f9`)
- Fix wrong checks on subclassing of deprecated classes. closes #581 (:commit:`46d98d6`)
- Docs: 4-space indent for final spider example (:commit:`13846de`)
- Fix HtmlParserLinkExtractor and tests after #485 merge (:commit:`368a946`)
- BaseSgmlLinkExtractor: Fixed the missing space when the link has an inner tag (:commit:`b566388`)
- BaseSgmlLinkExtractor: Added unit test of a link with an inner tag (:commit:`c1cb418`)
- BaseSgmlLinkExtractor: Fixed unknown_endtag() so that it only sets current_link=None when the end tag matches the opening tag (:commit:`7e4d627`)
- Fix tests for Travis-CI build (:commit:`76c7e20`)
- Replace unencodeable codepoints with html entities. fixes #562 and #285 (:commit:`5f87b17`)
- RegexLinkExtractor: encode URL unicode value when creating Links (:commit:`d0ee545`)
- Updated the tutorial crawl output with latest output. (:commit:`8da65de`)
- Updated shell docs with the crawler reference and fixed the actual shell output. (:commit:`875b9ab`)
- PEP8 minor edits. (:commit:`f89efaf`)
- Expose current crawler in the Scrapy shell. (:commit:`5349cec`)
- Unused re import and PEP8 minor edits. (:commit:`387f414`)
- Ignore None values when using the ItemLoader. (:commit:`0632546`)
- DOC Fixed HTTPCACHE_STORAGE typo in the default value, which is now Filesystem instead of Dbm. (:commit:`cde9a8c`)
- Show Ubuntu setup instructions as literal code (:commit:`fb5c9c5`)
- Update Ubuntu installation instructions (:commit:`70fb105`)
- Merge pull request #550 from stray-leone/patch-1 (:commit:`6f70b6a`)
- Modify the version of Scrapy Ubuntu package (:commit:`725900d`)
- Fix 0.22.0 release date (:commit:`af0219a`)
- Fix typos in news.rst and remove (not released yet) header (:commit:`b7f58f4`)

Scrapy 0.22.0 (released 2014-01-17)
-----------------------------------

Enhancements
~~~~~~~~~~~~

- [**Backward incompatible**] Switched HTTPCacheMiddleware backend to filesystem (:issue:`541`).
  To restore the old backend, set ``HTTPCACHE_STORAGE`` to ``scrapy.contrib.httpcache.DbmCacheStorage``
  (see the snippet after this list)
- Proxy \https:// urls using CONNECT method (:issue:`392`, :issue:`397`)
- Add a middleware to crawl ajax crawlable pages as defined by google (:issue:`343`)
- Rename scrapy.spider.BaseSpider to scrapy.spider.Spider (:issue:`510`, :issue:`519`)
- Selectors register EXSLT namespaces by default (:issue:`472`)
- Unify item loaders similar to selectors renaming (:issue:`461`)
- Make ``RFPDupeFilter`` class easily subclassable (:issue:`533`)
- Improve test coverage and forthcoming Python 3 support (:issue:`525`)
- Promote startup info on settings and middleware to INFO level (:issue:`520`)
- Support partials in ``get_func_args`` util (:issue:`506`, :issue:`504`)
- Allow running individual tests via tox (:issue:`503`)
- Update extensions ignored by link extractors (:issue:`498`)
- Add middleware methods to get files/images/thumbs paths (:issue:`490`)
- Improve offsite middleware tests (:issue:`478`)
- Add a way to skip default Referer header set by RefererMiddleware (:issue:`475`)
- Do not send ``x-gzip`` in default ``Accept-Encoding`` header (:issue:`469`)
- Support defining http error handling using settings (:issue:`466`)
- Use modern Python idioms wherever legacies are found (:issue:`497`)
- Improve and correct documentation
  (:issue:`527`, :issue:`524`, :issue:`521`, :issue:`517`, :issue:`512`, :issue:`505`,
  :issue:`502`, :issue:`489`, :issue:`465`, :issue:`460`, :issue:`425`, :issue:`536`)
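
If the filesystem default does not suit a project, the old DBM backend can be
restored with a one-line settings change; a minimal sketch for ``settings.py``:

.. code-block:: python

    # settings.py -- restore the pre-0.22 DBM cache backend
    HTTPCACHE_ENABLED = True
    HTTPCACHE_STORAGE = 'scrapy.contrib.httpcache.DbmCacheStorage'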

Fixes
~~~~~

- Update Selector class imports in CrawlSpider template (:issue:`484`)
- Fix nonexistent reference to ``engine.slots`` (:issue:`464`)
- Do not try to call ``body_as_unicode()`` on a non-TextResponse instance (:issue:`462`)
- Warn when subclassing XPathItemLoader; previously it only warned on
  instantiation. (:issue:`523`)
- Warn when subclassing XPathSelector; previously it only warned on
  instantiation. (:issue:`537`)
- Multiple fixes to memory stats (:issue:`531`, :issue:`530`, :issue:`529`)
- Fix overriding url in ``FormRequest.from_response()`` (:issue:`507`)
- Fix tests runner under pip 1.5 (:issue:`513`)
- Fix logging error when spider name is unicode (:issue:`479`)

Scrapy 0.20.2 (released 2013-12-09)
-----------------------------------

- Update CrawlSpider Template with Selector changes (:commit:`6d1457d`)
- Fix method name in tutorial. closes GH-480 (:commit:`b4fc359`)

Scrapy 0.20.1 (released 2013-11-28)
-----------------------------------

- include_package_data is required to build wheels from published sources (:commit:`5ba1ad5`)
- process_parallel was leaking the failures on its internal deferreds. closes #458 (:commit:`419a780`)

Scrapy 0.20.0 (released 2013-11-08)
-----------------------------------

Enhancements
~~~~~~~~~~~~

- New Selector API including CSS selectors (:issue:`395` and :issue:`426`)
- Request/Response url/body attributes are now immutable
  (modifying them had been deprecated for a long time)
- :setting:`ITEM_PIPELINES` is now defined as a dict instead of a list (see the snippet after this list)
- Sitemap spider can fetch alternate URLs (:issue:`360`)
- ``Selector.remove_namespaces()`` now removes namespaces from the element's attributes. (:issue:`416`)
- Paved the road for Python 3.3+ (:issue:`435`, :issue:`436`, :issue:`431`, :issue:`452`)
- New item exporter using native python types with nesting support (:issue:`366`)
- Tune HTTP1.1 pool size so it matches concurrency defined by settings (:commit:`b43b5f575`)
- scrapy.mail.MailSender now can connect over TLS or upgrade using STARTTLS (:issue:`327`)
- New FilesPipeline with functionality factored out from ImagesPipeline (:issue:`370`, :issue:`409`)
- Recommend Pillow instead of PIL for image handling (:issue:`317`)
- Added Debian packages for Ubuntu Quantal and Raring (:commit:`86230c0`)
- Mock server (used for tests) can listen for HTTPS requests (:issue:`410`)
- Remove multi spider support from multiple core components
  (:issue:`422`, :issue:`421`, :issue:`420`, :issue:`419`, :issue:`423`, :issue:`418`)
- Travis-CI now tests Scrapy changes against development versions of ``w3lib`` and ``queuelib`` python packages.
- Add pypy 2.1 to continuous integration tests (:commit:`ecfa7431`)
- Pylinted, pep8-checked and removed old-style exceptions from source (:issue:`430`, :issue:`432`)
- Use importlib for parametric imports (:issue:`445`)
- Handle a regression introduced in Python 2.7.5 that affects XmlItemExporter (:issue:`372`)
- Bugfix crawling shutdown on SIGINT (:issue:`450`)
- Do not submit ``reset`` type inputs in FormRequest.from_response (:commit:`b326b87`)
- Do not silence download errors when request errback raises an exception (:commit:`684cfc0`)
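
A sketch of the new dict form of :setting:`ITEM_PIPELINES`; the pipeline
paths are placeholders and the values define the processing order:

.. code-block:: python

    # settings.py
    ITEM_PIPELINES = {
        'myproject.pipelines.ValidationPipeline': 100,  # runs first
        'myproject.pipelines.StoragePipeline': 800,     # runs later
    }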

Bugfixes
~~~~~~~~

- Fix tests under Django 1.6 (:commit:`b6bed44c`)
- Lots of bugfixes to retry middleware under disconnections using HTTP 1.1 download handler
- Fix inconsistencies among Twisted releases (:issue:`406`)
- Fix Scrapy shell bugs (:issue:`418`, :issue:`407`)
- Fix invalid variable name in setup.py (:issue:`429`)
- Fix tutorial references (:issue:`387`)
- Improve request-response docs (:issue:`391`)
- Improve best practices docs (:issue:`399`, :issue:`400`, :issue:`401`, :issue:`402`)
- Improve django integration docs (:issue:`404`)
- Document ``bindaddress`` request meta (:commit:`37c24e01d7`)
- Improve ``Request`` class documentation (:issue:`226`)

Other
~~~~~

- Dropped Python 2.6 support (:issue:`448`)
- Add :doc:`cssselect <cssselect:index>` python package as install dependency
- Drop libxml2 and multi selector backend support; `lxml`_ is required from now on.
- Minimum Twisted version increased to 10.0.0, dropped Twisted 8.0 support.
- Running test suite now requires ``mock`` python library (:issue:`390`)

Thanks
~~~~~~

Thanks to everyone who contributed to this release!

List of contributors sorted by number of commits::

     69 Daniel Graña <dangra@...>
     37 Pablo Hoffman <pablo@...>
     13 Mikhail Korobov <kmike84@...>
      9 Alex Cepoi <alex.cepoi@...>
      9 alexanderlukanin13 <alexander.lukanin.13@...>
      8 Rolando Espinoza La fuente <darkrho@...>
      8 Lukasz Biedrycki <lukasz.biedrycki@...>
      6 Nicolas Ramirez <nramirez.uy@...>
      3 Paul Tremberth <paul.tremberth@...>
      2 Martin Olveyra <molveyra@...>
      2 Stefan <misc@...>
      2 Rolando Espinoza <darkrho@...>
      2 Loren Davie <loren@...>
      2 irgmedeiros <irgmedeiros@...>
      1 Stefan Koch <taikano@...>
      1 Stefan <cct@...>
      1 scraperdragon <dragon@...>
      1 Kumara Tharmalingam <ktharmal@...>
      1 Francesco Piccinno <stack.box@...>
      1 Marcos Campal <duendex@...>
      1 Dragon Dave <dragon@...>
      1 Capi Etheriel <barraponto@...>
      1 cacovsky <amarquesferraz@...>
      1 Berend Iwema <berend@...>

Scrapy 0.18.4 (released 2013-10-10)
-----------------------------------

- IPython refuses to update the namespace. fixes #396 (:commit:`3d32c4f`)
- Fix AlreadyCalledError replacing a request in shell command. closes #407 (:commit:`b1d8919`)
- Fix ``start_requests()`` laziness and early hangs (:commit:`89faf52`)

Scrapy 0.18.3 (released 2013-10-03)
-----------------------------------

- Fix regression on lazy evaluation of start requests (:commit:`12693a5`)
- Forms: do not submit reset inputs (:commit:`e429f63`)
- Increase unittest timeouts to decrease travis false positive failures (:commit:`912202e`)
- Backport master fixes to json exporter (:commit:`cfc2d46`)
- Fix permission and set umask before generating sdist tarball (:commit:`06149e0`)

Scrapy 0.18.2 (released 2013-09-03)
-----------------------------------

- Backport ``scrapy check`` command fixes and backward compatible multi
  crawler process (:issue:`339`)

Scrapy 0.18.1 (released 2013-08-27)
-----------------------------------

- Remove extra import added by cherry-picked changes (:commit:`d20304e`)
- Fix crawling tests under Twisted pre 11.0.0 (:commit:`1994f38`)
- Py26 cannot format zero-length fields {} (:commit:`abf756f`)
- Test PotentialDataLoss errors on unbound responses (:commit:`b15470d`)
- Treat responses without content-length or Transfer-Encoding as good responses (:commit:`c4bf324`)
- Do not include ResponseFailed if http11 handler is not enabled (:commit:`6cbe684`)
- New HTTP client wraps connection lost in ResponseFailed exception. fix #373 (:commit:`1a20bba`)
- limit travis-ci build matrix (:commit:`3b01bb8`)
- Merge pull request #375 from peterarenot/patch-1 (:commit:`fa766d7`)
- Fixed so it refers to the correct folder (:commit:`3283809`)
- added Quantal & Raring to support Ubuntu releases (:commit:`1411923`)
- Fix retry middleware which didn't retry certain connection errors after the upgrade to http1 client, closes GH-373 (:commit:`bb35ed0`)
- fix XmlItemExporter in Python 2.7.4 and 2.7.5 (:commit:`de3e451`)
- Minor updates to 0.18 release notes (:commit:`c45e5f1`)
- Fix contributors list format (:commit:`0b60031`)

Scrapy 0.18.0 (released 2013-08-09)
-----------------------------------

- Lots of improvements to the testsuite run using Tox, including a way to test on pypi
- Handle GET parameters for AJAX crawlable urls (:commit:`3fe2a32`)
- Use lxml recover option to parse sitemaps (:issue:`347`)
- Bugfix cookie merging by hostname and not by netloc (:issue:`352`)
- Support disabling ``HttpCompressionMiddleware`` using a flag setting (:issue:`359`)
- Support xml namespaces using ``iternodes`` parser in ``XMLFeedSpider`` (:issue:`12`)
- Support ``dont_cache`` request meta flag (:issue:`19`)
- Bugfix ``scrapy.utils.gz.gunzip`` broken by changes in python 2.7.4 (:commit:`4dc76e`)
- Bugfix url encoding on ``SgmlLinkExtractor`` (:issue:`24`)
- Bugfix ``TakeFirst`` processor shouldn't discard zero (0) value (:issue:`59`)
- Support nested items in xml exporter (:issue:`66`)
- Improve cookies handling performance (:issue:`77`)
- Log dupe filtered requests once (:issue:`105`)
- Split redirection middleware into status and meta based middlewares (:issue:`78`)
- Use HTTP1.1 as default downloader handler (:issue:`109` and :issue:`318`)
- Support xpath form selection on ``FormRequest.from_response`` (:issue:`185`)
- Bugfix unicode decoding error on ``SgmlLinkExtractor`` (:issue:`199`)
- Bugfix signal dispatching on PyPy interpreter (:issue:`205`)
- Improve request delay and concurrency handling (:issue:`206`)
- Add RFC2616 cache policy to ``HttpCacheMiddleware`` (:issue:`212`)
- Allow customization of messages logged by engine (:issue:`214`)
- Multiple improvements to ``DjangoItem`` (:issue:`217`, :issue:`218`, :issue:`221`)
- Extend Scrapy commands using setuptools entry points (:issue:`260`)
- Allow spider ``allowed_domains`` value to be a set/tuple (:issue:`261`)
- Support ``settings.getdict`` (:issue:`269`)
- Simplify internal ``scrapy.core.scraper`` slot handling (:issue:`271`)
- Added ``Item.copy`` (:issue:`290`)
- Collect idle downloader slots (:issue:`297`)
- Add ``ftp://`` scheme downloader handler (:issue:`329`)
- Added downloader benchmark webserver and spider tools (see :ref:`benchmarking`)
- Moved persistent (on disk) queues to a separate project (queuelib_) which Scrapy now depends on
- Add Scrapy commands using external libraries (:issue:`260`)
- Added ``--pdb`` option to ``scrapy`` command line tool
- Added :meth:`XPathSelector.remove_namespaces <scrapy.Selector.remove_namespaces>` which allows removing all namespaces from XML documents for convenience (to work with namespace-less XPaths). Documented in :ref:`topics-selectors`.
- Several improvements to spider contracts
- New default middleware named MetaRefreshMiddleware that handles meta-refresh html tag redirections
- MetaRefreshMiddleware and RedirectMiddleware have different priorities to address #62
- Added from_crawler method to spiders
- Added system tests with mock server
- More improvements to macOS compatibility (thanks Alex Cepoi)
- Several more cleanups to singletons and multi-spider support (thanks Nicolas Ramirez)
- Support custom download slots
- Added --spider option to "shell" command.
- Log overridden settings when Scrapy starts

Thanks to everyone who contributed to this release. Here is a list of
contributors sorted by number of commits::

    130 Pablo Hoffman <pablo@...>
     97 Daniel Graña <dangra@...>
     20 Nicolás Ramírez <nramirez.uy@...>
     13 Mikhail Korobov <kmike84@...>
     12 Pedro Faustino <pedrobandim@...>
     11 Steven Almeroth <sroth77@...>
      5 Rolando Espinoza La fuente <darkrho@...>
      4 Michal Danilak <mimino.coder@...>
      4 Alex Cepoi <alex.cepoi@...>
      4 Alexandr N Zamaraev (aka tonal) <tonal@...>
      3 paul <paul.tremberth@...>
      3 Martin Olveyra <molveyra@...>
      3 Jordi Llonch <llonchj@...>
      3 arijitchakraborty <myself.arijit@...>
      2 Shane Evans <shane.evans@...>
      2 joehillen <joehillen@...>
      2 Hart <HartSimha@...>
      2 Dan <ellisd23@...>
      1 Zuhao Wan <wanzuhao@...>
      1 whodatninja <blake@...>
      1 vkrest <v.krestiannykov@...>
      1 tpeng <pengtaoo@...>
      1 Tom Mortimer-Jones <tom@...>
      1 Rocio Aramberri <roschegel@...>
      1 Pedro <pedro@...>
      1 notsobad <wangxiaohugg@...>
      1 Natan L <kuyanatan.nlao@...>
      1 Mark Grey <mark.grey@...>
      1 Luan <luanpab@...>
      1 Libor Nenadál <libor.nenadal@...>
      1 Juan M Uys <opyate@...>
      1 Jonas Brunsgaard <jonas.brunsgaard@...>
      1 Ilya Baryshev <baryshev@...>
      1 Hasnain Lakhani <m.hasnain.lakhani@...>
      1 Emanuel Schorsch <emschorsch@...>
      1 Chris Tilden <chris.tilden@...>
      1 Capi Etheriel <barraponto@...>
      1 cacovsky <amarquesferraz@...>
      1 Berend Iwema <berend@...>

Scrapy 0.16.5 (released 2013-05-30)
-----------------------------------

- Obey request method when Scrapy deploy is redirected to a new endpoint (:commit:`8c4fcee`)
- Fix inaccurate downloader middleware documentation. refs #280 (:commit:`40667cb`)
- doc: remove links to diveintopython.org, which is no longer available. closes #246 (:commit:`bd58bfa`)
- Find form nodes in invalid html5 documents (:commit:`e3d6945`)
- Fix typo labeling attrs type bool instead of list (:commit:`a274276`)

Scrapy 0.16.4 (released 2013-01-23)
-----------------------------------

- Fix spelling errors in documentation (:commit:`6d2b3aa`)
- Add doc about disabling an extension. refs #132 (:commit:`c90de33`)
- Fixed error message formatting: log.err() doesn't support string formatting, so when an error occurred the message was "ERROR: Error processing %(item)s" (:commit:`c16150c`)
- Lint and improve images pipeline error logging (:commit:`56b45fc`)
- Fixed doc typos (:commit:`243be84`)
- Add documentation topics: Broad Crawls & Common Practices (:commit:`1fbb715`)
- Fix bug in Scrapy parse command when spider is not specified explicitly. closes #209 (:commit:`c72e682`)
- Update docs/topics/commands.rst (:commit:`28eac7a`)

Scrapy 0.16.3 (released 2012-12-07)
-----------------------------------

- Remove concurrency limitation when using download delays and still ensure inter-request delays are enforced (:commit:`487b9b5`)
- Add error details when image pipeline fails (:commit:`8232569`)
- Improve macOS compatibility (:commit:`8dcf8aa`)
- setup.py: use README.rst to populate long_description (:commit:`7b5310d`)
- doc: removed obsolete references to ClientForm (:commit:`80f9bb6`)
- Correct docs for default storage backend (:commit:`2aa491b`)
- doc: removed broken proxyhub link from FAQ (:commit:`bdf61c4`)
- Fixed docs typo in SpiderOpenCloseLogging example (:commit:`7184094`)

Scrapy 0.16.2 (released 2012-11-09)
-----------------------------------

- Scrapy contracts: python2.6 compat (:commit:`a4a9199`)
- Scrapy contracts verbose option (:commit:`ec41673`)
- proper unittest-like output for Scrapy contracts (:commit:`86635e4`)
- Added open_in_browser to debugging doc (:commit:`c9b690d`)
- Removed reference to global Scrapy stats from settings doc (:commit:`dd55067`)
- Fix SpiderState bug in Windows platforms (:commit:`58998f4`)

Scrapy 0.16.1 (released 2012-10-26)
-----------------------------------

- Fixed LogStats extension, which got broken after a wrong merge before the 0.16 release (:commit:`8c780fd`)
- Better backward compatibility for scrapy.conf.settings (:commit:`3403089`)
- Extended documentation on how to access crawler stats from extensions (:commit:`c4da0b5`)
- Removed .hgtags (no longer needed now that Scrapy uses git) (:commit:`d52c188`)
- Fix dashes under rst headers (:commit:`fa4f7f9`)
- Set release date for 0.16.0 in news (:commit:`e292246`)

Scrapy 0.16.0 (released 2012-10-18)
-----------------------------------

Scrapy changes:

- added :ref:`topics-contracts`, a mechanism for testing spiders in a formal/reproducible way
- added options ``-o`` and ``-t`` to the :command:`runspider` command
- documented :doc:`topics/autothrottle` and added to extensions installed by default. You still need to enable it with :setting:`AUTOTHROTTLE_ENABLED`
- major Stats Collection refactoring: removed separation of global/per-spider stats, removed stats-related signals (``stats_spider_opened``, etc). Stats are much simpler now; backward compatibility is kept on the Stats Collector API and signals.
- added a ``process_start_requests()`` method to spider middlewares
- dropped Signals singleton. Signals should now be accessed through the Crawler.signals attribute. See the signals documentation for more info.
- dropped Stats Collector singleton. Stats can now be accessed through the Crawler.stats attribute. See the stats collection documentation for more info.
- documented :ref:`topics-api`
- ``lxml`` is now the default selectors backend instead of ``libxml2``
- ported FormRequest.from_response() to use `lxml`_ instead of `ClientForm`_
- removed modules: ``scrapy.xlib.BeautifulSoup`` and ``scrapy.xlib.ClientForm``
- SitemapSpider: added support for sitemap urls ending in .xml and .xml.gz, even if they advertise a wrong content type (:commit:`10ed28b`)
- StackTraceDump extension: also dump trackref live references (:commit:`fe2ce93`)
- nested items now fully supported in JSON and JSONLines exporters
- added :reqmeta:`cookiejar` Request meta key to support multiple cookie sessions per spider (see the sketch after this list)
- decoupled encoding detection code to `w3lib.encoding`_, and ported Scrapy code to use that module
- dropped support for Python 2.5. See https://www.zyte.com/blog/scrapy-0-15-dropping-support-for-python-2-5/
- dropped support for Twisted 2.5
- added :setting:`REFERER_ENABLED` setting, to control referer middleware
- changed default user agent to: ``Scrapy/VERSION (+http://scrapy.org)``
- removed (undocumented) ``HTMLImageLinkExtractor`` class from ``scrapy.contrib.linkextractors.image``
- removed per-spider settings (to be replaced by instantiating multiple crawler objects)
- ``USER_AGENT`` spider attribute will no longer work, use ``user_agent`` attribute instead
- ``DOWNLOAD_TIMEOUT`` spider attribute will no longer work, use ``download_timeout`` attribute instead
- removed ``ENCODING_ALIASES`` setting, as encoding auto-detection has been moved to the `w3lib`_ library
- promoted :ref:`topics-djangoitem` to main contrib
- LogFormatter methods now return dicts (instead of strings) to support lazy formatting (:issue:`164`, :commit:`dcef7b0`)
- downloader handlers (:setting:`DOWNLOAD_HANDLERS` setting) now receive settings as the first argument of the ``__init__`` method
- replaced memory usage accounting with (more portable) `resource`_ module, removed ``scrapy.utils.memory`` module
- removed signal: ``scrapy.mail.mail_sent``
- removed ``TRACK_REFS`` setting, now :ref:`trackrefs <topics-leaks-trackrefs>` is always enabled
- DBM is now the default storage backend for HTTP cache middleware
- number of log messages (per level) is now tracked through Scrapy stats (stat name: ``log_count/LEVEL``)
- number of received responses is now tracked through Scrapy stats (stat name: ``response_received_count``)
- removed ``scrapy.log.started`` attribute
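
A minimal sketch of the new :reqmeta:`cookiejar` meta key (URLs and spider
boilerplate are placeholders); each distinct key value keeps an independent
cookie session:

.. code-block:: python

    from scrapy.http import Request
    from scrapy.spider import BaseSpider

    class SessionsSpider(BaseSpider):
        name = "sessions"
        start_urls = ["http://example.com/a", "http://example.com/b"]

        def start_requests(self):
            # One independent cookie session per start URL.
            for i, url in enumerate(self.start_urls):
                yield Request(url, meta={"cookiejar": i})

        def parse(self, response):
            # Pass the same key along to stay within the same session.
            yield Request("http://example.com/next",
                          meta={"cookiejar": response.meta["cookiejar"]},
                          dont_filter=True)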

Scrapy 0.14.4
-------------

- Added Precise to supported Ubuntu distros (:commit:`b7e46df`)
- fixed bug in json-rpc webservice reported in https://groups.google.com/forum/#!topic/scrapy-users/qgVBmFybNAQ/discussion. also removed no longer supported 'run' command from extras/scrapy-ws.py (:commit:`340fbdb`)
- Meta tag attributes for content-type http equiv can be in any order. #123 (:commit:`0cb68af`)
- Replace "import Image" with the more standard "from PIL import Image". closes #88 (:commit:`4d17048`)
- Return trial status as bin/runtests.sh exit value. #118 (:commit:`b7b2e7f`)

Scrapy 0.14.3
-------------

- Forgot to include pydispatch license. #118 (:commit:`fd85f9c`)
- Include egg files used by testsuite in source distribution. #118 (:commit:`c897793`)
- Update docstring in project template to avoid confusion with genspider command, which may be considered as an advanced feature. refs #107 (:commit:`2548dcc`)
- Added note to docs/topics/firebug.rst about google directory being shut down (:commit:`668e352`)
- Don't discard slot when empty; just save it in another dict in order to recycle it if needed again. (:commit:`8e9f607`)
- Do not fail handling unicode xpaths in libxml2-backed selectors (:commit:`b830e95`)
- Fixed minor mistake in Request objects documentation (:commit:`bf3c9ee`)
- Fixed minor defect in link extractors documentation (:commit:`ba14f38`)
- Removed some obsolete remaining code related to sqlite support in Scrapy (:commit:`0665175`)

Scrapy 0.14.2
-------------

- Seek buffer to start of file before computing checksum. refs #92 (:commit:`6a5bef2`)
- Compute image checksum before persisting images. closes #92 (:commit:`9817df1`)
- Remove leaking references in cached failures (:commit:`673a120`)
- Fixed bug in MemoryUsage extension: get_engine_status() takes exactly 1 argument (0 given) (:commit:`11133e9`)
- Fixed struct.error on http compression middleware. closes #87 (:commit:`1423140`)
- AJAX crawling wasn't expanding for unicode urls (:commit:`0de3fb4`)
- Catch ``start_requests()`` iterator errors. refs #83 (:commit:`454a21d`)
- Speed up libxml2 XPathSelector (:commit:`2fbd662`)
- Updated versioning doc according to recent changes (:commit:`0a070f5`)
- scrapyd: fixed documentation link (:commit:`2b4e4c3`)
- extras/makedeb.py: no longer obtaining version from git (:commit:`caffe0e`)

Scrapy 0.14.1
-------------

- extras/makedeb.py: no longer obtaining version from git (:commit:`caffe0e`)
- Bumped version to 0.14.1 (:commit:`6cb9e1c`)
- Fixed reference to tutorial directory (:commit:`4b86bd6`)
- doc: removed duplicated callback argument from Request.replace() (:commit:`1aeccdd`)
- Fixed formatting of scrapyd doc (:commit:`8bf19e6`)
- Dump stacks for all running threads and fix engine status dumped by StackTraceDump extension (:commit:`14a8e6e`)
- Added comment about why we disable ssl on boto images upload (:commit:`5223575`)
- SSL handshaking hangs when doing too many parallel connections to S3 (:commit:`63d583d`)
- Change tutorial to follow changes on dmoz site (:commit:`bcb3198`)
- Avoid _disconnectedDeferred AttributeError exception in Twisted>=11.1.0 (:commit:`98f3f87`)
- Allow spider to set autothrottle max concurrency (:commit:`175a4b5`)

Scrapy 0.14
-----------

New features and settings
~~~~~~~~~~~~~~~~~~~~~~~~~

- Support for AJAX crawlable urls
- New persistent scheduler that stores requests on disk, allowing crawls to be suspended and resumed (:rev:`2737`)
- added ``-o`` option to ``scrapy crawl``, a shortcut for dumping scraped items into a file (or standard output using ``-``)
- Added support for passing custom settings to Scrapyd ``schedule.json`` api (:rev:`2779`, :rev:`2783`)
- New ``ChunkedTransferMiddleware`` (enabled by default) to support `chunked transfer encoding`_ (:rev:`2769`)
- Add boto 2.0 support for S3 downloader handler (:rev:`2763`)
- Added `marshal`_ to formats supported by feed exports (:rev:`2744`)
- In request errbacks, offending requests are now received in ``failure.request`` attribute (:rev:`2738`); see the sketch after this list
- Big downloader refactoring to support per domain/ip concurrency limits (:rev:`2732`)
   - ``CONCURRENT_REQUESTS_PER_SPIDER`` setting has been deprecated and replaced by:
      - :setting:`CONCURRENT_REQUESTS`, :setting:`CONCURRENT_REQUESTS_PER_DOMAIN`, :setting:`CONCURRENT_REQUESTS_PER_IP`
   - check the documentation for more details
- Added builtin caching DNS resolver (:rev:`2728`)
- Moved Amazon AWS-related components/extensions (SQS spider queue, SimpleDB stats collector) to a separate project: `scaws <https://github.com/scrapinghub/scaws>`_ (:rev:`2706`, :rev:`2714`)
- Moved spider queues to scrapyd: ``scrapy.spiderqueue`` -> ``scrapyd.spiderqueue`` (:rev:`2708`)
- Moved sqlite utils to scrapyd: ``scrapy.utils.sqlite`` -> ``scrapyd.sqlite`` (:rev:`2781`)
- Real support for returning iterators from the ``start_requests()`` method. The iterator is now consumed during the crawl, when the spider is getting idle (:rev:`2704`)
- Added :setting:`REDIRECT_ENABLED` setting to quickly enable/disable the redirect middleware (:rev:`2697`)
- Added :setting:`RETRY_ENABLED` setting to quickly enable/disable the retry middleware (:rev:`2694`)
- Added ``CloseSpider`` exception to manually close spiders (:rev:`2691`)
- Improved encoding detection by adding support for HTML5 meta charset declaration (:rev:`2690`)
- Refactored close spider behavior to wait for all downloads to finish and be processed by spiders, before closing the spider (:rev:`2688`)
- Added ``SitemapSpider`` (see documentation in Spiders page) (:rev:`2658`)
- Added ``LogStats`` extension for periodically logging basic stats (like crawled pages and scraped items) (:rev:`2657`)
- Make handling of gzipped responses more robust (#319, :rev:`2643`). Now Scrapy will try to decompress as much as possible from a gzipped response, instead of failing with an ``IOError``.
- Simplified MemoryDebugger extension to use stats for dumping memory debugging info (:rev:`2639`)
- Added new command to edit spiders: ``scrapy edit`` (:rev:`2636`) and ``-e`` flag to ``genspider`` command that uses it (:rev:`2653`)
- Changed default representation of items to pretty-printed dicts. (:rev:`2631`). This improves default logging by making log more readable in the default case, for both Scraped and Dropped lines.
- Added :signal:`spider_error` signal (:rev:`2628`)
- Added :setting:`COOKIES_ENABLED` setting (:rev:`2625`)
- Stats are now dumped to Scrapy log (default value of :setting:`STATS_DUMP` setting has been changed to ``True``). This is to make Scrapy users more aware of Scrapy stats and the data that is collected there.
- Added support for dynamically adjusting download delay and maximum concurrent requests (:rev:`2599`)
- Added new DBM HTTP cache storage backend (:rev:`2576`)
- Added ``listjobs.json`` API to Scrapyd (:rev:`2571`)
- ``CsvItemExporter``: added ``join_multivalued`` parameter (:rev:`2578`)
- Added namespace support to ``xmliter_lxml`` (:rev:`2552`)
- Improved cookies middleware by making ``COOKIES_DEBUG`` nicer and documenting it (:rev:`2579`)
- Several improvements to Scrapyd and Link extractors
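
A sketch of the new errback behaviour (URL and spider boilerplate are
placeholders); the request that failed now travels on the failure object:

.. code-block:: python

    from scrapy.http import Request
    from scrapy.spider import BaseSpider

    class ErrbackSpider(BaseSpider):
        name = "errback-example"

        def start_requests(self):
            yield Request("http://example.com/", callback=self.parse,
                          errback=self.on_error)

        def parse(self, response):
            self.log("got %s" % response.url)

        def on_error(self, failure):
            # New in 0.14: the offending request is available on the failure.
            self.log("request failed: %s" % failure.request.url)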

Code rearranged and removed
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Merged item passed and item scraped concepts, as they have often proved confusing in the past. This means: (:rev:`2630`)
   - original item_scraped signal was removed
   - original item_passed signal was renamed to item_scraped
   - old log lines ``Scraped Item...`` were removed
   - old log lines ``Passed Item...`` were renamed to ``Scraped Item...`` lines and downgraded to ``DEBUG`` level
- Reduced Scrapy codebase by stripping part of Scrapy code into two new libraries:
   - `w3lib`_ (several functions from ``scrapy.utils.{http,markup,multipart,response,url}``, done in :rev:`2584`)
   - `scrapely`_ (was ``scrapy.contrib.ibl``, done in :rev:`2586`)
- Removed unused function: ``scrapy.utils.request.request_info()`` (:rev:`2577`)
- Removed googledir project from ``examples/googledir``. There's now a new example project called ``dirbot`` available on GitHub: https://github.com/scrapy/dirbot
- Removed support for default field values in Scrapy items (:rev:`2616`)
- Removed experimental crawlspider v2 (:rev:`2632`)
- Removed scheduler middleware to simplify architecture. Duplicates filter is now done in the scheduler itself, using the same dupe filtering class as before (``DUPEFILTER_CLASS`` setting) (:rev:`2640`)
- Removed support for passing urls to ``scrapy crawl`` command (use ``scrapy parse`` instead) (:rev:`2704`)
- Removed deprecated Execution Queue (:rev:`2704`)
- Removed (undocumented) spider context extension (from scrapy.contrib.spidercontext) (:rev:`2780`)
- Removed ``CONCURRENT_SPIDERS`` setting (use scrapyd maxproc instead) (:rev:`2789`)
- Renamed attributes of core components: downloader.sites -> downloader.slots, scraper.sites -> scraper.slots (:rev:`2717`, :rev:`2718`)
- Renamed setting ``CLOSESPIDER_ITEMPASSED`` to :setting:`CLOSESPIDER_ITEMCOUNT` (:rev:`2655`). Backward compatibility kept.

Scrapy 0.12
-----------

The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.

New features and improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Passed item is now sent in the ``item`` argument of the :signal:`item_passed
  <item_scraped>` signal (#273)
- Added verbose option to ``scrapy version`` command, useful for bug reports (#298)
- HTTP cache now stored by default in the project data dir (#279)
- Added project data storage directory (#276, #277)
- Documented file structure of Scrapy projects (see command-line tool doc)
- New lxml backend for XPath selectors (#147)
- Per-spider settings (#245)
- Support exit codes to signal errors in Scrapy commands (#248)
- Added ``-c`` argument to ``scrapy shell`` command
- Made ``libxml2`` optional (#260)
- New ``deploy`` command (#261)
- Added :setting:`CLOSESPIDER_PAGECOUNT` setting (#253)
- Added :setting:`CLOSESPIDER_ERRORCOUNT` setting (#254); both illustrated after this list
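
A sketch of the two new close-spider settings in ``settings.py`` (the numbers
are arbitrary examples):

.. code-block:: python

    # settings.py -- stop the crawl automatically
    CLOSESPIDER_PAGECOUNT = 1000  # close the spider after 1000 responses
    CLOSESPIDER_ERRORCOUNT = 10   # ...or after 10 errors, whichever comes first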

Scrapyd changes
~~~~~~~~~~~~~~~

- Scrapyd now uses one process per spider
- It stores one log file per spider run, and rotates them, keeping the latest 5 logs per spider (by default)
- A minimal web ui was added, available at http://localhost:6800 by default
- There is now a ``scrapy server`` command to start a Scrapyd server of the current project

Changes to settings
~~~~~~~~~~~~~~~~~~~

- Added ``HTTPCACHE_ENABLED`` setting (False by default) to enable HTTP cache middleware
- Changed ``HTTPCACHE_EXPIRATION_SECS`` semantics: now zero means "never expire".

Deprecated/obsoleted functionality
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Deprecated ``runserver`` command in favor of ``server`` command which starts a Scrapyd server. See also: Scrapyd changes
- Deprecated ``queue`` command in favor of using Scrapyd ``schedule.json`` API. See also: Scrapyd changes
- Removed the LxmlItemLoader (experimental contrib which never graduated to main contrib)

Scrapy 0.10
-----------

The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.

New features and improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- New Scrapy service called ``scrapyd`` for deploying Scrapy crawlers in production (#218) (documentation available)
- Simplified Images pipeline usage: subclassing your own images pipeline is no longer required (#217)
- Scrapy shell now shows the Scrapy log by default (#206)
- Refactored execution queue in a common base code and pluggable backends called "spider queues" (#220)
- New persistent spider queue (based on SQLite) (#198), available by default, which allows starting Scrapy in server mode and then scheduling spiders to run.
- Added documentation for Scrapy command-line tool and all its available sub-commands. (documentation available)
- Feed exporters with pluggable backends (#197) (documentation available)
- Deferred signals (#193)
- Added two new methods to item pipeline open_spider(), close_spider() with deferred support (#195)
- Support for overriding default request headers per spider (#181)
- Replaced default Spider Manager with one with similar functionality but not depending on Twisted Plugins (#186)
- Split Debian package into two packages - the library and the service (#187)
- Scrapy log refactoring (#188)
- New extension for keeping persistent spider contexts among different runs (#203)
- Added ``dont_redirect`` request.meta key for avoiding redirects (#233)
- Added ``dont_retry`` request.meta key for avoiding retries (#234); both illustrated after this list
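
A sketch of the two new meta keys (the URL is a placeholder):

.. code-block:: python

    from scrapy.http import Request

    # Per-request opt-outs, honoured by the redirect and retry middlewares:
    request = Request("http://example.com/",
                      meta={"dont_redirect": True, "dont_retry": True})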

Command-line tool changes
~~~~~~~~~~~~~~~~~~~~~~~~~

- New ``scrapy`` command which replaces the old ``scrapy-ctl.py`` (#199)
  - there is only one global ``scrapy`` command now, instead of one ``scrapy-ctl.py`` per project
  - Added ``scrapy.bat`` script for running more conveniently from Windows
- Added bash completion to command-line tool (#210)
- Renamed command ``start`` to ``runserver`` (#209)

API changes
~~~~~~~~~~~

- ``url`` and ``body`` attributes of Request objects are now read-only (#230)
- ``Request.copy()`` and ``Request.replace()`` now also copies their ``callback`` and ``errback`` attributes (#231)
- Removed ``UrlFilterMiddleware`` from ``scrapy.contrib`` (already disabled by default)
- Offsite middleware doesn't filter out any request coming from a spider that doesn't have an allowed_domains attribute (#225)
- Removed Spider Manager ``load()`` method. Now spiders are loaded in the ``__init__`` method itself.
- Changes to Scrapy Manager (now called "Crawler"):
   - ``scrapy.core.manager.ScrapyManager`` class renamed to ``scrapy.crawler.Crawler``
   - ``scrapy.core.manager.scrapymanager`` singleton moved to ``scrapy.project.crawler``
- Moved module: ``scrapy.contrib.spidermanager`` to ``scrapy.spidermanager``
- Spider Manager singleton moved from ``scrapy.spider.spiders`` to the ``spiders`` attribute of ``scrapy.project.crawler`` singleton.
- Moved Stats Collector classes: (#204)
   - ``scrapy.stats.collector.StatsCollector`` to ``scrapy.statscol.StatsCollector``
   - ``scrapy.stats.collector.SimpledbStatsCollector`` to ``scrapy.contrib.statscol.SimpledbStatsCollector``
- Default per-command settings are now specified in the ``default_settings`` attribute of the command object class (#201)
- Changed arguments of Item pipeline ``process_item()`` method from ``(spider, item)`` to ``(item, spider)`` (see the sketch after this list)
   - backward compatibility kept (with deprecation warning)
- Moved ``scrapy.core.signals`` module to ``scrapy.signals``
   - backward compatibility kept (with deprecation warning)
- Moved ``scrapy.core.exceptions`` module to ``scrapy.exceptions``
   - backward compatibility kept (with deprecation warning)
- Added ``handles_request()`` class method to ``BaseSpider``
- Dropped ``scrapy.log.exc()`` function (use ``scrapy.log.err()`` instead)
- Dropped ``component`` argument of ``scrapy.log.msg()`` function
- Dropped ``scrapy.log.log_level`` attribute
- Added ``from_settings()`` class methods to Spider Manager, and Item Pipeline Manager
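
A sketch of an item pipeline using the new argument order (the class name is
a placeholder; the old ``(spider, item)`` order keeps working with a
deprecation warning, as noted above):

.. code-block:: python

    class ExamplePipeline(object):
        def process_item(self, item, spider):
            # New in 0.10: item comes first, spider second.
            return item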

Changes to settings
~~~~~~~~~~~~~~~~~~~

- Added ``HTTPCACHE_IGNORE_SCHEMES`` setting to ignore certain schemes on HttpCacheMiddleware (#225)
- Added ``SPIDER_QUEUE_CLASS`` setting which defines the spider queue to use (#220)
- Added ``KEEP_ALIVE`` setting (#220)
- Removed ``SERVICE_QUEUE`` setting (#220)
- Removed ``COMMANDS_SETTINGS_MODULE`` setting (#201)
- Renamed ``REQUEST_HANDLERS`` to ``DOWNLOAD_HANDLERS`` and made download handlers classes (instead of functions)

Scrapy 0.9
----------

The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.

New features and improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Added SMTP-AUTH support to scrapy.mail
- New settings added: ``MAIL_USER``, ``MAIL_PASS`` (:rev:`2065` | #149); see the snippet after this list
- Added new scrapy-ctl view command - to view a URL in the browser, as seen by Scrapy (:rev:`2039`)
- Added web service for controlling Scrapy process (this also deprecates the web console) (:rev:`2053` | #167)
- Support for running Scrapy as a service, for production systems (:rev:`1988`, :rev:`2054`, :rev:`2055`, :rev:`2056`, :rev:`2057` | #168)
- Added wrapper induction library (documentation only available in source code for now). (:rev:`2011`)
- Simplified and improved response encoding support (:rev:`1961`, :rev:`1969`)
- Added ``LOG_ENCODING`` setting (:rev:`1956`, documentation available)
- Added ``RANDOMIZE_DOWNLOAD_DELAY`` setting (enabled by default) (:rev:`1923`, doc available)
- ``MailSender`` is no longer IO-blocking (:rev:`1955` | #146)
- Link extractors and the new CrawlSpider now handle relative base tag urls (:rev:`1960` | #148)
- Several improvements to Item Loaders and processors (:rev:`2022`, :rev:`2023`, :rev:`2024`, :rev:`2025`, :rev:`2026`, :rev:`2027`, :rev:`2028`, :rev:`2029`, :rev:`2030`)
- Added support for adding variables to telnet console (:rev:`2047` | #165)
- Support for requests without callbacks (:rev:`2050` | #166)
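
A sketch of the new mail settings in ``settings.py`` (the credentials are
placeholders):

.. code-block:: python

    # settings.py -- SMTP-AUTH for scrapy.mail
    MAIL_USER = 'someuser'
    MAIL_PASS = 'somepass'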

API changes
~~~~~~~~~~~

- Change ``Spider.domain_name`` to ``Spider.name`` (SEP-012, :rev:`1975`)
- ``Response.encoding`` is now the detected encoding (:rev:`1961`)
- ``HttpErrorMiddleware`` now returns None or raises an exception (:rev:`2006` | #157)
- ``scrapy.command`` modules relocation (:rev:`2035`, :rev:`2036`, :rev:`2037`)
- Added ``ExecutionQueue`` for feeding spiders to scrape (:rev:`2034`)
- Removed ``ExecutionEngine`` singleton (:rev:`2039`)
- Ported ``S3ImagesStore`` (images pipeline) to use boto and threads (:rev:`2033`)
- Moved module: ``scrapy.management.telnet`` to ``scrapy.telnet`` (:rev:`2047`)

Changes to default settings
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Changed default ``SCHEDULER_ORDER`` to ``DFO`` (:rev:`1939`)

Scrapy 0.8
----------

The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.

New features
~~~~~~~~~~~~

- Added DEFAULT_RESPONSE_ENCODING setting (:rev:`1809`)
- Added ``dont_click`` argument to ``FormRequest.from_response()`` method (:rev:`1813`, :rev:`1816`)
- Added ``clickdata`` argument to ``FormRequest.from_response()`` method (:rev:`1802`, :rev:`1803`); both illustrated after this list
- Added support for HTTP proxies (``HttpProxyMiddleware``) (:rev:`1781`, :rev:`1785`)
- Offsite spider middleware now logs messages when filtering out requests (:rev:`1841`)
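
A sketch of the two new ``FormRequest.from_response()`` arguments, inside a
spider callback (form field names and the callback are placeholders):

.. code-block:: python

    from scrapy.http import FormRequest

    def parse_login_page(self, response):
        # clickdata selects which submit control to "click" (here, by name);
        # passing dont_click=True instead submits without clicking any control.
        yield FormRequest.from_response(response,
                                        formdata={"user": "foo", "pass": "bar"},
                                        clickdata={"name": "login"},
                                        callback=self.after_login)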

Backward-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Changed ``scrapy.utils.response.get_meta_refresh()`` signature (:rev:`1804`)
- Removed deprecated ``scrapy.item.ScrapedItem`` class - use ``scrapy.item.Item`` instead (:rev:`1838`)
- Removed deprecated ``scrapy.xpath`` module - use ``scrapy.selector`` instead. (:rev:`1836`)
- Removed deprecated ``core.signals.domain_open`` signal - use ``core.signals.domain_opened`` instead (:rev:`1822`)
- ``log.msg()`` now receives a ``spider`` argument (:rev:`1822`)
   - Old domain argument has been deprecated and will be removed in 0.9. For spiders, you should always use the ``spider`` argument and pass spider references. If you really want to pass a string, use the ``component`` argument instead.
- Changed core signals ``domain_opened``, ``domain_closed``, ``domain_idle``
- Changed Item pipeline to use spiders instead of domains
   - The ``domain`` argument of the ``process_item()`` item pipeline method was changed to ``spider``; the new signature is ``process_item(spider, item)`` (:rev:`1827` | #105)
   - To quickly port your code (to work with Scrapy 0.8) just use ``spider.domain_name`` where you previously used ``domain``.
- Changed Stats API to use spiders instead of domains (:rev:`1849` | #113)
   - ``StatsCollector`` was changed to receive spider references (instead of domains) in its methods (``set_value``, ``inc_value``, etc).
   - Added ``StatsCollector.iter_spider_stats()`` method
   - Removed ``StatsCollector.list_domains()`` method
   - Also, Stats signals were renamed and now pass around spider references (instead of domains).
   - To quickly port your code (to work with Scrapy 0.8) just use ``spider.domain_name`` where you previously used ``domain``. ``spider_stats`` contains exactly the same data as ``domain_stats``.
- ``CloseDomain`` extension moved to ``scrapy.contrib.closespider.CloseSpider`` (:rev:`1833`)
   - Its settings were also renamed:
      - ``CLOSEDOMAIN_TIMEOUT`` to ``CLOSESPIDER_TIMEOUT``
      - ``CLOSEDOMAIN_ITEMCOUNT`` to ``CLOSESPIDER_ITEMCOUNT``
- Removed deprecated ``SCRAPYSETTINGS_MODULE`` environment variable - use ``SCRAPY_SETTINGS_MODULE`` instead (:rev:`1840`)
- Renamed setting: ``REQUESTS_PER_DOMAIN`` to ``CONCURRENT_REQUESTS_PER_SPIDER`` (:rev:`1830`, :rev:`1844`)
- Renamed setting: ``CONCURRENT_DOMAINS`` to ``CONCURRENT_SPIDERS`` (:rev:`1830`)
- Refactored HTTP Cache middleware: it has been heavily rewritten, retaining the same functionality except for the domain sectorization, which was removed (:rev:`1843`)
- Renamed exception: ``DontCloseDomain`` to ``DontCloseSpider`` (:rev:`1859` | #120)
- Renamed extension: ``DelayedCloseDomain`` to ``SpiderCloseDelay`` (:rev:`1861` | #121)
- Removed obsolete ``scrapy.utils.markup.remove_escape_chars`` function - use ``scrapy.utils.markup.replace_escape_chars`` instead (:rev:`1865`)

Scrapy 0.7
----------

First release of Scrapy.

.. _boto3: https://github.com/boto/boto3
.. _botocore: https://github.com/boto/botocore
.. _chunked transfer encoding: https://en.wikipedia.org/wiki/Chunked_transfer_encoding
.. _ClientForm: https://pypi.org/project/ClientForm/
.. _Creating a pull request: https://help.github.com/en/articles/creating-a-pull-request
.. _cryptography: https://cryptography.io/en/latest/
.. _docstrings: https://docs.python.org/3/glossary.html#term-docstring
.. _KeyboardInterrupt: https://docs.python.org/3/library/exceptions.html#KeyboardInterrupt
.. _LevelDB: https://github.com/google/leveldb
.. _lxml: https://lxml.de/
.. _marshal: https://docs.python.org/2/library/marshal.html
.. _parsel: https://github.com/scrapy/parsel
.. _parsel.csstranslator.GenericTranslator: https://parsel.readthedocs.io/en/latest/parsel.html#parsel.csstranslator.GenericTranslator
.. _parsel.csstranslator.HTMLTranslator: https://parsel.readthedocs.io/en/latest/parsel.html#parsel.csstranslator.HTMLTranslator
.. _parsel.csstranslator.XPathExpr: https://parsel.readthedocs.io/en/latest/parsel.html#parsel.csstranslator.XPathExpr
.. _PEP 257: https://peps.python.org/pep-0257/
.. _Pillow: https://github.com/python-pillow/Pillow
.. _pyOpenSSL: https://www.pyopenssl.org/en/stable/
.. _queuelib: https://github.com/scrapy/queuelib
.. _registered with IANA: https://www.iana.org/assignments/media-types/media-types.xhtml
.. _resource: https://docs.python.org/2/library/resource.html
.. _robots.txt: https://www.robotstxt.org/
.. _scrapely: https://github.com/scrapy/scrapely
.. _scrapy-bench: https://github.com/scrapy/scrapy-bench
.. _service_identity: https://service-identity.readthedocs.io/en/stable/
.. _six: https://six.readthedocs.io/
.. _tox: https://pypi.org/project/tox/
.. _Twisted: https://twisted.org/
.. _w3lib: https://github.com/scrapy/w3lib
.. _w3lib.encoding: https://github.com/scrapy/w3lib/blob/master/w3lib/encoding.py
.. _What is cacheable: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9.1
.. _zope.interface: https://zopeinterface.readthedocs.io/en/latest/
.. _Zsh: https://www.zsh.org/
.. _zstandard: https://pypi.org/project/zstandard/


.. _topics-contributing:

======================
Contributing to Scrapy
======================

.. important::

    Double check that you are reading the most recent version of this document
    at https://docs.scrapy.org/en/master/contributing.html

    By participating in this project you agree to abide by the terms of our
    `Code of Conduct
    <https://github.com/scrapy/scrapy/blob/master/CODE_OF_CONDUCT.md>`_. Please
    report unacceptable behavior to opensource@zyte.com.

There are many ways to contribute to Scrapy. Here are some of them:

* Report bugs and request features in the `issue tracker`_, trying to follow
  the guidelines detailed in `Reporting bugs`_ below.

* Submit patches for new functionalities and/or bug fixes. Please read
  :ref:`writing-patches` and `Submitting patches`_ below for details on how to
  write and submit a patch.

* Blog about Scrapy. Tell the world how you're using Scrapy. This will help
  newcomers with more examples and will help the Scrapy project to increase its
  visibility.

* Join the `Scrapy subreddit`_ and share your ideas on how to
  improve Scrapy. We're always open to suggestions.

* Answer Scrapy questions at
  `Stack Overflow <https://stackoverflow.com/questions/tagged/scrapy>`__.

Reporting bugs
==============

.. note::

    Please report security issues **only** to
    scrapy-security@googlegroups.com. This is a private list only open to
    trusted Scrapy developers, and its archives are not public.

Well-written bug reports are very helpful, so keep in mind the following
guidelines when reporting a new bug.

* check the :ref:`FAQ <faq>` first to see if your issue is addressed in a
  well-known question

* if you have a general question about Scrapy usage, please ask it at
  `Stack Overflow <https://stackoverflow.com/questions/tagged/scrapy>`__
  (use "scrapy" tag).

* check the `open issues`_ to see if the issue has already been reported. If it
  has, don't dismiss the report, but check the ticket history and comments. If
  you have additional useful information, please leave a comment, or consider
  :ref:`sending a pull request <writing-patches>` with a fix.

* search the `scrapy-users`_ list and `Scrapy subreddit`_ to see if it has
  been discussed there, or if you're not sure if what you're seeing is a bug.
  You can also ask in the ``#scrapy`` IRC channel.

* write **complete, reproducible, specific bug reports**. The smaller the test
  case, the better. Remember that other developers won't have your project to
  reproduce the bug, so please include all relevant files required to reproduce
  it. See for example StackOverflow's guide on creating a
  `Minimal, Complete, and Verifiable example`_ exhibiting the issue.

* the most awesome way to provide a complete reproducible example is to
  send a pull request which adds a failing test case to the
  Scrapy testing suite (see :ref:`submitting-patches`).
  This is helpful even if you don't intend to fix the issue yourself.

* include the output of ``scrapy version -v`` so developers working on your bug
  know exactly which version and platform it occurred on, which is often very
  helpful for reproducing it, or knowing if it was already fixed.

.. _Minimal, Complete, and Verifiable example: https://stackoverflow.com/help/mcve

.. _find-work:

Finding work
============

If you have decided to make a contribution to Scrapy, but you do not know what
to contribute, you have a few options to find pending work:

-   Check out the `contribution GitHub page`_, which lists open issues tagged
    as **good first issue**.

    .. _contribution GitHub page: https://github.com/scrapy/scrapy/contribute

    There are also `help wanted issues`_ but mind that some may require
    familiarity with the Scrapy code base. You can also target any other issue
    provided it is not tagged as **discuss**.

-   If you enjoy writing documentation, there are `documentation issues`_ as
    well, but mind that some may require familiarity with the Scrapy code base
    as well.

    .. _documentation issues: https://github.com/scrapy/scrapy/issues?q=is%3Aissue+is%3Aopen+label%3Adocs+

-   If you enjoy :ref:`writing automated tests <write-tests>`, you can work on
    increasing our `test coverage`_.

-   If you enjoy code cleanup, we welcome fixes for issues detected by our
    static analysis tools. See ``pyproject.toml`` for silenced issues that may
    need addressing.

    Mind that there are some issues we do not aim to address at all; those
    usually include a comment explaining the reason. Do not confuse such
    comments with ones that merely state what a non-descriptive issue code is
    about.

If you have found an issue, make sure you read the entire issue thread before
you ask questions. That includes related issues and pull requests that show up
in the issue thread when the issue is mentioned elsewhere.

We do not assign issues, and you do not need to announce that you are going to
start working on an issue either. If you want to work on an issue, just go
ahead and :ref:`write a patch for it <writing-patches>`.

Do not discard an issue simply because there is an open pull request for it.
First check whether the open pull requests are active. Even if some are, if
you think you can build a better implementation, feel free to create a pull
request with your approach.

If you decide to work on something without an open issue, please:

-   Do not create an issue to work on code coverage or code cleanup, create a
    pull request directly.

-   Do not create both an issue and a pull request right away. Either open an
    issue first to get feedback on whether the issue is worth addressing, and
    create a pull request later only if the feedback from the team is
    positive, or create only a pull request if you think a discussion will be
    easier over your code.

-   Do not add docstrings for the sake of adding docstrings, or only to address
    silenced Ruff issues. We expect docstrings to exist only when they add
    something significant for readers, such as explaining something that is not
    easy to understand just by reading the corresponding code, summarizing a
    long, hard-to-read implementation, providing context about calling code, or
    indicating purposely uncaught exceptions from called code.

-   Do not add tests that use as much mocking as possible just to touch a given
    line of code and hence improve line coverage. While we do aim to maximize
    test coverage, tests should be written for real scenarios, with minimum
    mocking. We usually prefer end-to-end tests.

.. _writing-patches:

Writing patches
===============

The better a patch is written, the higher the chances that it'll get accepted and the sooner it will be merged.

Well-written patches should:

* contain the minimum amount of code required for the specific change. Small
  patches are easier to review and merge. So, if you're doing more than one
  change (or bug fix), please consider submitting one patch per change. Do not
  collapse multiple changes into a single patch. For big changes consider using
  a patch queue.

* pass all unit-tests. See `Running tests`_ below.

* include one (or more) test cases that check the bug fixed or the new
  functionality added. See `Writing tests`_ below.

* if you're adding or changing a public (documented) API, please include
  the documentation changes in the same patch.  See `Documentation policies`_
  below.

* if you're adding a private API, please add a regular expression to the
  ``coverage_ignore_pyobjects`` variable of ``docs/conf.py`` to exclude the new
  private API from documentation coverage checks (see the sketch after this
  list).

  To see if your private API is skipped properly, generate a documentation
  coverage report as follows::

      tox -e docs-coverage

* if you are removing deprecated code, first make sure that at least 1 year
  (12 months) has passed since the release that introduced the deprecation.
  See :ref:`deprecation-policy`.
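
For an idea of what such an exclusion looks like, here is a minimal sketch
for ``docs/conf.py``, assuming a hypothetical private helper
``scrapy.utils.foo._FooHelper``:

.. code-block:: python

    # docs/conf.py
    coverage_ignore_pyobjects = [
        # Hypothetical private helper, excluded from the documentation
        # coverage report:
        r"\bscrapy\.utils\.foo\._FooHelper\b",
    ]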

.. _submitting-patches:

Submitting patches
==================

The best way to submit a patch is to issue a `pull request`_ on GitHub,
optionally creating a new issue first.

Remember to explain what was fixed or what the new functionality is (what it
is, why it's needed, etc). The more info you include, the easier it will be
for core developers to understand and accept your patch.

If your pull request aims to resolve an open issue, `link it accordingly
<https://docs.github.com/en/issues/tracking-your-work-with-issues/using-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword>`__,
e.g.:

.. code-block:: none

    Resolves #123

You can also discuss the new functionality (or bug fix) before creating the
patch, but it's always good to have a patch ready to illustrate your arguments
and show that you have put some additional thought into the subject. A good
starting point is to send a pull request on GitHub. It can be simple enough to
illustrate your idea, and leave documentation/tests for later, after the idea
has been validated and proven useful. Alternatively, you can start a
conversation in the `Scrapy subreddit`_ to discuss your idea first.

Sometimes there is an existing pull request for the problem you'd like to
solve, which is stalled for some reason. Often the pull request is headed in
the right direction, but changes were requested by Scrapy maintainers, and the
original pull request author hasn't had time to address them. In this case,
consider picking up this pull request: open a new pull request with all
commits from the original pull request, as well as additional changes to
address the raised issues. Doing so helps a lot; it is not considered rude as
long as the original author is acknowledged by keeping their commits.

You can fetch an existing pull request into a local branch
by running ``git fetch upstream pull/$PR_NUMBER/head:$BRANCH_NAME_TO_CREATE``
(replace ``upstream`` with the remote name for the Scrapy repository,
``$PR_NUMBER`` with the ID of the pull request, and ``$BRANCH_NAME_TO_CREATE``
with the name of the branch you want to create locally).
See also: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally#modifying-an-inactive-pull-request-locally.
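
For example, assuming your remote for the Scrapy repository is named
``upstream`` and you want to work on a hypothetical pull request #3000::

    git fetch upstream pull/3000/head:pr-3000
    git checkout pr-3000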

When writing GitHub pull requests, try to keep titles short but descriptive.
For example, for bug #411 ("Scrapy hangs if an exception raises in
start_requests"), prefer "Fix hanging when exception occurs in start_requests
(#411)" over "Fix for #411". Complete titles make it easy to skim through
the issue tracker.

Finally, try to keep aesthetic changes (:pep:`8` compliance, unused imports
removal, etc) in separate commits from functional changes. This will make pull
requests easier to review and more likely to get merged.

.. _coding-style:

Coding style
============

Please follow these coding conventions when writing code for inclusion in
Scrapy:

* We use `Ruff <https://docs.astral.sh/ruff/>`_ for code formatting.
  There is a hook in the pre-commit config
  that will automatically format your code before every commit. You can also
  run Ruff manually with ``tox -e pre-commit``.

* Don't put your name in the code you contribute; git provides enough
  metadata to identify the author of the code.
  See https://docs.github.com/en/get-started/git-basics/setting-your-username-in-git
  for setup instructions.

.. _scrapy-pre-commit:

Pre-commit
==========

We use `pre-commit`_ to automatically address simple code issues before every
commit.

.. _pre-commit: https://pre-commit.com/

After you create a local clone of your fork of the Scrapy repository:

#.  `Install pre-commit <https://pre-commit.com/#installation>`_.

#.  On the root of your local clone of the Scrapy repository, run the following
    command:

    .. code-block:: bash

       pre-commit install

Now pre-commit will check your changes every time you create a Git commit. Upon
finding issues, pre-commit aborts your commit, and either fixes those issues
automatically, or only reports them to you. If it fixes those issues
automatically, creating your commit again should succeed. Otherwise, you may
need to address the corresponding issues manually first.

.. _documentation-policies:

Documentation policies
======================

For reference documentation of API members (classes, methods, etc.) use
docstrings and make sure that the Sphinx documentation uses the
:mod:`~sphinx.ext.autodoc` extension to pull the docstrings. API reference
documentation should follow docstring conventions (`PEP 257`_) and be
IDE-friendly: short, to the point, possibly with short examples.

Other types of documentation, such as tutorials or topics, should be covered in
files within the ``docs/`` directory. This includes documentation that is
specific to an API member, but goes beyond API reference documentation.

In any case, if something is covered in a docstring, use the
:mod:`~sphinx.ext.autodoc` extension to pull the docstring into the
documentation instead of duplicating the docstring in files within the
``docs/`` directory.
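
For example, a docstring following these conventions might look like the
sketch below (the function is hypothetical); its reference documentation
would then be pulled into ``docs/`` through :mod:`~sphinx.ext.autodoc` rather
than duplicated there:

.. code-block:: python

    import re

    def slugify(value: str) -> str:
        """Return *value* lowercased, with runs of non-alphanumeric
        characters replaced by a single hyphen.

        >>> slugify("Scrapy Rocks!")
        'scrapy-rocks'
        """
        return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")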

Documentation updates that cover new or modified features must use Sphinx’s
:rst:dir:`versionadded` and :rst:dir:`versionchanged` directives. Use
``VERSION`` as the version; we will replace it with the actual version right
before the corresponding release. When we release a new major or minor version
of Scrapy, we remove these directives if they are older than 3 years.
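
For example, documenting a hypothetical new ``FOO_BAR`` setting added in an
upcoming release would look something like this::

    .. versionadded:: VERSION
       The ``FOO_BAR`` setting.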

Documentation about deprecated features must be removed as those features are
deprecated, so that new readers do not run into it. New deprecations and
deprecation removals are documented in the :ref:`release notes <news>`.

.. _write-tests:

Tests
=====

Tests are implemented using the :doc:`Twisted unit-testing framework
<twisted:development/test-standard>`. Running tests requires
:doc:`tox <tox:index>`.

.. _running-tests:

Running tests
-------------

To run all tests::

    tox

To run a specific test (say ``tests/test_loader.py``) use::

    tox -- tests/test_loader.py

To run the tests on a specific :doc:`tox <tox:index>` environment, use
``-e <name>`` with an environment name from ``tox.ini``. For example, to run
the tests with Python 3.10 use::

    tox -e py310

You can also specify a comma-separated list of environments, and use :ref:`tox’s
parallel mode <tox:parallel_mode>` to run the tests on multiple environments in
parallel::

    tox -e py39,py310 -p auto

To pass command-line options to :doc:`pytest <pytest:index>`, add them after
``--`` in your call to :doc:`tox <tox:index>`. Using ``--`` overrides the
default positional arguments defined in ``tox.ini``, so you must include those
default positional arguments (``scrapy tests``) after ``--`` as well::

    tox -- scrapy tests -x  # stop after first failure

You can also use the `pytest-xdist`_ plugin. For example, to run all tests on
the Python 3.10 :doc:`tox <tox:index>` environment using all your CPU cores::

    tox -e py310 -- scrapy tests -n auto

To see the coverage report, install :doc:`coverage <coverage:index>`
(``pip install coverage``) and run::

    coverage report

See the output of ``coverage --help`` for more options, such as HTML or XML
reports.

Writing tests
-------------

All functionality (including new features and bug fixes) must include a test
case to check that it works as expected, so please include tests for your
patches if you want them to get accepted sooner.

Scrapy uses unit-tests, which are located in the `tests/`_ directory.
Their module name typically resembles the full path of the module they're
testing. For example, the item loaders code is in::

    scrapy.loader

And their unit-tests are in::

    tests/test_loader.py
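
A minimal sketch of what such a test module might contain (the item and the
checked behavior are illustrative, not taken from the real test suite):

.. code-block:: python

    # tests/test_loader.py (sketch)
    import unittest

    from scrapy import Field, Item
    from scrapy.loader import ItemLoader


    class NameItem(Item):
        name = Field()


    class ItemLoaderSmokeTest(unittest.TestCase):
        def test_add_value_populates_field(self):
            loader = ItemLoader(item=NameItem())
            loader.add_value("name", "scrapy")
            # With the default (identity) output processor, collected
            # values come back as a list.
            self.assertEqual(loader.load_item()["name"], ["scrapy"])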

.. _issue tracker: https://github.com/scrapy/scrapy/issues
.. _scrapy-users: https://groups.google.com/forum/#!forum/scrapy-users
.. _Scrapy subreddit: https://www.reddit.com/r/scrapy/
.. _tests/: https://github.com/scrapy/scrapy/tree/master/tests
.. _open issues: https://github.com/scrapy/scrapy/issues
.. _PEP 257: https://peps.python.org/pep-0257/
.. _pull request: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request
.. _pytest-xdist: https://github.com/pytest-dev/pytest-xdist
.. _help wanted issues: https://github.com/scrapy/scrapy/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22
.. _test coverage: https://app.codecov.io/gh/scrapy/scrapy


.. _versioning:

============================
Versioning and API stability
============================

Versioning
==========

There are 3 numbers in a Scrapy version: *A.B.C*

* *A* is the major version. This will rarely change and will signify very
  large changes.
* *B* is the release number. This will include many changes including features
  and things that possibly break backward compatibility, although we strive to
  keep these cases at a minimum.
* *C* is the bugfix release number.

Backward-incompatibilities are explicitly mentioned in the :ref:`release notes <news>`,
and may require special attention before upgrading.

Development releases do not follow the three-number versioning scheme and are
generally released as ``dev``-suffixed versions, e.g. ``1.3dev``.

.. note::
    With Scrapy 0.* series, Scrapy used odd-numbered versions for development releases.
    This is not the case anymore from Scrapy 1.0 onwards.

    Starting with Scrapy 1.0, all releases should be considered production-ready.

For example:

* *1.1.1* is the first bugfix release of the *1.1* series (safe to use in
  production)

API stability
=============

API stability was one of the major goals for the *1.0* release.

Methods or functions that start with a single underscore (``_``) are private
and should never be relied upon as stable.

Also, keep in mind that stable doesn't mean complete: stable APIs could grow
new methods or functionality but the existing methods should keep working the
same way.

.. _deprecation-policy:

Deprecation policy
==================

We aim to maintain support for deprecated Scrapy features for at least 1 year.

For example, if a feature is deprecated in a Scrapy version released on
June 15th 2020, that feature should continue to work in versions released on
June 14th 2021 or before that.

Any new Scrapy release after a year *may* remove support for that deprecated
feature.

All deprecated features removed in a Scrapy release are explicitly mentioned in
the :ref:`release notes <news>`.
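
As an illustration of the mechanics (a sketch; the helper names and message
are hypothetical), deprecated Scrapy code typically keeps working but emits a
:class:`~scrapy.exceptions.ScrapyDeprecationWarning` until its removal window
has passed:

.. code-block:: python

    import warnings

    from scrapy.exceptions import ScrapyDeprecationWarning


    def new_helper():
        return "result"


    def old_helper():
        """Deprecated alias kept for at least the 1-year deprecation window."""
        warnings.warn(
            "old_helper() is deprecated, use new_helper() instead",
            ScrapyDeprecationWarning,
            stacklevel=2,
        )
        return new_helper()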


:orphan:

======================================
Scrapy documentation quick start guide
======================================

This file provides a quick guide on how to compile the Scrapy documentation.

Setup the environment
---------------------

To compile the documentation you need the Sphinx Python library. To install
it and all its dependencies, run the following command from this directory

::

    pip install -r requirements.txt

Compile the documentation
-------------------------

To compile the documentation (to classic HTML output) run the following
command from this directory::

    make html

Documentation will be generated (in HTML format) inside the ``build/html``
directory.

View the documentation
----------------------

To view the documentation run the following command::

    make htmlview

This command will fire up your default browser and open the main page of your
(previously generated) HTML documentation.

Start over
----------

To clean up all generated documentation files and start from scratch run::

    make clean

Keep in mind that this command won't touch any documentation source files.

Recreating documentation on the fly
-----------------------------------

To rebuild the documentation automatically whenever you make changes, install
watchdog (``pip install watchdog``) and then use::

    make watch

Alternative method using tox
----------------------------

To compile the documentation to HTML run the following command::

    tox -e docs

Documentation will be generated inside the ``docs/_build/all`` directory.

