woob.browser.browsers

class Browser(logger=None, proxy=None, responses_dirname=None, proxy_headers=None, woob=None, weboob=None, *, verify=None)[source]

Bases: object

Simple browser class. Acts like a browser, and doesn’t try to do too much.

>>> with Browser() as browser:
...     browser.open('https://example.org')
...
<Response [200]>
Parameters:
  • logger (logging.Logger) – parent logger (optional) (default: None)

  • proxy (dict) – use a proxy (dictionary with http/https as key and URI as value) (optional) (default: None)

  • responses_dirname (str) – save responses to this directory (optional) (default: None)

  • proxy_headers (dict) – headers to supply to proxy (optional) (default: None)

  • verify (None, bool or str) – either a boolean, in which case it controls whether we verify the server’s TLS certificate, or a string, in which case it must be a path to a CA bundle to use. If None, the Browser.VERIFY attribute is used. (default: None)

PROFILE: ClassVar[Profile] = <woob.browser.profiles.Firefox object>

Default profile used by browser to navigate on websites.

TIMEOUT: ClassVar[float] = 10.0

Default timeout during requests.

REFRESH_MAX: ClassVar[float] = 0.0

When handling a Refresh header, the browser considers it only if the sleep time is less than this value.

VERIFY: ClassVar[bool | str] = True

Check SSL certificates.

If this is a string, path to the certificate or the CA bundle.

Note that this value may be overridden by the verify argument of the constructor.

MAX_RETRIES: ClassVar[int] = 2

Maximum retries on failed requests.

MAX_WORKERS: ClassVar[int] = 10

Maximum number of threads for asynchronous requests.

ALLOW_REFERRER: ClassVar[bool] = True

Controls whether the Referer header is sent.

If True, the referrer is always allowed to be sent; if False, never; if None, only when the target URL is within the same domain.

HTTP_ADAPTER_CLASS

Adapter class to use.

alias of HTTPAdapter

COOKIE_POLICY: ClassVar[CookiePolicy | None] = None

Default CookieJar policy.

Example: BlockAllCookies()

classmethod asset(localfile)[source]

Absolute file path for a module local file.

Return type:

str

deinit()[source]

Deinitialisation of the browser.

Call it when you stop using the browser, unless you use it as a context manager.

Can be overridden by any subclass that needs to clean up after browser usage.

set_normalized_url(response, **kwargs)[source]

Set the normalized URL on the response.

Parameters:

response (requests.Response) – the response to change

save_response(response, warning=False, **kwargs)[source]

Save responses.

By default, it creates a HAR file and appends the request and response to it.

If WOOB_USE_OBSOLETE_RESPONSES_DIR is set to 1, it’ll create a directory and all requests will be saved in three files:

  • 0X-url-request.txt

  • 0X-url-response.txt

  • 0X-url.EXT

Information about which files are created is displayed in the logs.

Also, if WOOB_CURLIFY_REQUEST is set to 1, 0X-url-request.txt will be filled with a ready-to-use curl command based on the request.

Parameters:
  • response (requests.Response) – the response to save

  • warning (bool) – if True, display the saving logs as warnings (default: False)

set_profile(profile)[source]

Update the profile of the session.

location(url, **kwargs)[source]

Like open() but also changes the current URL and response. This is the most common method to request web pages.

Other than that, it behaves exactly like open().

Return type:

Response

open(url, *, referrer=None, allow_redirects=True, stream=None, timeout=None, verify=None, cert=None, proxies=None, data_encoding=None, is_async=False, callback=None, **kwargs)[source]
Make an HTTP request like a browser does:
  • follow redirects (unless disabled)

  • provide referrers (unless disabled)

Unless a method is explicitly provided, it makes a GET request, or a POST if data is not None. Empty data (like '' or {}, but not None) will still make a POST.

It is a wrapper around session.request(). All session.request() options are available. You should use location() or open() rather than session.request(), since they add some useful behavior, each part of which can easily be disabled through the arguments.

Call this instead of location() if you do not want to “visit” the URL (for instance, you are downloading a file).

When is_async is True, open() returns a Future object (see concurrent.futures for more details), which can be evaluated with its result() method. If any exception is raised while processing the request, it is caught and re-raised when calling result().

For example:

>>> Browser().open('https://google.com', is_async=True).result().text 
Parameters:
  • url (str | Request) – URL

  • params – (optional) Dictionary, list of tuples or bytes to send in the query string

  • data – (optional) Dictionary, list of tuples, bytes, or file-like object to send in the body

  • json – (optional) A JSON serializable Python object to send in the body

  • headers – (optional) Dictionary of HTTP Headers to send

  • cookies – (optional) Dict or CookieJar object to send

  • files – (optional) Dictionary of 'name': file-like-objects (or {'name': file-tuple}) for multipart encoding upload. file-tuple can be a 2-tuple ('filename', fileobj), 3-tuple ('filename', fileobj, 'content_type') or a 4-tuple ('filename', fileobj, 'content_type', custom_headers), where 'content-type' is a string defining the content type of the given file and custom_headers a dict-like object containing additional headers to add for the file.

  • auth – (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.

  • referrer (str or False or None) – (optional) Force referrer. False to disable sending it, None for guessing (default: None)

  • allow_redirects (bool) – (optional) if True, follow HTTP redirects (default: True)

  • stream (bool | None) – (optional) if False, the response content will be immediately downloaded. (default: None)

  • timeout (float or tuple) – (optional) How many seconds to wait for the server to send data before giving up, as a float, or a tuple. (default: None)

  • verify (str | bool | None) – (optional) Either a boolean, in which case it controls whether we verify the server’s TLS certificate, or a string, in which case it must be a path to a CA bundle to use. If not provided, uses the Browser.VERIFY class attribute value, the Browser.verify attribute one, or True. (default: None)

  • cert (str | Tuple[str, str] | None) – (optional) if String, path to ssl client cert file (.pem). If Tuple, (‘cert’, ‘key’) pair. (default: None)

  • proxies (Dict | None) – (optional) Dictionary mapping protocol to the URL of the proxy. (default: None)

  • is_async (bool) – (optional) Process request in a non-blocking way (default: False)

  • callback (callable) – (optional) Callback to be called when request has finished, with response as its first and only argument (default: None)

Returns:

requests.Response object

Return type:

requests.Response
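
The method-selection rule described for open() can be sketched in isolation. This is only an illustration of the documented rule, not woob's implementation; pick_method is a hypothetical helper name that does not exist in the library.

```python
def pick_method(method=None, data=None):
    """Hypothetical helper mirroring the documented rule for open()."""
    if method is not None:
        # An explicitly provided method always wins.
        return method
    # Only None means "no body"; '' and {} still count as data.
    return "GET" if data is None else "POST"


print(pick_method())            # no data at all
print(pick_method(data={}))     # empty dict is still data
print(pick_method(data=""))     # empty string is still data
print(pick_method("PUT", {}))   # explicit method
```

The point of the sketch is the `is None` test: a truthiness check such as `if data:` would wrongly turn an empty body into a GET.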

async_open(url, **kwargs)[source]

Shortcut to open(url, is_async=True).

Return type:

Response
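
The Future semantics mentioned for is_async can be illustrated with the standard library alone. The sketch below does not use woob; fetch is a hypothetical stand-in for a request, and the example only shows that an exception raised in the worker surfaces when result() is called.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Hypothetical stand-in for a request; fails for one URL."""
    if url.endswith("/broken"):
        raise ValueError("request failed: " + url)
    return "body of " + url


with ThreadPoolExecutor(max_workers=2) as pool:
    ok = pool.submit(fetch, "https://example.org/")
    bad = pool.submit(fetch, "https://example.org/broken")

# result() returns the value, or re-raises the stored exception.
ok_body = ok.result()
try:
    bad.result()
    error = None
except ValueError as exc:
    error = str(exc)
```

Submitting the task never raises; error handling therefore belongs around result(), not around the call that starts the request.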

raise_for_status(response)[source]

Like requests.Response.raise_for_status(), but raises more specific exception classes.

build_request(url, *, referrer=None, data_encoding=None, **kwargs)[source]

Does the same job as open(), but returns a Request without submitting it. This allows further customization to the Request.

Return type:

Request

prepare_request(req)[source]

Get a prepared request from a Request object.

This method is meant to be overridden by child classes.

Return type:

PreparedRequest

REFRESH_RE = re.compile('^(?P<sleep>[\\d\\.]+)(;\\s*url=[\\"\']?(?P<url>.*?)[\\"\']?)?$', re.IGNORECASE)
handle_refresh(response)[source]

Called by open() to handle the Refresh HTTP header.

It only redirects to the refresh URL if the sleep time is less than REFRESH_MAX.

Return type:

Response
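
As a rough illustration of how a Refresh header value could be checked against REFRESH_MAX, the sketch below reuses the REFRESH_RE pattern shown above. parse_refresh is a hypothetical helper, not part of woob, and the actual handle_refresh() may behave differently in edge cases.

```python
import re

# Same pattern as the REFRESH_RE attribute shown above.
REFRESH_RE = re.compile(
    r'^(?P<sleep>[\d\.]+)(;\s*url=[\"\']?(?P<url>.*?)[\"\']?)?$',
    re.IGNORECASE,
)


def parse_refresh(header, refresh_max):
    """Hypothetical helper: return the URL to follow, or None."""
    m = REFRESH_RE.match(header.strip())
    if m is None or m.group("url") is None:
        return None
    try:
        sleep = float(m.group("sleep"))
    except ValueError:
        # The pattern also accepts strings such as "1.2.3".
        return None
    # Follow the refresh only if the sleep time is below the limit.
    return m.group("url") if sleep < refresh_max else None
```

With the default REFRESH_MAX of 0.0 no sleep time is below the limit, so under this reading a subclass has to raise the value for any Refresh header to be followed.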

get_referrer(oldurl, newurl)[source]

Get the referrer to send when doing a request. If we should not send a referrer, it will return None.

Reference: https://en.wikipedia.org/wiki/HTTP_referer

The behavior can be controlled through the ALLOW_REFERRER attribute. True always allows the referers to be sent, False never, and None only if it is within the same domain.

Parameters:
  • oldurl (str or None) – Current absolute URL

  • newurl (str) – Target absolute URL

Return type:

str or None
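
The three ALLOW_REFERRER settings can be sketched as follows. This is an assumption-laden illustration rather than woob's code: pick_referrer is a hypothetical function, and "same domain" is approximated here by comparing the network location of both URLs.

```python
from urllib.parse import urlparse


def pick_referrer(oldurl, newurl, allow_referrer=True):
    """Hypothetical sketch of the documented ALLOW_REFERRER policy."""
    if oldurl is None or allow_referrer is False:
        # No current URL, or referrers disabled entirely.
        return None
    if allow_referrer is True:
        return oldurl
    # allow_referrer is None: send it only within the same domain
    # (assumption: "same domain" means identical netloc).
    if urlparse(oldurl).netloc == urlparse(newurl).netloc:
        return oldurl
    return None
```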

export_session()[source]

Export session into a dict.

Default format is:

{
    'url': last_url,
    'cookies': cookies_dict
}

You should store it as is.

Return type:

dict

exception UrlNotAllowed[source]

Bases: Exception

Raised by DomainBrowser when RESTRICT_URL is set and the browser tries to go to a URL that does not match BASEURL.

class DomainBrowser(baseurl=None, *args, **kwargs)[source]

Bases: Browser

A browser that handles relative URLs and can have a base URL (usually a domain).

For instance self.location('/hello') will get https://woob.tech/hello if BASEURL is 'https://woob.tech/'.

>>> class ExampleBrowser(DomainBrowser):
...     BASEURL = 'https://example.org'
...
>>> with ExampleBrowser() as browser:
...     browser.open('/')
...
<Response [200]>
RESTRICT_URL: ClassVar[bool | List[str]] = False

URLs allowed to load. This can be used to force SSL (if the BASEURL uses SSL) or to prevent any other leakage. Set to True to allow only URLs starting with the BASEURL. Set it to a list of allowed URLs if you have multiple allowed URLs. More complex behavior is possible by overriding url_allowed().

BASEURL: str | None = None

Base URL, e.g. 'https://woob.tech/'.

See absurl().

url_allowed(url)[source]

Checks if we are allowed to visit a URL. See RESTRICT_URL.

Parameters:

url (str) – Absolute URL

Return type:

bool
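
One way to read the RESTRICT_URL rules is sketched below. It is not the library's implementation: the standalone function and its parameters are hypothetical, and treating each list entry as a prefix (rather than an exact URL) is an assumption.

```python
def url_allowed(url, baseurl, restrict_url=False):
    """Hypothetical sketch of the RESTRICT_URL rule described above."""
    if restrict_url is False:
        # No restriction at all.
        return True
    if restrict_url is True:
        # Only URLs starting with the base URL are allowed.
        return url.startswith(baseurl)
    # Otherwise: a list of allowed URLs, matched here as prefixes
    # (assumption).
    return any(url.startswith(prefix) for prefix in restrict_url)
```

With a base URL of 'https://woob.tech/' and the restriction enabled, 'http://woob.tech/hello' is rejected, which is how the setting can be used to force SSL.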

absurl(uri, base=None)[source]

Get the absolute URL, relative to a base URL. If base is None, it will try to use the current URL. If there is no current URL, it will try to use BASEURL.

If base is False, it will always try to use the current URL. If base is True, it will always try to use BASEURL.

Parameters:
  • uri (str) – URI to make absolute. It can be already absolute.

  • base (str or None or False or True) – Base absolute URL. (default: None)

Return type:

str
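
The base-selection rules of absurl() can be approximated with urllib.parse.urljoin. In this sketch the current URL and BASEURL are passed as explicit, hypothetical parameters (current_url, baseurl), and returning the URI unchanged when no base is available is an assumption.

```python
from urllib.parse import urljoin


def absurl(uri, base=None, current_url=None, baseurl=None):
    """Hypothetical sketch of the base-selection rules described above."""
    if base is None:
        # Prefer the current URL, fall back to the base URL.
        base = current_url or baseurl
    elif base is False:
        base = current_url
    elif base is True:
        base = baseurl
    if base is None:
        # Nothing to resolve against (assumption: return as is).
        return uri
    return urljoin(base, uri)


print(absurl("/hello", baseurl="https://woob.tech/"))
```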

open(url, *args, **kwargs)[source]

Like Browser.open() but handles URLs without a domain, using the BASEURL attribute.

Return type:

Response

go_home()[source]

Go to the “home” page, usually the BASEURL.

Return type:

Response

class PagesBrowser(*args, **kwargs)[source]

Bases: DomainBrowser

A browser which works with pages and keeps the state of navigation.

To use it, you have to subclass it and create URL objects as class attributes. When open() or location() is called, if the URL matches one of the URL objects, it returns a Page object. In the case of location(), it stores it in self.page.

Example:

>>> import re
>>> from .pages import HTMLPage
>>> class ListPage(HTMLPage):
...     def get_items(self):
...         for link in self.doc.xpath(r'//a[matches(@href, "list-\d+.html")]/@href'):
...             yield re.match(r'list-(\d+).html', link).group(1)
...
>>> class ItemPage(HTMLPage):
...     def iter_values(self):
...         for el in self.doc.xpath('//li'):
...             yield el.text
...
>>> class MyBrowser(PagesBrowser):
...     BASEURL = 'https://woob.tech/tests/'
...     list = URL(r'$', ListPage)
...     item = URL(r'list-(?P<id>\d+)\.html', ItemPage)
...
>>> b = MyBrowser()
>>> b.list.go()
<woob.browser.browsers.ListPage object at 0x...>
>>> b.page.url
'https://woob.tech/tests/'
>>> list(b.page.get_items())
['1', '2']
>>> b.item.build(id=42)
'https://woob.tech/tests/list-42.html'
>>> b.item.go(id=1)
<woob.browser.browsers.ItemPage object at 0x...>
>>> list(b.page.iter_values())
['One', 'Two']
open(*args, **kwargs)[source]

Same method as open(), but the response contains an attribute page if the URL matches any URL object.

Return type:

Response

location(*args, **kwargs)[source]

Same method as location(), but if the URL matches any URL object, an attribute page is added to the response, and the attribute page is set on the browser.

Return type:

Response

pagination(func, *args, **kwargs)[source]

This helper function can be used to handle pagination pages easily.

When the called function raises a NextPage exception, it goes to the requested page and calls the function again.

The NextPage constructor can take a URL or a Request object.

>>> from .pages import HTMLPage
>>> class Page(HTMLPage):
...     def iter_values(self):
...         for el in self.doc.xpath('//li'):
...             yield el.text
...         for next in self.doc.xpath('//a'):
...             raise NextPage(next.attrib['href'])
...
>>> class Browser(PagesBrowser):
...     BASEURL = 'https://woob.tech'
...     list = URL(r'/tests/list-(?P<pagenum>\d+).html', Page)
...
>>> b = Browser()
>>> b.list.go(pagenum=1) 
<woob.browser.browsers.Page object at 0x...>
>>> list(b.pagination(lambda: b.page.iter_values()))
['One', 'Two', 'Three', 'Four']
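
The control flow behind pagination() can be sketched without any network access. Everything below is a stand-in: the NextPage class, the in-memory PAGES table and the simplified pagination(func, start) signature are assumptions made for illustration, not the library's definitions.

```python
class NextPage(Exception):
    """Stand-in for the exception named above; carries the next page key."""

    def __init__(self, request):
        super().__init__(request)
        self.request = request


# Hypothetical in-memory "site": page key -> (items, next page key or None).
PAGES = {1: (["One", "Two"], 2), 2: (["Three", "Four"], None)}


def iter_values(pagenum):
    items, nxt = PAGES[pagenum]
    yield from items
    if nxt is not None:
        # Ask the caller to move to the next page.
        raise NextPage(nxt)


def pagination(func, start):
    """Call func, and call it again on each page requested via NextPage."""
    current = start
    while True:
        try:
            yield from func(current)
        except NextPage as exc:
            current = exc.request  # "go to" the requested page, then retry
        else:
            return


values = list(pagination(iter_values, 1))
```

Because the exception is raised only after the page's own items have been yielded, the caller sees one continuous stream of values across pages.
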
need_login(func)[source]

Decorator used to require being logged in to access this function.

This decorator can be used on any method whose first argument is a browser (typically a LoginBrowser). It checks for the logged attribute in the current browser’s page: when this attribute is set to True (e.g., when the page inherits LoggedPage), then nothing special happens.

In all other cases (when the browser isn’t on any defined page or when the page’s logged attribute is False), the LoginBrowser.do_login() method of the browser is called before calling func.
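
The behavior described for need_login can be approximated by a small decorator. This is a sketch under stated assumptions, not woob's implementation: FakePage and FakeBrowser are hypothetical stand-ins for a page class and a LoginBrowser subclass.

```python
from functools import wraps


def need_login(func):
    """Sketch: log in first unless the current page is marked as logged."""
    @wraps(func)
    def inner(browser, *args, **kwargs):
        page = getattr(browser, "page", None)
        if page is None or not getattr(page, "logged", False):
            browser.do_login()
        return func(browser, *args, **kwargs)
    return inner


class FakePage:
    logged = False


class FakeBrowser:
    """Hypothetical stand-in for a LoginBrowser subclass."""

    def __init__(self):
        self.page = None
        self.login_calls = 0

    def do_login(self):
        self.login_calls += 1
        self.page = FakePage()
        self.page.logged = True

    @need_login
    def get_accounts(self):
        return ["account-1"]


b = FakeBrowser()
first = b.get_accounts()   # no page yet: do_login() runs first
second = b.get_accounts()  # page is logged: no second login
```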

class LoginBrowser(username, password, *args, **kwargs)[source]

Bases: PagesBrowser

A browser which supports login.

do_login()[source]

Abstract method to implement to log in to the website.

It is called when a login is needed.

do_logout()[source]

Log out from the website.

By default, simply clears the cookies.

class StatesMixin[source]

Bases: object

Mixin to store states of browser.

It saves and loads a state dict object. By default it contains the current URL and cookies, but this may be overridden by the subclass to store its own data.

STATE_DURATION: ClassVar[int | float | None] = None

In minutes, used to set an expiration datetime object of the state.

locate_browser(state)[source]

From the state object, go to the saved URL.

load_state(state)[source]

Supply a state object and load it.

get_expire()[source]

Get expiration of the state object, using the STATE_DURATION class attribute.

Return type:

str | None
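
A possible reading of get_expire() is sketched below: STATE_DURATION minutes are added to the current time and the result is returned as a string. The standalone function, its now parameter and the exact string format are assumptions; the library may format the value differently.

```python
from datetime import datetime, timedelta


def get_expire(state_duration, now=None):
    """Hypothetical sketch: expiration string, or None if no duration."""
    if state_duration is None:
        return None
    now = now or datetime.now()
    # STATE_DURATION is expressed in minutes.
    return str(now + timedelta(minutes=state_duration))


print(get_expire(30, now=datetime(2024, 1, 1, 12, 0)))
```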

dump_state()[source]

Dump the current state in a state object.

Can be overridden by the browser subclass.

Return type:

dict

class APIBrowser(baseurl=None, *args, **kwargs)[source]

Bases: DomainBrowser

A browser for API websites.

build_request(*args, **kwargs)[source]

Does the same job as open(), but returns a Request without submitting it. This allows further customization to the Request.

Return type:

Request

open(*args, **kwargs)[source]

Do a JSON request.

The “Content-Type” header is always set to “application/json”.

Parameters:
  • data (dict) – if specified, format as JSON and send as request body

  • headers (dict) – if specified, add these headers to the request

Return type:

Response

request(*args, **kwargs)[source]

Do a JSON request and parse the response.

Returns:

a dict containing the parsed JSON server response

Return type:

dict

exception AbstractBrowserMissingParentError[source]

Bases: Exception

Deprecated since version 3.4: Don’t use this class, import woob_modules.other_module.etc instead

class MetaBrowser(name, bases, dct)[source]

Bases: type

Deprecated since version 3.4: Don’t use this class, import woob_modules.other_module.etc instead

class AbstractBrowser[source]

Bases: object

Deprecated since version 3.4: Don’t use this class, import woob_modules.other_module.etc instead

class OAuth2Mixin(*args, **kwargs)[source]

Bases: StatesMixin

AUTHORIZATION_URI: ClassVar[str, None] = None

OAuth2 Authorization URI.

ACCESS_TOKEN_URI: ClassVar[str, None] = None

OAuth2 route to exchange a code for an access_token.

SCOPE: ClassVar[str] = ''

OAuth2 scope.

client_id: str | None = None
client_secret: str | None = None
redirect_uri: str | None = None
access_token: str | None = None
access_token_expire: datetime | None = None
auth_uri: str | None = None
token_type: str | None = None
refresh_token: str | None = None
oauth_state: str | None = None
authorized_date: str | None = None
callback_error_description = ('operation canceled by the client', 'login cancelled', 'consent denied', 'psu cancelled the transaction')
build_request(*args, **kwargs)[source]
Return type:

Request

dump_state()[source]

Dump the current state in a state object.

Can be overridden by the browser subclass.

Return type:

dict

load_state(state)[source]

Supply a state object and load it.

raise_for_status(response)[source]
property logged: bool
do_login()[source]
build_authorization_parameters()[source]
Return type:

dict

build_authorization_uri()[source]
Return type:

str

request_authorization()[source]
handle_callback_error(values)[source]
build_access_token_parameters(values)[source]
Return type:

dict

do_token_request(data)[source]
request_access_token(auth_uri)[source]
build_refresh_token_parameters()[source]
Return type:

dict

use_refresh_token()[source]
update_token(auth_response)[source]
class OAuth2PKCEMixin(*args, **kwargs)[source]

Bases: OAuth2Mixin

code_verifier(bytes_number=64)[source]
Return type:

str

code_challenge(verifier)[source]
Return type:

str
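
The two methods above are undocumented here; the sketch below shows how a PKCE verifier and challenge are commonly derived with the S256 method of RFC 7636. It is an assumption that the mixin follows this scheme, and these standalone functions are not the library's code.

```python
import base64
import hashlib
import os


def code_verifier(bytes_number=64):
    """URL-safe base64 of random bytes, without '=' padding."""
    raw = os.urandom(bytes_number)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def code_challenge(verifier):
    """S256 challenge: URL-safe base64 of SHA-256(verifier), unpadded."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


verifier = code_verifier()
challenge = code_challenge(verifier)
```

The challenge is sent with the authorization request and the verifier with the token request, so the server can check that both come from the same client.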

build_authorization_parameters()[source]
Return type:

dict

build_access_token_parameters(values)[source]
Return type:

dict

class DigestMixin[source]

Bases: object

Browser mixin to add a Digest header compliant with RFC 3230 section 4.3.2.

HTTP_DIGEST_ALGORITHM: str = 'SHA-256'

Digest algorithm used to obtain a hash of the request content.

The only supported digest algorithm for now is ‘SHA-256’.

HTTP_DIGEST_METHODS: tuple[str, ...] | None = ('GET', 'POST', 'PUT', 'DELETE')

The list of HTTP methods on which to add a Digest header.

To add the Digest header to all methods, set this constant to None.

HTTP_DIGEST_COMPACT_JSON: bool = False

If the content type of the request payload is JSON, compact it first.

compute_digest_header(body)[source]

Compute the value of the Digest header.

Parameters:

body (bytes) – The body to compute with.

Return type:

str

Returns:

The computed Digest header value.
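
For the 'SHA-256' algorithm, a Digest header value is conventionally the algorithm name followed by the base64-encoded hash of the body. The sketch below assumes that format; it is a standalone function, not the mixin's actual method.

```python
import base64
import hashlib


def compute_digest_header(body):
    """Sketch: 'SHA-256=' followed by the base64 of the body's SHA-256."""
    digest = hashlib.sha256(body).digest()
    return "SHA-256=" + base64.b64encode(digest).decode("ascii")


print(compute_digest_header(b'{"hello": "world"}'))
```

Note that the hash is computed over the exact bytes sent, which is why compacting a JSON payload (see HTTP_DIGEST_COMPACT_JSON) has to happen before the digest is computed.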

add_digest_header(preq)[source]

Add the Digest header to the prepared request.

The Digest header presence depends on the request:

  • If the request has a HTTP_DIGEST_INCLUDE header, the Digest header is added.

  • Otherwise, if the request has a HTTP_DIGEST_EXCLUDE header, the Digest header is not added.

  • Otherwise, if HTTP_DIGEST_METHODS is an HTTP method list and the request method is not in said list, the Digest header is not added.

  • Otherwise, the Digest header is added.

Note that the HTTP_DIGEST_INCLUDE and HTTP_DIGEST_EXCLUDE headers are removed from the request before sending it.

Parameters:

preq (PreparedRequest) – The prepared request on which the Digest header is added.

Return type:

None

class MyBrowser(DigestMixin, Browser):
    HTTP_DIGEST_METHODS = ('POST', 'PUT', 'DELETE')

my_browser = MyBrowser()
my_browser.open('https://example.org/')
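
The precedence rules listed for add_digest_header() can be written as a small decision function. This is a hedged sketch: should_add_digest is a hypothetical helper, and the literal strings 'HTTP_DIGEST_INCLUDE' and 'HTTP_DIGEST_EXCLUDE' are used as placeholder header names, since the real header names are not given here.

```python
def should_add_digest(method, headers, digest_methods):
    """Hypothetical helper following the documented precedence.

    `headers` is mutated: both marker headers are always removed.
    """
    include = headers.pop("HTTP_DIGEST_INCLUDE", None) is not None
    exclude = headers.pop("HTTP_DIGEST_EXCLUDE", None) is not None
    if include:
        return True
    if exclude:
        return False
    if digest_methods is not None and method not in digest_methods:
        return False
    return True
```

Under this reading, a per-request marker header overrides the class-level method list in both directions.
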
prepare_request(*args, **kwargs)[source]

Get the prepared request with a Digest header.

Return type:

PreparedRequest