woob.browser.browsers

class Browser(*args, **kwargs)[source]

Bases: object

Simple browser class. It acts like a browser and does not try to do too much.

PROFILE = <woob.browser.profiles.Firefox object>

Default profile used by browser to navigate on websites.

TIMEOUT = 10.0

Default timeout during requests.

REFRESH_MAX = 0.0

When handling a Refresh header, the browser follows it only if the sleep time is less than this value.

MAX_RETRIES = 2

Maximum retries on failed requests.

MAX_WORKERS = 10

Maximum number of threads for asynchronous requests.

ALLOW_REFERRER = True

Controls the behavior of get_referrer.

HTTP_ADAPTER_CLASS

Adapter class to use.

alias of HTTPAdapter

COOKIE_POLICY = None

Default CookieJar policy. Example: woob.browser.cookies.BlockAllCookies()

classmethod asset(localfile)[source]

Absolute file path for a module local file.

VERIFY = True

Check SSL certificates.

deinit()[source]
set_normalized_url(response, **kwargs)[source]
save_response(response, warning=False, **kwargs)[source]
set_profile(profile)[source]
location(url, **kwargs)[source]

Like open() but also changes the current URL and response. This is the most common method to request web pages.

Other than that, it has exactly the same behavior as open().

open(url, *, referrer=None, allow_redirects=True, stream=None, timeout=None, verify=None, cert=None, proxies=None, data_encoding=None, is_async=False, callback=<function Browser.<lambda>>, **kwargs)[source]
Make an HTTP request like a browser does:
  • follow redirects (unless disabled)

  • provide referrers (unless disabled)

Unless a method is explicitly provided, it makes a GET request, or a POST if data is not None. Empty data (not None, like ‘’ or {}) will still make a POST.
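The method-selection rule above can be sketched in a few lines. This is only an illustration of the documented rule, not woob's code; `pick_method` is a hypothetical helper name.

```python
def pick_method(method=None, data=None):
    """Return the HTTP verb for a request, following the rule described above."""
    if method is not None:
        # An explicitly provided method always wins.
        return method
    # The test is `data is not None`, so '' and {} still select POST.
    return "GET" if data is None else "POST"


print(pick_method())                # GET
print(pick_method(data=""))         # POST
print(pick_method(data={}))         # POST
print(pick_method("PUT", data={}))  # PUT
```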

It is a wrapper around session.request(), and all session.request() options are available. Prefer location() or open() over session.request(): they add some useful behavior, each part of which can be disabled individually through the arguments.

Call this instead of location() if you do not want to “visit” the URL (for instance, you are downloading a file).

When is_async is True, open() returns a Future object (see concurrent.futures for more details), which can be evaluated with its result() method. If any exception is raised while processing the request, it is caught and re-raised when calling result().

For example:

>>> Browser().open('http://google.com', is_async=True).result().text 
Parameters
  • url (str or dict or None) – URL

  • data – POST data

  • referrer (str or False or None) – Force referrer. False to disable sending it, None for guessing

  • is_async (bool) – Process request in a non-blocking way

  • callback (function) – Callback to be called when request has finished, with response as its first and only argument

Return type

requests.Response
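The Future semantics mentioned for is_async come from the standard concurrent.futures module. The sketch below uses only the standard library to show that an exception raised in the worker is stored and re-raised by result(). `fake_fetch` is a hypothetical stand-in for a network request and is not part of woob.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 10  # same value as the MAX_WORKERS attribute documented above


def fake_fetch(url):
    """Hypothetical stand-in for a network request; not part of woob."""
    if not url.startswith("http"):
        raise ValueError(f"unsupported URL: {url}")
    return f"body of {url}"


with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    ok = pool.submit(fake_fetch, "http://example.org/")
    bad = pool.submit(fake_fetch, "ftp://example.org/")

    # result() blocks until the job is done and returns its value.
    print(ok.result())

    # The exception raised in the worker thread surfaces here, at result().
    try:
        bad.result()
    except ValueError as exc:
        print("re-raised:", exc)
```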

async_open(url, **kwargs)[source]

Shortcut to open(url, is_async=True).

raise_for_status(response)[source]

Like Response.raise_for_status, but may raise other exception classes if needed.

build_request(url, *, referrer=None, data_encoding=None, **kwargs)[source]

Does the same job as open(), but returns a Request without submitting it. This allows further customization to the Request.

prepare_request(req)[source]

Get a prepared request from a Request object.

This method aims to be overloaded by children classes.

REFRESH_RE = re.compile('^(?P<sleep>[\\d\\.]+)(;\\s*url=[\\"\']?(?P<url>.*?)[\\"\']?)?$', re.IGNORECASE)
handle_refresh(response)[source]

Called by open() to handle the Refresh HTTP header.

It only redirects to the refresh URL if the sleep time is less than REFRESH_MAX.
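As a rough illustration, the REFRESH_RE pattern above can be exercised on its own. `parse_refresh` and `should_follow` are hypothetical helper names, and the strict `<` comparison is an assumption taken from the wording "less than"; the real handle_refresh() may treat the boundary differently.

```python
import re

# Same pattern as REFRESH_RE above.
REFRESH_RE = re.compile(
    r'^(?P<sleep>[\d\.]+)(;\s*url=[\"\']?(?P<url>.*?)[\"\']?)?$',
    re.IGNORECASE,
)


def parse_refresh(header):
    """Return (sleep_seconds, url_or_None), or None if the header is malformed."""
    m = REFRESH_RE.match(header.strip())
    if m is None:
        return None
    try:
        sleep = float(m.group("sleep"))
    except ValueError:
        # The pattern accepts strings such as "1.2.3", which are not numbers.
        return None
    return sleep, m.group("url")


def should_follow(header, refresh_max):
    """True if the sleep time is below refresh_max (strict '<' is assumed)."""
    parsed = parse_refresh(header)
    if parsed is None:
        return False
    sleep, _url = parsed
    return sleep < refresh_max


print(parse_refresh("5; url=http://example.org/next"))
print(should_follow("5; url=http://example.org/next", refresh_max=10.0))
```

When the header carries no url part, the second element is None; which URL is then reloaded is left out of this sketch.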

get_referrer(oldurl, newurl)[source]

Get the referrer to send when doing a request. If we should not send a referrer, it will return None.

Reference: https://en.wikipedia.org/wiki/HTTP_referer

The behavior can be controlled through the ALLOW_REFERRER attribute. True always allows referrers to be sent, False never does, and None allows them only when the target is within the same domain.

Parameters
  • oldurl (str or None) – Current absolute URL

  • newurl (str) – Target absolute URL

Return type

str or None
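The three ALLOW_REFERRER settings can be sketched as follows. `pick_referrer` is a hypothetical function, and "same domain" is approximated here by comparing hostnames, which is an assumption; the real get_referrer() may apply further rules.

```python
from urllib.parse import urlsplit


def pick_referrer(oldurl, newurl, allow_referrer=True):
    """Return the referrer to send, or None when none should be sent."""
    if oldurl is None or allow_referrer is False:
        # No current page, or referrers are disabled entirely.
        return None
    if allow_referrer is True:
        return oldurl
    # allow_referrer is None: send only within the same domain
    # (approximated by an exact hostname comparison; an assumption).
    if urlsplit(oldurl).hostname == urlsplit(newurl).hostname:
        return oldurl
    return None


print(pick_referrer("http://woob.tech/a", "http://example.org/", allow_referrer=None))
print(pick_referrer("http://woob.tech/a", "http://woob.tech/b", allow_referrer=None))
```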

export_session()[source]
exception UrlNotAllowed[source]

Bases: Exception

Raised by DomainBrowser when RESTRICT_URL is set and the browser tries to go to a URL that does not match BASEURL.

class DomainBrowser(*args, **kwargs)[source]

Bases: Browser

A browser that handles relative URLs and can have a base URL (usually a domain).

For instance self.location(‘/hello’) will get http://woob.tech/hello if BASEURL is ‘http://woob.tech/’.

RESTRICT_URL = False

URLs allowed to load. This can be used to force SSL (if the BASEURL is SSL) or to prevent any other leakage. Set to True to allow only URLs starting with the BASEURL. Set it to a list of allowed URLs if you have multiple allowed URLs. More complex behavior is possible by overloading url_allowed().

BASEURL = None

Base URL, e.g. ‘http://woob.tech/’ or ‘https://woob.tech/’. See absurl().

url_allowed(url)[source]

Checks if we are allowed to visit a URL. See RESTRICT_URL.

Parameters

url (str) – Absolute URL

Return type

bool
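A minimal sketch of the RESTRICT_URL check described above, written as a standalone function. It assumes that BASEURL is set and that entries of a RESTRICT_URL list are matched as prefixes; neither detail is stated here, so treat both as assumptions.

```python
def url_allowed(url, baseurl, restrict_url=False):
    """Illustrative version of the RESTRICT_URL rule; not woob's implementation."""
    if restrict_url is False:
        # No restriction: every URL may be loaded.
        return True
    if restrict_url is True:
        # Only URLs starting with BASEURL are allowed.
        return url.startswith(baseurl)
    # Otherwise restrict_url is a list of allowed URLs
    # (prefix matching is an assumption of this sketch).
    return any(url.startswith(prefix) for prefix in restrict_url)


BASEURL = "https://woob.tech/"
print(url_allowed("http://woob.tech/hello", BASEURL, restrict_url=True))   # False
print(url_allowed("https://woob.tech/hello", BASEURL, restrict_url=True))  # True
```

With an HTTPS BASEURL and RESTRICT_URL set to True, the plain HTTP variant of the same page is rejected, which is the "force SSL" use mentioned above.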

absurl(uri, base=None)[source]

Get the absolute URL, relative to a base URL. If base is None, it will try to use the current URL. If there is no current URL, it will try to use BASEURL.

If base is False, it will always try to use the current URL. If base is True, it will always try to use BASEURL.

Parameters
  • uri (str) – URI to make absolute. It can be already absolute.

  • base (str or None or False or True) – Base absolute URL.

Return type

str
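The base-selection rules of absurl() can be sketched with urllib.parse.urljoin. The `current_url` and `baseurl` parameters are hypothetical stand-ins for the browser's current URL and its BASEURL attribute, and returning the URI unchanged when no base is available is an assumption of this sketch.

```python
from urllib.parse import urljoin


def absurl(uri, base=None, current_url=None, baseurl=None):
    """Illustrative version of the base-selection rules; not woob's code."""
    if base is None:
        # Prefer the current URL, fall back to BASEURL.
        base = current_url if current_url is not None else baseurl
    elif base is False:
        base = current_url
    elif base is True:
        base = baseurl
    if base is None:
        # Nothing to resolve against (assumption: return the URI as given).
        return uri
    return urljoin(base, uri)


print(absurl("/hello", baseurl="http://woob.tech/"))
# http://woob.tech/hello
```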

open(req, *args, **kwargs)[source]

Like Browser.open() but handles URLs without domains, using the BASEURL attribute.

go_home()[source]

Go to the “home” page, usually the BASEURL.

class PagesBrowser(*args, **kwargs)[source]

Bases: DomainBrowser

A browser which works with pages and keeps the navigation state.

To use it, you have to derive from it and create URL objects as class attributes. When open() or location() is called, if the URL matches one of the URL objects, it returns a Page object. In the case of location(), it also stores the page in self.page.

Example:

>>> from .pages import HTMLPage
>>> class ListPage(HTMLPage):
...     def get_items(self):
...         return [el.attrib['id'] for el in self.doc.xpath('//div[@id="items"]/div')]
...
>>> class ItemPage(HTMLPage):
...     pass
...
>>> class MyBrowser(PagesBrowser):
...     BASEURL = 'http://example.org/'
...     list = URL('list-items', ListPage)
...     item = URL(r'item/view/(?P<id>\d+)', ItemPage)
...
>>> MyBrowser().list.stay_or_go().get_items() 
>>> bool(MyBrowser().list.match('http://example.org/list-items'))
True
>>> bool(MyBrowser().list.match('http://example.org/'))
False
>>> str(MyBrowser().item.build(id=42))
'http://example.org/item/view/42'

You can then use URL instances to go on pages.

open(*args, **kwargs)[source]

Same method as woob.browser.browsers.DomainBrowser.open(), but the response contains an attribute page if the URL matches any URL object.

location(*args, **kwargs)[source]

Same method as woob.browser.browsers.Browser.location(), but if the URL matches any URL object, an attribute page is added to the response, and the attribute PagesBrowser.page is set.

pagination(func, *args, **kwargs)[source]

This helper function can be used to handle paginated pages easily.

When the called function raises a NextPage exception, it goes to the requested page and calls the function again.

The NextPage constructor can take a URL or a Request object.

>>> from .pages import HTMLPage
>>> class Page(HTMLPage):
...     def iter_values(self):
...         for el in self.doc.xpath('//li'):
...             yield el.text
...         for next in self.doc.xpath('//a'):
...             raise NextPage(next.attrib['href'])
...
>>> class Browser(PagesBrowser):
...     BASEURL = 'https://woob.tech'
...     list = URL(r'/tests/list-(?P<pagenum>\d+).html', Page)
...
>>> b = Browser()
>>> b.list.go(pagenum=1) 
<woob.browser.browsers.Page object at 0x...>
>>> list(b.pagination(lambda: b.page.iter_values()))
['One', 'Two', 'Three', 'Four']
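The retry loop behind pagination() can be illustrated without any network access. Everything below is a hypothetical stand-in (the `NextPage` class, the `PAGES` table, and the `go_to` callback); it only mimics the documented behavior of catching NextPage, moving to the next page, and calling the function again.

```python
class NextPage(Exception):
    """Hypothetical stand-in for woob's NextPage; carries the next location."""

    def __init__(self, request):
        super().__init__(request)
        self.request = request


def pagination(func, go_to):
    """Yield items from func(), following NextPage until none is raised."""
    while True:
        try:
            yield from func()
        except NextPage as nxt:
            # Move to the requested page, then call func() again.
            go_to(nxt.request)
        else:
            return


# Fake two-page site: page name -> (values on the page, name of the next page).
PAGES = {"p1": (["One", "Two"], "p2"), "p2": (["Three", "Four"], None)}
state = {"current": "p1"}


def iter_values():
    values, next_page = PAGES[state["current"]]
    yield from values
    if next_page is not None:
        raise NextPage(next_page)


def go_to(name):
    state["current"] = name


print(list(pagination(iter_values, go_to)))
# ['One', 'Two', 'Three', 'Four']
```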
need_login(func)[source]

Decorator that requires the browser to be logged in before the decorated function is called.

This decorator can be used on any method whose first argument is a browser (typically a LoginBrowser). It checks for the logged attribute in the current browser’s page: when this attribute is set to True (e.g., when the page inherits LoggedPage), then nothing special happens.

In all other cases (when the browser isn’t on any defined page or when the page’s logged attribute is False), the LoginBrowser.do_login() method of the browser is called before calling func.
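A decorator with the behavior described above could look roughly like this. It is a sketch under stated assumptions, not woob's implementation; `FakePage` and `FakeBrowser` are hypothetical classes used only to exercise it.

```python
from functools import wraps


def need_login(func):
    """Call browser.do_login() first unless the current page is a logged one."""

    @wraps(func)
    def wrapper(browser, *args, **kwargs):
        page = getattr(browser, "page", None)
        # No page at all, or a page whose `logged` attribute is not True.
        if page is None or not getattr(page, "logged", False):
            browser.do_login()
        return func(browser, *args, **kwargs)

    return wrapper


class FakePage:
    def __init__(self, logged):
        self.logged = logged


class FakeBrowser:
    def __init__(self):
        self.page = None
        self.login_calls = 0

    def do_login(self):
        self.login_calls += 1
        self.page = FakePage(logged=True)

    @need_login
    def iter_accounts(self):
        return ["account-1"]


b = FakeBrowser()
b.iter_accounts()      # not on any page yet, so do_login() runs first
b.iter_accounts()      # already on a logged page, so no second login
print(b.login_calls)   # 1
```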

class LoginBrowser(*args, **kwargs)[source]

Bases: PagesBrowser

A browser which supports login.

do_login()[source]

Abstract method to implement to log in to the website.

It is called when a login is needed.

do_logout()[source]

Log out from the website.

By default, simply clears the cookies.

class StatesMixin[source]

Bases: object

Mixin to store the state of a browser.

STATE_DURATION = None

Duration in minutes, used to set the expiration datetime of the state.
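As a small illustration, an expiration datetime can be derived from a duration in minutes like this. `expiration` is a hypothetical helper, and treating None as "no expiration" is an assumption; the actual return format of get_expire() is not described here.

```python
from datetime import datetime, timedelta


def expiration(state_duration, now=None):
    """Return when a state saved at `now` expires, or None if no duration is set."""
    if state_duration is None:
        # Assumption: no duration means the state carries no expiration.
        return None
    if now is None:
        now = datetime.now()
    return now + timedelta(minutes=state_duration)


saved_at = datetime(2024, 1, 1, 12, 0, 0)
print(expiration(20, now=saved_at))  # 2024-01-01 12:20:00
```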

locate_browser(state)[source]
load_state(state)[source]
get_expire()[source]
dump_state()[source]
class APIBrowser(*args, **kwargs)[source]

Bases: DomainBrowser

A browser for API websites.

build_request(*args, **kwargs)[source]

Does the same job as open(), but returns a Request without submitting it. This allows further customization to the Request.

open(*args, **kwargs)[source]

Do a JSON request.

The “Content-Type” header is always set to “application/json”.

Parameters
  • data (dict) – if specified, format as JSON and send as request body

  • headers (dict) – if specified, add these headers to the request
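The two parameters above can be pictured with a standard-library sketch that prepares the headers and body of a JSON request. `build_json_request` is a hypothetical helper, not part of woob; it only mirrors the documented rules (data is serialized as JSON, extra headers are added, and Content-Type is always application/json).

```python
import json


def build_json_request(data=None, headers=None):
    """Return (headers, body) for a JSON request, per the rules described above."""
    merged = dict(headers or {})
    # The Content-Type header is always set, even if the caller passed another one.
    merged["Content-Type"] = "application/json"
    body = None if data is None else json.dumps(data)
    return merged, body


hdrs, body = build_json_request({"id": 42}, headers={"X-Token": "abc"})
print(hdrs)
print(body)  # {"id": 42}
```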

request(*args, **kwargs)[source]

Do a JSON request and parse the response.

Returns

a dict containing the parsed JSON server response

Return type

dict

exception AbstractBrowserMissingParentError[source]

Bases: Exception

class MetaBrowser(name, bases, dct)[source]

Bases: type

class AbstractBrowser[source]

Bases: object

Deprecated since version 3.4: Don’t use this class; import woob_modules.other_module.etc instead.

class OAuth2Mixin(*args, **kwargs)[source]

Bases: StatesMixin

AUTHORIZATION_URI = None
ACCESS_TOKEN_URI = None
SCOPE = ''
client_id = None
client_secret = None
redirect_uri = None
access_token = None
access_token_expire = None
auth_uri = None
token_type = None
refresh_token = None
oauth_state = None
authorized_date = None
build_request(*args, **kwargs)[source]
dump_state()[source]
load_state(state)[source]
raise_for_status(response)[source]
property logged
do_login()[source]
build_authorization_parameters()[source]
build_authorization_uri()[source]
request_authorization()[source]
handle_callback_error(values)[source]
build_access_token_parameters(values)[source]
do_token_request(data)[source]
request_access_token(auth_uri)[source]
build_refresh_token_parameters()[source]
use_refresh_token()[source]
update_token(auth_response)[source]
class OAuth2PKCEMixin(*args, **kwargs)[source]

Bases: OAuth2Mixin

code_verifier(bytes_number=64)[source]
code_challenge(verifier)[source]
build_authorization_parameters()[source]
build_access_token_parameters(values)[source]
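The method names code_verifier and code_challenge match the PKCE terms of RFC 7636. The sketch below shows the S256 derivation defined by that RFC using only the standard library. It reuses the signatures listed above, but the bodies are assumptions and may differ from what OAuth2PKCEMixin actually does.

```python
import base64
import hashlib
import secrets


def _b64url(raw):
    """URL-safe base64 without '=' padding, as RFC 7636 requires."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def code_verifier(bytes_number=64):
    """Random high-entropy verifier built from `bytes_number` random bytes."""
    return _b64url(secrets.token_bytes(bytes_number))


def code_challenge(verifier):
    """S256 challenge: base64url(SHA-256(ASCII(verifier)))."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return _b64url(digest)


verifier = code_verifier()
challenge = code_challenge(verifier)
print(len(verifier), len(challenge))  # 86 43
```

In the PKCE flow the challenge is sent with the authorization request and the verifier with the token request, so the server can check that both come from the same client.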