woob.browser.browsers¶
- class Browser(logger=None, proxy=None, responses_dirname=None, proxy_headers=None, woob=None, weboob=None, *, verify=None)[source]¶
Bases: object
Simple browser class. Acts like a browser, and doesn’t try to do too much.
>>> with Browser() as browser:
...     browser.open('https://example.org')
...
<Response [200]>
- Parameters:
logger (logging.Logger) – parent logger (optional) (default: None)
proxy (dict) – use a proxy (dictionary with http/https as key and URI as value) (optional) (default: None)
responses_dirname (str) – save responses to this directory (optional) (default: None)
proxy_headers (dict) – headers to supply to proxy (optional) (default: None)
verify (None, bool or str) – either a boolean, in which case it controls whether we verify the server’s TLS certificate, or a string, in which case it must be a path to a CA bundle to use. If not given, the Browser.VERIFY attribute is used. (default: None)
- PROFILE: ClassVar[Profile] = <woob.browser.profiles.Firefox object>¶
Default profile used by browser to navigate on websites.
- REFRESH_MAX: ClassVar[float] = 0.0¶
When handling a Refresh header, the browser follows it only if the sleep time is less than this value.
- VERIFY: ClassVar[bool|str] = True¶
Check SSL certificates.
If this is a string, path to the certificate or the CA bundle.
Note that this value may be overridden by the verify argument of the constructor.
- ALLOW_REFERRER: ClassVar[bool] = True¶
Controls whether the Referer header is sent. If True, referrers are always sent; if False, never; if None, only when the request stays within the same domain.
- HTTP_ADAPTER_CLASS¶
Adapter class to use.
alias of HTTPAdapter
- COOKIE_POLICY: ClassVar[CookiePolicy|None] = None¶
Default CookieJar policy.
Example: BlockAllCookies()
- deinit()[source]¶
Deinitialisation of the browser.
Call it when you are done with the browser and did not use it as a context manager.
Can be overridden by any subclass that needs to clean up after browser usage.
- set_normalized_url(response, **kwargs)[source]¶
Set the normalized URL on the response.
- Parameters:
response (requests.Response) – the response to change
- save_response(response, warning=False, **kwargs)[source]¶
Save responses.
By default, it creates a HAR file and appends the request and response to it.
If WOOB_USE_OBSOLETE_RESPONSES_DIR is set to 1, it creates a directory instead, and each request is saved in three files:
0X-url-request.txt
0X-url-response.txt
0X-url.EXT
Information about which files are created is displayed in the logs.
Also, if WOOB_CURLIFY_REQUEST is set to 1, 0X-url-request.txt is filled with a ready-to-use curl command based on the request.
- Parameters:
response (requests.Response) – the response to save
warning (bool) – if True, display the saving logs as warnings (default: False)
- location(url, **kwargs)[source]¶
Like open() but also changes the current URL and response. This is the most common method to request web pages.
Other than that, it behaves exactly like open().
- Return type:
- open(url, *, referrer=None, allow_redirects=True, stream=None, timeout=None, verify=None, cert=None, proxies=None, data_encoding=None, is_async=False, callback=None, **kwargs)[source]¶
Make an HTTP request like a browser does:
follow redirects (unless disabled)
provide referrers (unless disabled)
Unless a method is explicitly provided, it makes a GET request, or a POST if data is not None. An empty data (like '' or {}, not None) will make a POST.
It is a wrapper around session.request(). All session.request() options are available. You should use location() or open() and not session.request(), since they add some useful behavior, each part of which can be disabled individually through the arguments.
Call this instead of location() if you do not want to “visit” the URL (for instance, you are downloading a file).
When is_async is True, open() returns a Future object (see concurrent.futures for more details), which can be evaluated with its result() method. If any exception is raised while processing the request, it is caught and re-raised when calling result().
For example:
>>> Browser().open('https://google.com', is_async=True).result().text
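The deferred error behavior described for is_async is that of concurrent.futures in general. Below is a minimal standard-library sketch of it. The fetch() function is a hypothetical stand-in for the request and is not part of woob; how woob schedules the work internally is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    # Hypothetical stand-in for a network request; fails for one URL on purpose.
    if url.endswith("/broken"):
        raise ValueError(f"cannot fetch {url}")
    return f"content of {url}"


with ThreadPoolExecutor(max_workers=2) as pool:
    ok = pool.submit(fetch, "https://example.org/")
    bad = pool.submit(fetch, "https://example.org/broken")

    # result() blocks until the work is done and returns its value.
    text = ok.result()

    # An exception raised in the worker surfaces only when result() is called.
    try:
        bad.result()
        error = None
    except ValueError as exc:
        error = str(exc)
```

Submitting the failing task does not raise; the error only appears at the point where the caller asks for the result.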
- Parameters:
params – (optional) Dictionary, list of tuples or bytes to send in the query string
data – (optional) Dictionary, list of tuples, bytes, or file-like object to send in the body
json – (optional) A JSON serializable Python object to send in the body
headers – (optional) Dictionary of HTTP Headers to send
cookies – (optional) Dict or CookieJar object to send
files – (optional) Dictionary of 'name': file-like-objects (or {'name': file-tuple}) for multipart encoding upload. file-tuple can be a 2-tuple ('filename', fileobj), 3-tuple ('filename', fileobj, 'content_type') or a 4-tuple ('filename', fileobj, 'content_type', custom_headers), where 'content-type' is a string defining the content type of the given file and custom_headers a dict-like object containing additional headers to add for the file.
auth – (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
referrer (str or False or None) – (optional) Force referrer. False to disable sending it, None for guessing (default: None)
allow_redirects (bool) – (optional) if True, follow HTTP redirects (default: True)
stream (bool|None) – (optional) if False, the response content will be immediately downloaded. (default: None)
timeout (float or tuple) – (optional) How many seconds to wait for the server to send data before giving up, as a float, or a tuple. (default: None)
verify (str|bool|None) – (optional) Either a boolean, in which case it controls whether we verify the server’s TLS certificate, or a string, in which case it must be a path to a CA bundle to use. If not provided, uses the Browser.VERIFY class attribute value, the Browser.verify attribute one, or True. (default: None)
cert (str|tuple[str,str]|None) – (optional) if String, path to ssl client cert file (.pem). If Tuple, (‘cert’, ‘key’) pair. (default: None)
proxies (dict|None) – (optional) Dictionary mapping protocol to the URL of the proxy. (default: None)
is_async (bool) – (optional) Process request in a non-blocking way (default: False)
callback (callable) – (optional) Callback to be called when request has finished, with response as its first and only argument (default: None)
- Returns:
requests.Response object
- Return type:
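The rule open() uses to choose between GET and POST can be sketched as follows. The pick_method() helper is hypothetical and only restates the rule from the description; it is not a woob function.

```python
def pick_method(method=None, data=None):
    """Return the HTTP verb implied by the arguments (illustrative helper)."""
    if method is not None:
        # An explicit method always wins.
        return method
    # Only None means "no body": empty values such as '' or {} still select POST.
    return "POST" if data is not None else "GET"
```

The check is against None rather than truthiness, which is why an empty string or dict still produces a POST.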
- raise_for_status(response)[source]¶
Like requests.Response.raise_for_status() but raises more specific exception classes:
HTTPNotFound for 404
ClientError for 4xx errors
ServerError for 5xx errors
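A rough sketch of this status-to-exception mapping is shown below. The exception classes are local stand-ins that only borrow the names listed above, and their inheritance is an assumption made for the sketch; classify() is a hypothetical helper.

```python
class HTTPError(Exception):
    """Local stand-in base class (assumed for the sketch)."""


class ClientError(HTTPError):
    """Stand-in for 4xx errors."""


class HTTPNotFound(ClientError):
    """Stand-in for 404 specifically."""


class ServerError(HTTPError):
    """Stand-in for 5xx errors."""


def classify(status_code):
    """Return the exception class matching a status code, or None if no error."""
    if status_code == 404:
        # The most specific case is checked first.
        return HTTPNotFound
    if 400 <= status_code < 500:
        return ClientError
    if 500 <= status_code < 600:
        return ServerError
    return None
```

Checking 404 before the general 4xx range is what lets callers catch the not-found case separately from other client errors.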
- build_request(url, *, referrer=None, data_encoding=None, **kwargs)[source]¶
Does the same job as open(), but returns a Request without submitting it. This allows further customization of the Request.
- Return type:
- prepare_request(req)[source]¶
Get a prepared request from a Request object.
This method is meant to be overridden by child classes.
- Return type:
- REFRESH_RE = re.compile('^(?P<sleep>[\\d\\.]+)(;\\s*url=[\\"\']?(?P<url>.*?)[\\"\']?)?$', re.IGNORECASE)¶
- handle_refresh(response)[source]¶
Called by open() to handle the Refresh HTTP header.
It only redirects to the refresh URL if the sleep time is less than REFRESH_MAX.
- Return type:
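To illustrate what REFRESH_RE extracts from a Refresh header, here is a small sketch that reuses the pattern shown above. The parse_refresh() and should_follow() helpers are hypothetical. The strict comparison follows the wording of the description; how the real method treats the boundary value is not confirmed here.

```python
import re

# Same pattern as the REFRESH_RE class attribute documented above.
REFRESH_RE = re.compile(
    r'''^(?P<sleep>[\d\.]+)(;\s*url=[\"']?(?P<url>.*?)[\"']?)?$''',
    re.IGNORECASE,
)


def parse_refresh(value):
    """Return (sleep_seconds, url_or_None), or None if the header is malformed."""
    match = REFRESH_RE.match(value)
    if match is None:
        return None
    return float(match.group("sleep")), match.group("url")


def should_follow(sleep, refresh_max):
    # Assumption: strict "less than", as worded in the description.
    return sleep < refresh_max
```

With the default REFRESH_MAX of 0.0, should_follow() is false for every delay, so under this reading refresh handling is opt-in per subclass.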
- get_referrer(oldurl, newurl)[source]¶
Get the referrer to send when doing a request. If we should not send a referrer, it will return None.
Reference: https://en.wikipedia.org/wiki/HTTP_referer
The behavior can be controlled through the ALLOW_REFERRER attribute: True always allows the referrer to be sent, False never does, and None allows it only when the new URL is within the same domain.
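The three ALLOW_REFERRER settings can be sketched with the standard library as follows. The pick_referrer() helper is hypothetical, and reading "same domain" as "same network location" is an assumption; the real method may apply additional checks.

```python
from urllib.parse import urlparse


def pick_referrer(oldurl, newurl, allow_referrer=True):
    """Return the referrer to send for newurl, or None to send nothing."""
    if oldurl is None or allow_referrer is False:
        return None
    if allow_referrer is True:
        return oldurl
    # allow_referrer is None: only send the referrer within the same host.
    # Assumption: "same domain" is taken to mean identical netloc.
    if urlparse(oldurl).netloc == urlparse(newurl).netloc:
        return oldurl
    return None
```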
- exception UrlNotAllowed[source]¶
Bases: Exception
Raised by DomainBrowser when RESTRICT_URL is set and the browser tries to go to a URL not matching BASEURL.
- class DomainBrowser(baseurl=None, *args, **kwargs)[source]¶
Bases: Browser
A browser that handles relative URLs and can have a base URL (usually a domain).
For instance self.location('/hello') will get https://woob.tech/hello if BASEURL is 'https://woob.tech/'.
>>> class ExampleBrowser(DomainBrowser):
...     BASEURL = 'https://example.org'
...
>>> with ExampleBrowser() as browser:
...     browser.open('/')
...
<Response [200]>
- RESTRICT_URL: ClassVar[bool|list[str]] = False¶
URLs allowed to load. This can be used to force SSL (if the BASEURL is SSL) or to prevent any other leakage. Set to True to allow only URLs starting with the BASEURL. Set it to a list of allowed URLs if you have multiple allowed URLs. More complex behavior is possible by overriding url_allowed().
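The three RESTRICT_URL settings can be sketched as a simple prefix check. This standalone function only borrows the url_allowed name; its signature is hypothetical, and treating list entries as prefixes is an assumption rather than documented behavior.

```python
def url_allowed(url, baseurl, restrict_url=False):
    """Return True if url may be loaded under the given restriction."""
    if restrict_url is False:
        # No restriction at all.
        return True
    if restrict_url is True:
        prefixes = [baseurl]
    else:
        # Assumption: each list entry is matched as a URL prefix.
        prefixes = list(restrict_url)
    return any(url.startswith(prefix) for prefix in prefixes)
```

With an https BASEURL and RESTRICT_URL set to True, a plain http URL on the same host fails the check, which is how the setting can be used to force SSL.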
- absurl(uri, base=None)[source]¶
Get the absolute URL, relative to a base URL. If base is None, it will try to use the current URL. If there is no current URL, it will try to use BASEURL.
If base is False, it will always try to use the current URL. If base is True, it will always try to use BASEURL.
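The way the base argument selects a base URL can be sketched with urllib.parse.urljoin. This is a standalone illustration: the current_url and baseurl parameters stand in for browser state, and returning the URI unchanged when no base is available is an assumption.

```python
from urllib.parse import urljoin


def absurl(uri, base=None, current_url=None, baseurl=None):
    """Resolve uri against the base chosen by the rules described above."""
    if base is None:
        # Prefer the current URL, fall back to BASEURL.
        base = current_url or baseurl
    elif base is False:
        base = current_url
    elif base is True:
        base = baseurl
    if base is None:
        # Assumption: with nothing to resolve against, return the URI as-is.
        return uri
    return urljoin(base, uri)
```

Passing a string as base skips all three branches and resolves directly against that string.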
- open(url, *args, **kwargs)[source]¶
Like Browser.open() but handles URLs without domains, using the BASEURL attribute.
- Return type:
- class PagesBrowser(*args, **kwargs)[source]¶
Bases: DomainBrowser
A browser which works with pages and keeps the state of navigation.
To use it, you have to derive it and create URL objects as class attributes. When open() or location() are called, if the url matches one of the URL objects, it returns a Page object. In the case of location(), it stores it in self.page.
Example:
>>> import re
>>> from .pages import HTMLPage
>>> class ListPage(HTMLPage):
...     def get_items(self):
...         for link in self.doc.xpath('//a[matches(@href, "list-\d+.html")]/@href'):
...             yield re.match('list-(\d+).html', link).group(1)
...
>>> class ItemPage(HTMLPage):
...     def iter_values(self):
...         for el in self.doc.xpath('//li'):
...             yield el.text
...
>>> class MyBrowser(PagesBrowser):
...     BASEURL = 'https://woob.tech/tests/'
...     list = URL(r'$', ListPage)
...     item = URL(r'list-(?P<id>\d+)\.html', ItemPage)
...
>>> b = MyBrowser()
>>> b.list.go()
<woob.browser.browsers.ListPage object at 0x...>
>>> b.page.url
'https://woob.tech/tests/'
>>> list(b.page.get_items())
['1', '2']
>>> b.item.build(id=42)
'https://woob.tech/tests/list-42.html'
>>> b.item.go(id=1)
<woob.browser.browsers.ItemPage object at 0x...>
>>> list(b.page.iter_values())
['One', 'Two']
- open(*args, **kwargs)[source]¶
Same method as open(), but the response contains a page attribute if the url matches any URL object.
- Return type:
- location(*args, **kwargs)[source]¶
Same method as location(), but if the url matches any URL object, a page attribute is added to the response, and the page attribute is set on the browser.
- Return type:
- pagination(func, *args, **kwargs)[source]¶
This helper function can be used to handle paginated pages easily.
When the called function raises a NextPage exception, it goes to the requested page and calls the function again.
The NextPage constructor can take a URL or a Request object.
>>> from .pages import HTMLPage
>>> class Page(HTMLPage):
...     def iter_values(self):
...         for el in self.doc.xpath('//li'):
...             yield el.text
...         for next in self.doc.xpath('//a'):
...             raise NextPage(next.attrib['href'])
...
>>> class Browser(PagesBrowser):
...     BASEURL = 'https://woob.tech'
...     list = URL('/tests/list-(?P<pagenum>\d+).html', Page)
...
>>> b = Browser()
>>> b.list.go(pagenum=1)
<woob.browser.browsers.Page object at 0x...>
>>> list(b.pagination(lambda: b.page.iter_values()))
['One', 'Two', 'Three', 'Four']
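The exception-driven loop behind this helper can be sketched without any network access. Everything below is a simplified stand-in: the NextPage class, FakeBrowser, and the PAGES table are hypothetical and only mimic the control flow described above, not woob's implementation.

```python
class NextPage(Exception):
    """Stand-in exception carrying the next URL to visit."""

    def __init__(self, request):
        super().__init__()
        self.request = request


# Hypothetical site: each URL maps to (items on the page, next URL or None).
PAGES = {
    "/list-1": (["One", "Two"], "/list-2"),
    "/list-2": (["Three", "Four"], None),
}


class FakeBrowser:
    def __init__(self):
        self.current = None

    def location(self, url):
        self.current = url

    def iter_values(self):
        items, next_url = PAGES[self.current]
        yield from items
        if next_url is not None:
            raise NextPage(next_url)

    def pagination(self, func, *args, **kwargs):
        while True:
            try:
                # Yield everything from the current page.
                yield from func(*args, **kwargs)
            except NextPage as exc:
                # Move to the requested page, then call func again.
                self.location(exc.request)
            else:
                return


b = FakeBrowser()
b.location("/list-1")
values = list(b.pagination(b.iter_values))
```

The page function stays unaware of the loop: it yields what it sees and raises when there is more, and the helper decides what to do next.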
- need_login(func)[source]¶
Decorator used to require being logged in to access this function.
This decorator can be used on any method whose first argument is a browser (typically a LoginBrowser). It checks for the logged attribute in the current browser’s page: when this attribute is set to True (e.g., when the page inherits LoggedPage), then nothing special happens.
In all other cases (when the browser isn’t on any defined page or when the page’s logged attribute is False), the LoginBrowser.do_login() method of the browser is called before calling func.
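A simplified sketch of such a decorator is shown below. It follows the description above but is not woob's code: DemoBrowser and LoggedInPage are hypothetical, and the real decorator may handle additional cases.

```python
import functools


def need_login(func):
    """Call browser.do_login() first unless the current page is a logged one."""

    @functools.wraps(func)
    def wrapper(browser, *args, **kwargs):
        page = getattr(browser, "page", None)
        # No page at all, or a page without logged=True, triggers a login.
        if page is None or not getattr(page, "logged", False):
            browser.do_login()
        return func(browser, *args, **kwargs)

    return wrapper


class LoggedInPage:
    # Hypothetical page type that marks an authenticated session.
    logged = True


class DemoBrowser:
    def __init__(self):
        self.page = None
        self.login_calls = 0

    def do_login(self):
        self.login_calls += 1
        self.page = LoggedInPage()

    @need_login
    def get_balance(self):
        return 42


demo = DemoBrowser()
first = demo.get_balance()
second = demo.get_balance()
```

The second call does not log in again, because the browser is already on a page whose logged attribute is True.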
- class LoginBrowser(username, password, *args, **kwargs)[source]¶
Bases: PagesBrowser
A browser which supports login.
- class StatesMixin[source]¶
Bases: object
Mixin to store states of browser.
It saves and loads a state dict object. By default it contains the current url and cookies, but may be overridden by the subclass to store its specific stuff.
- STATE_DURATION: ClassVar[int|float|None] = None¶
In minutes, used to set an expiration datetime object of the state.
- get_expire()[source]¶
Get expiration of the state object, using the STATE_DURATION class attribute.
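As an illustration of how a duration in minutes turns into an expiration time, here is a hedged standalone sketch. The function signature, the None result when no duration is set, and the datetime return type are all assumptions; the text above does not specify what the real method returns.

```python
from datetime import datetime, timedelta


def get_expire(state_duration, now=None):
    """Return the expiration datetime for a state saved at `now`."""
    if state_duration is None:
        # Assumption: no duration means no computed expiration.
        return None
    if now is None:
        now = datetime.now()
    # STATE_DURATION is expressed in minutes.
    return now + timedelta(minutes=state_duration)
```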
- class APIBrowser(baseurl=None, *args, **kwargs)[source]¶
Bases: DomainBrowser
A browser for API websites.
- build_request(*args, **kwargs)[source]¶
Does the same job as open(), but returns a Request without submitting it. This allows further customization of the Request.
- Return type:
- exception AbstractBrowserMissingParentError[source]¶
Bases: Exception
Deprecated since version 3.4: Don’t use this class, import woob_modules.other_module.etc instead
- class MetaBrowser(name, bases, dct)[source]¶
Bases: type
Deprecated since version 3.4: Don’t use this class, import woob_modules.other_module.etc instead
- class AbstractBrowser[source]¶
Bases: object
Deprecated since version 3.4: Don’t use this class, import woob_modules.other_module.etc instead
- class OAuth2Mixin(*args, **kwargs)[source]¶
Bases: StatesMixin
- AUTHORIZATION_URI: ClassVar[str, None] = None¶
OAuth2 Authorization URI.
- ACCESS_TOKEN_URI: ClassVar[str, None] = None¶
OAuth2 route to exchange a code with an access_token.
- SCOPE: ClassVar[str] = ''¶
OAuth2 scope.
- client_id: str | None = None¶
- client_secret: str | None = None¶
- redirect_uri: str | None = None¶
- access_token: str | None = None¶
- access_token_expire: datetime | None = None¶
- auth_uri: str | None = None¶
- token_type: str | None = None¶
- refresh_token: str | None = None¶
- oauth_state: str | None = None¶
- authorized_date: str | None = None¶
- callback_error_description = ('operation canceled by the client', 'login cancelled', 'consent denied', 'psu cancelled the transaction')¶
- class OAuth2PKCEMixin(*args, **kwargs)[source]¶
Bases: OAuth2Mixin
- class DigestMixin[source]¶
Bases: object
Browser mixin to add a Digest header compliant with RFC 3230 section 4.3.2.
- HTTP_DIGEST_ALGORITHM: str = 'SHA-256'¶
Digest algorithm used to obtain a hash of the request content.
The only supported digest algorithm for now is ‘SHA-256’.
- HTTP_DIGEST_METHODS: tuple[str, Ellipsis] | None = ('GET', 'POST', 'PUT', 'DELETE')¶
The list of HTTP methods on which to add a Digest header.
To add the Digest header to all methods, set this constant to None.
- HTTP_DIGEST_COMPACT_JSON: bool = False¶
If the content type of the request payload is JSON, compact it first.
- add_digest_header(preq)[source]¶
Add the Digest header to the prepared request.
The Digest header presence depends on the request:
If the request has a HTTP_DIGEST_INCLUDE header, the Digest header is added.
Otherwise, if the request has a HTTP_DIGEST_EXCLUDE header, the Digest header is not added.
Otherwise, if HTTP_DIGEST_METHODS is an HTTP method list and the request method is not in said list, the Digest header is not added.
Otherwise, the Digest header is added.
Note that the HTTP_DIGEST_INCLUDE and HTTP_DIGEST_EXCLUDE headers are removed from the request before sending it.
- Parameters:
preq (PreparedRequest) – The prepared request on which the Digest header is added.
- Return type:
class MyBrowser(DigestMixin, Browser):
    HTTP_DIGEST_METHODS = ('POST', 'PUT', 'DELETE')

my_browser = MyBrowser()
my_browser.open('https://example.org/')
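For readers who want to see what such a header value looks like, here is a standard-library sketch of a SHA-256 digest in the "algorithm=base64" form, together with the inclusion rules listed above. The digest_value() and wants_digest() helpers are hypothetical, and the mixin's exact output format and header handling are not confirmed by this page.

```python
import base64
import hashlib


def digest_value(body: bytes) -> str:
    """Build a 'SHA-256=<base64 digest>' value for a request body."""
    raw = hashlib.sha256(body).digest()
    return "SHA-256=" + base64.b64encode(raw).decode("ascii")


def wants_digest(method, headers, methods=("GET", "POST", "PUT", "DELETE")):
    """Apply the inclusion rules from the description, in order."""
    if "HTTP_DIGEST_INCLUDE" in headers:
        return True
    if "HTTP_DIGEST_EXCLUDE" in headers:
        return False
    if methods is not None and method not in methods:
        return False
    return True
```

The digest is computed over the raw body bytes, so any change to the payload serialization (such as compacting JSON) changes the value.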