woob.browser.browsers
- class Browser(logger=None, proxy=None, responses_dirname=None, proxy_headers=None, woob=None, weboob=None, *, verify=None)[source]¶
Bases:
object
Simple browser class. Acts like a browser, and doesn’t try to do too much.
>>> with Browser() as browser:
...     browser.open('https://example.org')
...
<Response [200]>
- Parameters:
logger (logging.Logger) – parent logger (optional) (default: None)
proxy (dict) – use a proxy (dictionary with http/https as key and URI as value) (optional) (default: None)
responses_dirname (str) – save responses to this directory (optional) (default: None)
proxy_headers (dict) – headers to supply to proxy (optional) (default: None)
verify (None, bool or str) – either a boolean, in which case it controls whether we verify the server’s TLS certificate, or a string, in which case it must be a path to a CA bundle to use. If not provided, the Browser.VERIFY attribute is used. (default: None)
- PROFILE: ClassVar[Profile] = <woob.browser.profiles.Firefox object>¶
Default profile used by browser to navigate on websites.
- REFRESH_MAX: ClassVar[float] = 0.0¶
When handling a Refresh header, the browser considers it only if the sleep time is less than this value.
- VERIFY: ClassVar[bool | str] = True¶
Check SSL certificates.
If this is a string, path to the certificate or the CA bundle.
Note that this value may be overridden by the verify argument of the constructor.
- ALLOW_REFERRER: ClassVar[bool] = True¶
Controls whether the Referer header is sent.
If True, referrers are always sent; if False, never; if None, only when the request stays within the same domain.
- HTTP_ADAPTER_CLASS¶
Adapter class to use.
alias of
HTTPAdapter
- COOKIE_POLICY: ClassVar[CookiePolicy | None] = None¶
Default CookieJar policy.
Example: BlockAllCookies()
- deinit()[source]¶
Deinitialisation of the browser.
Call it when you have finished using the browser and are not using it as a context manager.
Can be overridden by any subclass that needs to clean up after browser usage.
- set_normalized_url(response, **kwargs)[source]¶
Set the normalized URL on the response.
- Parameters:
response (requests.Response) – the response to change
- save_response(response, warning=False, **kwargs)[source]¶
Save responses.
By default, it creates a HAR file and appends each request and response to it.
If WOOB_USE_OBSOLETE_RESPONSES_DIR is set to 1, it creates a directory instead, and each request is saved in three files:
0X-url-request.txt
0X-url-response.txt
0X-url.EXT
Information about which files are created is displayed in the logs.
Also, if WOOB_CURLIFY_REQUEST is set to 1, 0X-url-request.txt is filled with a ready-to-use curl command based on the request.
- Parameters:
response (requests.Response) – the response to save
warning (bool) – if True, display the saving logs as warnings (default: False)
- location(url, **kwargs)[source]¶
Like open() but also changes the current URL and response. This is the most common method to request web pages.
Other than that, it has exactly the same behavior as open().
- Return type:
- open(url, *, referrer=None, allow_redirects=True, stream=None, timeout=None, verify=None, cert=None, proxies=None, data_encoding=None, is_async=False, callback=None, **kwargs)[source]¶
Make an HTTP request like a browser does:
follow redirects (unless disabled)
provide referrers (unless disabled)
Unless a method is explicitly provided, it makes a GET request, or a POST if data is not None. An empty data (like '' or {}, not None) will make a POST.
It is a wrapper around session.request(). All session.request() options are available. You should use location() or open() and not session.request(), since it has some interesting additions, each of which can easily be disabled through the arguments.
Call this instead of location() if you do not want to “visit” the URL (for instance, you are downloading a file).
When is_async is True, open() returns a Future object (see concurrent.futures for more details), which can be evaluated with its result() method. If any exception is raised while processing the request, it is caught and re-raised when calling result().
For example:
>>> Browser().open('https://google.com', is_async=True).result().text
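The is_async behaviour described above follows the usual concurrent.futures pattern. The sketch below uses only the standard library; fetch() and open_async() are hypothetical stand-ins, not woob functions, and the choice to make the callback's return value the Future's result is an assumption. It only illustrates how an exception raised in the worker surfaces when result() is called.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Hypothetical stand-in for a blocking HTTP request."""
    if not url.startswith("https://"):
        raise ValueError(f"refusing non-HTTPS URL: {url}")
    return f"response for {url}"


def open_async(executor, url, callback=None):
    """Submit fetch() to the pool and return a Future immediately."""
    def task():
        response = fetch(url)
        # The callback receives the response as its first and only argument.
        # Assumption: its return value becomes the Future's result.
        return callback(response) if callback else response
    return executor.submit(task)


seen = []
with ThreadPoolExecutor(max_workers=2) as pool:
    ok = open_async(pool, "https://example.org/")
    bad = open_async(pool, "http://example.org/")
    cb = open_async(pool, "https://example.org/cb",
                    callback=lambda r: seen.append(r) or r)

    ok_result = ok.result()      # blocks until the request has finished
    cb_result = cb.result()
    try:
        bad.result()             # the worker's exception is re-raised here
        error = None
    except ValueError as exc:
        error = str(exc)
```

Nothing fails at submission time: the ValueError only appears when result() is called on the failing Future.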
- Parameters:
params – (optional) Dictionary, list of tuples or bytes to send in the query string
data – (optional) Dictionary, list of tuples, bytes, or file-like object to send in the body
json – (optional) A JSON serializable Python object to send in the body
headers – (optional) Dictionary of HTTP Headers to send
cookies – (optional) Dict or CookieJar object to send
files – (optional) Dictionary of 'name': file-like-objects (or {'name': file-tuple}) for multipart encoding upload. file-tuple can be a 2-tuple ('filename', fileobj), 3-tuple ('filename', fileobj, 'content_type') or a 4-tuple ('filename', fileobj, 'content_type', custom_headers), where 'content-type' is a string defining the content type of the given file and custom_headers a dict-like object containing additional headers to add for the file.
auth – (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
referrer (str or False or None) – (optional) Force referrer. False to disable sending it, None for guessing (default: None)
allow_redirects (bool) – (optional) if True, follow HTTP redirects (default: True)
stream (bool | None) – (optional) if False, the response content will be immediately downloaded. (default: None)
timeout (float or tuple) – (optional) How many seconds to wait for the server to send data before giving up, as a float, or a tuple. (default: None)
verify (str | bool | None) – (optional) Either a boolean, in which case it controls whether we verify the server’s TLS certificate, or a string, in which case it must be a path to a CA bundle to use. If not provided, uses the Browser.VERIFY class attribute value, the Browser.verify attribute one, or True. (default: None)
cert (str | Tuple[str, str] | None) – (optional) if String, path to ssl client cert file (.pem). If Tuple, (‘cert’, ‘key’) pair. (default: None)
proxies (Dict | None) – (optional) Dictionary mapping protocol to the URL of the proxy. (default: None)
is_async (bool) – (optional) Process request in a non-blocking way (default: False)
callback (callable) – (optional) Callback to be called when request has finished, with response as its first and only argument (default: None)
- Returns:
requests.Response object
- Return type:
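The method selection rule described for open() (GET by default, POST as soon as data is anything other than None) can be sketched as follows. guess_method() is a hypothetical helper written for illustration; it is not part of woob.

```python
def guess_method(method=None, data=None):
    """Return the HTTP verb implied by the arguments (illustrative only)."""
    if method is not None:
        # An explicit method always wins.
        return method
    # Compare against None explicitly: '' and {} are falsy, but according to
    # the documentation they must still trigger a POST.
    return "GET" if data is None else "POST"
```

The explicit `is None` check is the point of the sketch: a truthiness test such as `if data:` would wrongly turn an empty form into a GET.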
- raise_for_status(response)[source]¶
Like requests.Response.raise_for_status() but raises more specific exception classes:
HTTPNotFound for 404
ClientError for 4xx errors
ServerError for 5xx errors
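The status-to-exception mapping listed above can be sketched as below. The classes are local stand-ins that reuse the documented names; their hierarchy (HTTPNotFound deriving from ClientError, and the SketchHTTPError base) is an assumption made for this example, and exception_for() is a hypothetical helper.

```python
class SketchHTTPError(Exception):
    """Hypothetical common base class for this sketch."""


class ClientError(SketchHTTPError):
    """Stand-in for the 4xx exception."""


class HTTPNotFound(ClientError):
    """Stand-in for the 404 exception (assumed to be a ClientError)."""


class ServerError(SketchHTTPError):
    """Stand-in for the 5xx exception."""


def exception_for(status_code):
    """Return the exception class matching a status code, or None."""
    if status_code == 404:
        return HTTPNotFound          # most specific case first
    if 400 <= status_code < 500:
        return ClientError
    if 500 <= status_code < 600:
        return ServerError
    return None


def raise_for_status(status_code):
    """Raise the matching exception, or do nothing for other codes."""
    exc_class = exception_for(status_code)
    if exc_class is not None:
        raise exc_class(f"HTTP {status_code}")
```

Under the assumed hierarchy, a caller can catch ClientError to handle every 4xx response, or HTTPNotFound alone to treat a missing resource differently.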
- build_request(url, *, referrer=None, data_encoding=None, **kwargs)[source]¶
Does the same job as open(), but returns a Request without submitting it. This allows further customization to the Request.
- Return type:
- prepare_request(req)[source]¶
Get a prepared request from a Request object.
This method is meant to be overridden by child classes.
- Return type:
- REFRESH_RE = re.compile('^(?P<sleep>[\\d\\.]+)(;\\s*url=[\\"\']?(?P<url>.*?)[\\"\']?)?$', re.IGNORECASE)¶
- handle_refresh(response)[source]¶
Called by open() to handle the Refresh HTTP header.
It only redirects to the refresh URL if the sleep time is less than REFRESH_MAX.
- Return type:
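As an illustration of the Refresh handling described above, the sketch below reuses the REFRESH_RE pattern shown in this reference to parse a header value. parse_refresh() and refresh_target() are hypothetical helpers, not woob functions. Two points are assumptions: falling back to the current URL when the header carries no url part, and treating the REFRESH_MAX comparison as strict.

```python
import re

# Same pattern as the REFRESH_RE attribute documented above.
REFRESH_RE = re.compile(
    r'^(?P<sleep>[\d\.]+)(;\s*url=[\"\']?(?P<url>.*?)[\"\']?)?$',
    re.IGNORECASE,
)


def parse_refresh(header, current_url):
    """Return (sleep_seconds, target_url), or None if the header is invalid."""
    match = REFRESH_RE.match(header.strip())
    if match is None:
        return None
    try:
        sleep = float(match.group("sleep"))
    except ValueError:
        # The pattern accepts strings such as '1.2.3' that are not numbers.
        return None
    # Assumption: without a url part, the refresh targets the current URL.
    return sleep, match.group("url") or current_url


def refresh_target(header, current_url, refresh_max):
    """Return the URL to follow, or None when the refresh must be ignored."""
    parsed = parse_refresh(header, current_url)
    if parsed is None:
        return None
    sleep, url = parsed
    # Assumption: strict comparison, following the "less than" wording.
    return url if sleep < refresh_max else None
```

With the default REFRESH_MAX of 0.0 and a strict comparison, this sketch never follows a refresh; a subclass would raise the limit to opt in.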
- get_referrer(oldurl, newurl)[source]¶
Get the referrer to send when doing a request. If we should not send a referrer, it will return None.
Reference: https://en.wikipedia.org/wiki/HTTP_referer
The behavior can be controlled through the ALLOW_REFERRER attribute. True always allows the referers to be sent, False never, and None only if it is within the same domain.
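The three ALLOW_REFERRER settings described above can be sketched as a small standalone function. It is a hypothetical re-implementation for illustration, not woob's code: it compares network locations to decide what "same domain" means, and it covers only the rules stated here.

```python
from urllib.parse import urlsplit


def get_referrer(oldurl, newurl, allow_referrer=True):
    """Return the referrer to send for a request to newurl, or None."""
    if oldurl is None or allow_referrer is False:
        # No previous page, or referrers are disabled entirely.
        return None
    if allow_referrer is True:
        return oldurl
    # allow_referrer is None: send it only within the same domain.
    # Assumption: "same domain" means identical host (and port).
    if urlsplit(oldurl).netloc == urlsplit(newurl).netloc:
        return oldurl
    return None
```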
- exception UrlNotAllowed[source]¶
Bases:
Exception
Raised by DomainBrowser when RESTRICT_URL is set and the requested URL does not match BASEURL.
- class DomainBrowser(baseurl=None, *args, **kwargs)[source]¶
Bases:
Browser
A browser that handles relative URLs and can have a base URL (usually a domain).
For instance self.location('/hello') will get https://woob.tech/hello if BASEURL is 'https://woob.tech/'.
>>> class ExampleBrowser(DomainBrowser):
...     BASEURL = 'https://example.org'
...
>>> with ExampleBrowser() as browser:
...     browser.open('/')
...
<Response [200]>
- RESTRICT_URL: ClassVar[bool | List[str]] = False¶
URLs allowed to load. This can be used to force SSL (if the BASEURL is SSL) or to prevent any other leakage. Set to True to allow only URLs starting with the BASEURL. Set it to a list of allowed URLs if you have multiple allowed URLs. More complex behavior is possible by overriding url_allowed().
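A minimal sketch of the RESTRICT_URL rules above, written as a standalone function. The prefix matching is an assumption drawn from the "starting by the BASEURL" wording, and the real url_allowed() method may differ in signature and behaviour.

```python
def url_allowed(url, baseurl, restrict_url=False):
    """Illustrative check: may `url` be loaded under this restriction?"""
    if restrict_url is False:
        # No restriction configured.
        return True
    if restrict_url is True:
        # Only URLs under BASEURL are allowed.
        return url.startswith(baseurl)
    # Otherwise restrict_url is a list of allowed URL prefixes.
    return any(url.startswith(prefix) for prefix in restrict_url)
```

With an HTTPS BASEURL and the restriction set to True, a plain-HTTP URL on the same host is rejected, which is the "force SSL" use mentioned above.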
- absurl(uri, base=None)[source]¶
Get the absolute URL, relative to a base URL. If base is None, it will try to use the current URL. If there is no current URL, it will try to use BASEURL.
If base is False, it will always try to use the current URL. If base is True, it will always try to use BASEURL.
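The base-selection rules of absurl() can be sketched with urllib.parse.urljoin. The function below is a hypothetical standalone version: current_url and baseurl are passed explicitly instead of being read from the browser, and returning the URI unchanged when no base is available is an assumption.

```python
from urllib.parse import urljoin


def absurl(uri, base=None, current_url=None, baseurl=None):
    """Resolve `uri` against the base chosen by the documented rules."""
    if base is None:
        # Prefer the current URL, fall back to BASEURL.
        base = current_url or baseurl
    elif base is False:
        base = current_url
    elif base is True:
        base = baseurl
    if base is None:
        # Assumption: nothing to resolve against, return the URI as given.
        return uri
    return urljoin(base, uri)
```

For example, `absurl('/hello', baseurl='https://woob.tech/')` gives `'https://woob.tech/hello'`, matching the DomainBrowser description.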
- open(url, *args, **kwargs)[source]¶
Like Browser.open() but handles URLs without domains, using the BASEURL attribute.
- Return type:
- class PagesBrowser(*args, **kwargs)[source]¶
Bases:
DomainBrowser
A browser which works with pages and keeps the navigation state.
To use it, you have to derive it and create URL objects as class attributes. When open() or location() are called, if the url matches one of the URL objects, it returns a Page object. In the case of location(), it also stores it in self.page.
Example:
>>> import re
>>> from .pages import HTMLPage
>>> class ListPage(HTMLPage):
...     def get_items(self):
...         for link in self.doc.xpath(r'//a[matches(@href, "list-\d+.html")]/@href'):
...             yield re.match(r'list-(\d+).html', link).group(1)
...
>>> class ItemPage(HTMLPage):
...     def iter_values(self):
...         for el in self.doc.xpath('//li'):
...             yield el.text
...
>>> class MyBrowser(PagesBrowser):
...     BASEURL = 'https://woob.tech/tests/'
...     list = URL(r'$', ListPage)
...     item = URL(r'list-(?P<id>\d+)\.html', ItemPage)
...
>>> b = MyBrowser()
>>> b.list.go()
<woob.browser.browsers.ListPage object at 0x...>
>>> b.page.url
'https://woob.tech/tests/'
>>> list(b.page.get_items())
['1', '2']
>>> b.item.build(id=42)
'https://woob.tech/tests/list-42.html'
>>> b.item.go(id=1)
<woob.browser.browsers.ItemPage object at 0x...>
>>> list(b.page.iter_values())
['One', 'Two']
- open(*args, **kwargs)[source]¶
Same method as open(), but the response contains an attribute page if the url matches any URL object.
- Return type:
- location(*args, **kwargs)[source]¶
Same method as location(), but if the url matches any URL object, an attribute page is added to the response, and the attribute page is set on the browser.
- Return type:
- pagination(func, *args, **kwargs)[source]¶
This helper function can be used to handle paginated pages easily.
When the called function raises a NextPage exception, it goes to the wanted page and calls the function again. The NextPage constructor can take a URL or a Request object.
>>> from .pages import HTMLPage
>>> class Page(HTMLPage):
...     def iter_values(self):
...         for el in self.doc.xpath('//li'):
...             yield el.text
...         for next in self.doc.xpath('//a'):
...             raise NextPage(next.attrib['href'])
...
>>> class Browser(PagesBrowser):
...     BASEURL = 'https://woob.tech'
...     list = URL(r'/tests/list-(?P<pagenum>\d+).html', Page)
...
>>> b = Browser()
>>> b.list.go(pagenum=1)
<woob.browser.browsers.Page object at 0x...>
>>> list(b.pagination(lambda: b.page.iter_values()))
['One', 'Two', 'Three', 'Four']
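The control flow behind pagination() can be illustrated without any network access. The sketch below is a simplified stand-in, not woob's implementation: NextPage, SketchBrowser and the in-memory SITE mapping are all defined locally for the example. It shows a generator that yields items and then raises NextPage, and a driver that catches the exception, moves to the next page and calls the function again.

```python
class NextPage(Exception):
    """Local stand-in: carries the URL of the page to visit next."""

    def __init__(self, request):
        super().__init__()
        self.request = request


# Hypothetical in-memory "site": URL -> (items on the page, next URL or None).
SITE = {
    "/list-1": (["One", "Two"], "/list-2"),
    "/list-2": (["Three", "Four"], None),
}


class SketchBrowser:
    def __init__(self):
        self.url = None

    def location(self, url):
        self.url = url

    def iter_values(self):
        items, next_url = SITE[self.url]
        yield from items
        if next_url is not None:
            # Signal that more results live on another page.
            raise NextPage(next_url)

    def pagination(self, func):
        while True:
            try:
                yield from func()
            except NextPage as exc:
                # Go to the requested page, then call func() again.
                self.location(exc.request)
            else:
                return


browser = SketchBrowser()
browser.location("/list-1")
values = list(browser.pagination(browser.iter_values))
```

The caller sees one flat iterator; page changes happen inside pagination() whenever the wrapped generator raises NextPage.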
- need_login(func)[source]¶
Decorator used to require being logged in to access the decorated function.
This decorator can be used on any method whose first argument is a browser (typically a LoginBrowser). It checks for the logged attribute in the current browser’s page: when this attribute is set to True (e.g., when the page inherits LoggedPage), then nothing special happens.
In all other cases (when the browser isn’t on any defined page or when the page’s logged attribute is False), the LoginBrowser.do_login() method of the browser is called before calling the decorated function.
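The decision described above (log in first unless the current page reports logged as True) can be sketched as a plain decorator. This is a hypothetical re-implementation for illustration; AccountPage and SketchBrowser are made-up classes, and the real decorator may handle more cases.

```python
from functools import wraps


def need_login(func):
    """Call browser.do_login() first unless the current page is logged."""
    @wraps(func)
    def inner(browser, *args, **kwargs):
        page = getattr(browser, "page", None)
        # No page at all, or a page whose `logged` attribute is not True.
        if page is None or not getattr(page, "logged", False):
            browser.do_login()
        return func(browser, *args, **kwargs)
    return inner


class AccountPage:
    """Hypothetical page reachable only once logged in."""
    logged = True


class SketchBrowser:
    def __init__(self):
        self.page = None
        self.login_calls = 0

    def do_login(self):
        self.login_calls += 1
        self.page = AccountPage()

    @need_login
    def get_balance(self):
        return 42


browser = SketchBrowser()
first = browser.get_balance()    # triggers do_login()
second = browser.get_balance()   # already on a logged page
```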
- class LoginBrowser(username, password, *args, **kwargs)[source]¶
Bases:
PagesBrowser
A browser which supports login.
- class StatesMixin[source]¶
Bases:
object
Mixin to store states of browser.
It saves and loads a state dict object. By default it contains the current url and cookies, but this may be overridden by the subclass to store its own data.
- STATE_DURATION: ClassVar[int | float | None] = None¶
In minutes, used to set an expiration datetime object of the state.
- get_expire()[source]¶
Get expiration of the state object, using the STATE_DURATION class attribute.
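Since STATE_DURATION is expressed in minutes, the expiration computation can be sketched as below. The function is a hypothetical standalone version: it takes the duration and an optional reference time as arguments, and returning None when no duration is set is an assumption, as is the return type.

```python
from datetime import datetime, timedelta


def get_expire(state_duration, now=None):
    """Return the expiration datetime for a duration given in minutes."""
    if state_duration is None:
        # Assumption: no duration configured means no expiration.
        return None
    if now is None:
        now = datetime.now()
    return now + timedelta(minutes=state_duration)
```

Passing `now` explicitly keeps the sketch deterministic in tests; timedelta accepts both integer and fractional minutes.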
- class APIBrowser(baseurl=None, *args, **kwargs)[source]¶
Bases:
DomainBrowser
A browser for API websites.
- build_request(*args, **kwargs)[source]¶
Does the same job as open(), but returns a Request without submitting it. This allows further customization to the Request.
- Return type:
- exception AbstractBrowserMissingParentError[source]¶
Bases:
Exception
Deprecated since version 3.4: Don’t use this class, import woob_modules.other_module.etc instead
- class MetaBrowser(name, bases, dct)[source]¶
Bases:
type
Deprecated since version 3.4: Don’t use this class, import woob_modules.other_module.etc instead
- class AbstractBrowser[source]¶
Bases:
object
Deprecated since version 3.4: Don’t use this class, import woob_modules.other_module.etc instead
- class OAuth2Mixin(*args, **kwargs)[source]¶
Bases:
StatesMixin
- AUTHORIZATION_URI: ClassVar[str, None] = None¶
OAuth2 Authorization URI.
- ACCESS_TOKEN_URI: ClassVar[str, None] = None¶
OAuth2 route to exchange a code with an access_token.
- SCOPE: ClassVar[str] = ''¶
OAuth2 scope.
- client_id: str | None = None¶
- client_secret: str | None = None¶
- redirect_uri: str | None = None¶
- access_token: str | None = None¶
- access_token_expire: datetime | None = None¶
- auth_uri: str | None = None¶
- token_type: str | None = None¶
- refresh_token: str | None = None¶
- oauth_state: str | None = None¶
- authorized_date: str | None = None¶
- callback_error_description = ('operation canceled by the client', 'login cancelled', 'consent denied', 'psu cancelled the transaction')¶
- class OAuth2PKCEMixin(*args, **kwargs)[source]¶
Bases:
OAuth2Mixin
- class DigestMixin[source]¶
Bases:
object
Browser mixin to add a Digest header compliant with RFC 3230 section 4.3.2.
- HTTP_DIGEST_ALGORITHM: str = 'SHA-256'¶
Digest algorithm used to obtain a hash of the request content.
The only supported digest algorithm for now is ‘SHA-256’.
- HTTP_DIGEST_METHODS: tuple[str, Ellipsis] | None = ('GET', 'POST', 'PUT', 'DELETE')¶
The list of HTTP methods on which to add a Digest header.
To add the Digest header to all methods, set this constant to None.
- HTTP_DIGEST_COMPACT_JSON: bool = False¶
If the content type of the request payload is JSON, compact it first.
- add_digest_header(preq)[source]¶
Add the Digest header to the prepared request.
The Digest header presence depends on the request:
If the request has a HTTP_DIGEST_INCLUDE header, the Digest header is added.
Otherwise, if the request has a HTTP_DIGEST_EXCLUDE header, the Digest header is not added.
Otherwise, if HTTP_DIGEST_METHODS is an HTTP method list and the request method is not in said list, the Digest header is not added.
Otherwise, the Digest header is added.
Note that the HTTP_DIGEST_INCLUDE and HTTP_DIGEST_EXCLUDE headers are removed from the request before sending it.
- Parameters:
preq (PreparedRequest) – The prepared request on which the Digest header is added.
- Return type:
Example:
class MyBrowser(DigestMixin, Browser):
    HTTP_DIGEST_METHODS = ('POST', 'PUT', 'DELETE')

my_browser = MyBrowser()
my_browser.open('https://example.org/')
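To make the Digest rules above concrete, the sketch below computes a SHA-256 Digest value in the `SHA-256=<base64>` form and reproduces the documented decision order. It is an illustration, not the mixin's code: the functions are hypothetical, the literal strings "HTTP_DIGEST_INCLUDE" and "HTTP_DIGEST_EXCLUDE" are used as placeholder header names, and the JSON separators chosen for compaction are an assumption.

```python
import base64
import hashlib
import json


def digest_value(body: bytes) -> str:
    """Return the Digest header value for a request body."""
    raw = hashlib.sha256(body).digest()
    return "SHA-256=" + base64.b64encode(raw).decode("ascii")


def compact_json(body: bytes) -> bytes:
    """Re-serialize a JSON payload without optional whitespace (assumed form)."""
    return json.dumps(json.loads(body), separators=(",", ":")).encode("utf-8")


def should_add_digest(method, headers,
                      digest_methods=("GET", "POST", "PUT", "DELETE")):
    """Apply the documented rules in order; header names are placeholders."""
    if "HTTP_DIGEST_INCLUDE" in headers:
        return True
    if "HTTP_DIGEST_EXCLUDE" in headers:
        return False
    if digest_methods is not None and method not in digest_methods:
        return False
    return True
```

Compaction matters because the digest covers the exact bytes sent: two JSON documents that differ only in whitespace produce different Digest values.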