woob.browser.browsers
- class Browser(*args, **kwargs)[source]¶
Bases:
object
Simple browser class. Act like a browser, and don’t try to do too much.
- PROFILE = <woob.browser.profiles.Firefox object>¶
Default profile used by the browser to navigate websites.
- TIMEOUT = 10.0¶
Default timeout during requests.
- REFRESH_MAX = 0.0¶
When handling a Refresh header, the browser follows it only if the sleep time is less than this value.
- MAX_RETRIES = 2¶
Maximum retries on failed requests.
- MAX_WORKERS = 10¶
Maximum number of threads for asynchronous requests.
- ALLOW_REFERRER = True¶
Controls the behavior of get_referrer.
- HTTP_ADAPTER_CLASS¶
Adapter class to use.
alias of
HTTPAdapter
- COOKIE_POLICY = None¶
Default CookieJar policy. Example: woob.browser.cookies.BlockAllCookies()
- VERIFY = True¶
Check SSL certificates.
- location(url, **kwargs)[source]¶
Like open(), but also changes the current URL and response. This is the most common method to request web pages. Otherwise, it behaves exactly like open().
- open(url, *, referrer=None, allow_redirects=True, stream=None, timeout=None, verify=None, cert=None, proxies=None, data_encoding=None, is_async=False, callback=<function Browser.<lambda>>, **kwargs)[source]¶
- Make an HTTP request like a browser does:
follow redirects (unless disabled)
provide referrers (unless disabled)
Unless a method is explicitly provided, it makes a GET request, or a POST if data is not None. Empty data (not None, such as '' or {}) will still make a POST.
It is a wrapper around session.request(), and all session.request() options are available. Use location() or open() rather than session.request(): they add browser-like behavior, and each addition can be disabled individually through the arguments.
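The method-selection rule above can be sketched as follows. This is only an illustration of the documented rule, not woob's implementation; `guess_method` is a hypothetical helper name.

```python
def guess_method(method=None, data=None):
    """Return the HTTP verb implied by the rule described above.

    Hypothetical helper for illustration only; it is not part of woob.
    """
    if method is not None:
        # An explicitly provided method always wins.
        return method
    # Only None means "no body": '' and {} are empty but still imply a POST.
    return "GET" if data is None else "POST"


print(guess_method())                       # GET
print(guess_method(data={}))                # POST
print(guess_method(data=""))                # POST
print(guess_method(method="PUT", data={}))  # PUT
```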
Call this instead of location() if you do not want to “visit” the URL (for instance, you are downloading a file).
When is_async is True, open() returns a Future object (see concurrent.futures for more details), whose result() method returns the response. If any exception is raised while processing the request, it is caught and re-raised when calling result().
For example:
>>> Browser().open('http://google.com', is_async=True).result().text
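The Future behavior described above can be tried without network access using only concurrent.futures. In this sketch, `fake_request` and `submit_async` are hypothetical stand-ins for the real request machinery, and it assumes that the callback's return value becomes the Future's result.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_request(url):
    """Hypothetical stand-in for a blocking HTTP request (no network access)."""
    if not url.startswith("http"):
        raise ValueError("unsupported URL: %s" % url)
    return "response for %s" % url


def submit_async(executor, url, callback=lambda response: response):
    # Run the request in a worker thread, then hand the response to callback.
    # Assumption: whatever callback returns becomes the Future's result.
    return executor.submit(lambda: callback(fake_request(url)))


with ThreadPoolExecutor(max_workers=2) as executor:
    ok = submit_async(executor, "http://example.org/")
    bad = submit_async(executor, "ftp://example.org/")

print(ok.result())  # response for http://example.org/

try:
    bad.result()
except ValueError as exc:
    # The exception raised in the worker thread surfaces here, on result().
    print("re-raised:", exc)
```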
- Parameters
url (str or dict or None) – URL
data – POST data
referrer (str or False or None) – Force referrer. False to disable sending it, None for guessing
is_async (bool) – Process request in a non-blocking way
callback (function) – Callback to be called when request has finished, with response as its first and only argument
- Return type
requests.Response
- raise_for_status(response)[source]¶
Like Response.raise_for_status(), but will use other exception classes if needed.
- build_request(url, *, referrer=None, data_encoding=None, **kwargs)[source]¶
Does the same job as open(), but returns a Request without submitting it. This allows further customization to the Request.
- prepare_request(req)[source]¶
Get a prepared request from a Request object.
This method is meant to be overridden by child classes.
- REFRESH_RE = re.compile('^(?P<sleep>[\\d\\.]+)(;\\s*url=[\\"\']?(?P<url>.*?)[\\"\']?)?$', re.IGNORECASE)¶
- handle_refresh(response)[source]¶
Called by open() to handle the Refresh HTTP header.
It only redirects to the refresh URL if the sleep time is less than REFRESH_MAX.
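As a rough illustration of how a Refresh header could be parsed and filtered, the sketch below reuses the REFRESH_RE pattern shown above. `refresh_target` is a hypothetical helper; it takes the documented "less than" comparison literally (the library's actual comparison may differ, for instance by being inclusive) and assumes that a header without a url= part refreshes the current URL.

```python
import re

# Pattern copied from the REFRESH_RE attribute documented above.
REFRESH_RE = re.compile(
    r"^(?P<sleep>[\d\.]+)(;\s*url=[\"']?(?P<url>.*?)[\"']?)?$",
    re.IGNORECASE,
)


def refresh_target(header, current_url, refresh_max):
    """Return the URL to follow, or None if the header should be ignored.

    Hypothetical helper for illustration only; it is not part of woob.
    """
    match = REFRESH_RE.match(header)
    if match is None:
        return None
    try:
        sleep = float(match.group("sleep"))
    except ValueError:
        # The pattern also accepts strings such as "1.2.3"; ignore those.
        return None
    if sleep >= refresh_max:
        # Strict "less than", as the documentation words it (assumption).
        return None
    # Assumption: without a url= part, the current URL is refreshed.
    return match.group("url") or current_url


print(refresh_target("5; url=http://example.org/next", "http://example.org/", 10.0))
# http://example.org/next
print(refresh_target("5; url=http://example.org/next", "http://example.org/", 0.0))
# None
```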
- get_referrer(oldurl, newurl)[source]¶
Get the referrer to send when doing a request. If we should not send a referrer, it will return None.
Reference: https://en.wikipedia.org/wiki/HTTP_referer
The behavior can be controlled through the ALLOW_REFERRER attribute: True always allows the referrer to be sent, False never does, and None allows it only when the target is within the same domain.
- Parameters
oldurl (str or None) – Current absolute URL
newurl (str) – Target absolute URL
- Return type
str or None
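The three ALLOW_REFERRER settings can be modeled with a small function. This is a sketch under stated assumptions rather than the library's code: `pick_referrer` is a hypothetical name, "same domain" is interpreted here as identical hostnames, and any further checks the library may apply are not shown.

```python
from urllib.parse import urlparse


def pick_referrer(oldurl, newurl, allow_referrer=True):
    """Return the referrer to send, or None to send nothing.

    Hypothetical helper for illustration only; it is not part of woob.
    """
    if oldurl is None or allow_referrer is False:
        # No current URL, or referrers disabled entirely.
        return None
    if allow_referrer is True:
        return oldurl
    # allow_referrer is None: only send the referrer within the same domain.
    # Assumption: "same domain" means the two hostnames are identical.
    same_host = urlparse(oldurl).hostname == urlparse(newurl).hostname
    return oldurl if same_host else None


print(pick_referrer("http://woob.tech/a", "http://woob.tech/b", None))
# http://woob.tech/a
print(pick_referrer("http://woob.tech/a", "http://example.org/", None))
# None
```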
- exception UrlNotAllowed[source]¶
Bases:
Exception
Raised by DomainBrowser when RESTRICT_URL is set and the browser tries to go to a URL that does not match BASEURL.
- class DomainBrowser(*args, **kwargs)[source]¶
Bases:
Browser
A browser that handles relative URLs and can have a base URL (usually a domain).
For instance, self.location('/hello') will get http://woob.tech/hello if BASEURL is 'http://woob.tech/'.
- RESTRICT_URL = False¶
URLs allowed to load. This can be used to force SSL (if the BASEURL is SSL) or to prevent any other leakage. Set it to True to allow only URLs starting with BASEURL. Set it to a list of allowed URLs if you have several. More complex behavior is possible by overriding url_allowed().
- BASEURL = None¶
Base URL, e.g. 'http://woob.tech/' or 'https://woob.tech/'. See absurl().
- url_allowed(url)[source]¶
Checks whether we are allowed to visit a URL. See RESTRICT_URL.
- Parameters
url (str) – Absolute URL
- Return type
bool
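The RESTRICT_URL settings described above (False, True, or a list) might translate into a check like the one below. `is_url_allowed` is a hypothetical function, and treating list entries as prefixes is an assumption; the library may normalize or compare URLs differently.

```python
def is_url_allowed(url, baseurl, restrict_url=False):
    """Model of the RESTRICT_URL rules; not woob's implementation.

    url and baseurl are absolute URL strings. restrict_url is False,
    True, or a list of allowed URLs.
    """
    if restrict_url is False:
        # No restriction at all.
        return True
    if restrict_url is True:
        prefixes = [baseurl]
    else:
        # Assumption: each list entry is treated as an allowed prefix.
        prefixes = list(restrict_url)
    return any(url.startswith(prefix) for prefix in prefixes)


base = "https://woob.tech/"
print(is_url_allowed("http://woob.tech/hello", base))         # True
print(is_url_allowed("http://woob.tech/hello", base, True))   # False
print(is_url_allowed("https://woob.tech/hello", base, True))  # True
```

With BASEURL set to an https URL and RESTRICT_URL set to True, the plain-http URL is rejected, which is the "force SSL" use case mentioned for RESTRICT_URL.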
- absurl(uri, base=None)[source]¶
Get the absolute URL, relative to a base URL. If base is None, it will try to use the current URL. If there is no current URL, it will try to use BASEURL.
If base is False, it will always try to use the current URL. If base is True, it will always try to use BASEURL.
- Parameters
uri (str) – URI to make absolute. It can be already absolute.
base (str or None or False or True) – Base absolute URL.
- Return type
str
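The way the base argument selects what a relative URI is resolved against can be sketched with urllib.parse.urljoin. `absurl_sketch` is a hypothetical function that takes the current URL and BASEURL as explicit arguments; returning the URI unchanged when no base is available is an assumption.

```python
from urllib.parse import urljoin


def absurl_sketch(uri, base=None, current_url=None, baseurl=None):
    """Model of the base-selection rules; not woob's implementation."""
    if base is None:
        # Prefer the current URL, fall back to BASEURL.
        base = current_url if current_url is not None else baseurl
    elif base is False:
        base = current_url
    elif base is True:
        base = baseurl
    if base is None:
        # Assumption: with nothing to resolve against, return uri as-is.
        return uri
    # urljoin leaves an already absolute uri untouched.
    return urljoin(base, uri)


print(absurl_sketch("/hello", baseurl="http://woob.tech/"))
# http://woob.tech/hello
print(absurl_sketch("b", current_url="http://woob.tech/a/c", baseurl="http://woob.tech/"))
# http://woob.tech/a/b
print(absurl_sketch("b", base=True, current_url="http://woob.tech/a/c", baseurl="http://woob.tech/"))
# http://woob.tech/b
```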
- open(req, *args, **kwargs)[source]¶
Like Browser.open(), but handles URLs without domains, using the BASEURL attribute.
- class PagesBrowser(*args, **kwargs)[source]¶
Bases:
DomainBrowser
A browser which works with pages and keeps the state of navigation.
To use it, subclass it and create URL objects as class attributes. When open() or location() is called, if the URL matches one of the URL objects, it returns a Page object. In the case of location(), the page is also stored in self.page.
Example:
>>> from .pages import HTMLPage
>>> class ListPage(HTMLPage):
...     def get_items(self):
...         return [el.attrib['id'] for el in self.doc.xpath('//div[@id="items"]/div')]
...
>>> class ItemPage(HTMLPage):
...     pass
...
>>> class MyBrowser(PagesBrowser):
...     BASEURL = 'http://example.org/'
...     list = URL('list-items', ListPage)
...     item = URL('item/view/(?P<id>\d+)', ItemPage)
...
>>> MyBrowser().list.stay_or_go().get_items()
>>> bool(MyBrowser().list.match('http://example.org/list-items'))
True
>>> bool(MyBrowser().list.match('http://example.org/'))
False
>>> str(MyBrowser().item.build(id=42))
'http://example.org/item/view/42'
You can then use URL instances to go to pages.
- open(*args, **kwargs)[source]¶
Same method as woob.browser.browsers.DomainBrowser.open(), but the response contains an attribute page if the URL matches any URL object.
- location(*args, **kwargs)[source]¶
Same method as woob.browser.browsers.Browser.location(), but if the URL matches any URL object, an attribute page is added to the response, and the attribute PagesBrowser.page is set.
- pagination(func, *args, **kwargs)[source]¶
This helper function can be used to handle paginated pages easily.
When the called function raises a NextPage exception, the browser goes to the requested page and calls the function again. The NextPage constructor can take a URL or a Request object.
>>> from .pages import HTMLPage
>>> class Page(HTMLPage):
...     def iter_values(self):
...         for el in self.doc.xpath('//li'):
...             yield el.text
...         for next in self.doc.xpath('//a'):
...             raise NextPage(next.attrib['href'])
...
>>> class Browser(PagesBrowser):
...     BASEURL = 'https://woob.tech'
...     list = URL('/tests/list-(?P<pagenum>\d+).html', Page)
...
>>> b = Browser()
>>> b.list.go(pagenum=1)
<woob.browser.browsers.Page object at 0x...>
>>> list(b.pagination(lambda: b.page.iter_values()))
['One', 'Two', 'Three', 'Four']
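The control flow behind pagination() can be reproduced in a self-contained model with no network access. Everything here is a stand-in written for illustration (the `NextPage` class, the `PAGES` dictionary and `PagedBrowser` are hypothetical); it shows the exception-driven loop, not the library's actual code.

```python
class NextPage(Exception):
    """Stand-in for the NextPage exception; carries the next page's key."""

    def __init__(self, request):
        super().__init__(request)
        self.request = request


# Hypothetical in-memory "site": each page has items and an optional next key.
PAGES = {
    "list-1": (["One", "Two"], "list-2"),
    "list-2": (["Three", "Four"], None),
}


class PagedBrowser:
    def __init__(self):
        self.page = "list-1"

    def location(self, key):
        # "Visit" another page by remembering its key.
        self.page = key

    def iter_values(self):
        items, next_key = PAGES[self.page]
        yield from items
        if next_key is not None:
            # Signal that more results live on another page.
            raise NextPage(next_key)

    def pagination(self, func, *args, **kwargs):
        while True:
            try:
                yield from func(*args, **kwargs)
            except NextPage as exc:
                # Go to the wanted page, then call func again.
                self.location(exc.request)
            else:
                # func finished without asking for another page.
                return


b = PagedBrowser()
values = list(b.pagination(b.iter_values))
print(values)  # ['One', 'Two', 'Three', 'Four']
```

Raising an exception from inside the generator lets the page code stay a plain iterator while the browser decides how to reach the next page.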
- need_login(func)[source]¶
Decorator that requires the browser to be logged in before the decorated function is called.
This decorator can be used on any method whose first argument is a browser (typically a LoginBrowser). It checks for the logged attribute in the current browser’s page: when this attribute is set to True (e.g., when the page inherits LoggedPage), nothing special happens.
In all other cases (when the browser isn’t on any defined page or when the page’s logged attribute is False), the LoginBrowser.do_login() method of the browser is called before calling the decorated function.
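A decorator with the behavior described above could look roughly like this. It is a sketch, not the library's source: the `LoggedPage` and `SessionBrowser` classes are hypothetical stand-ins, and reading the attributes with getattr is a choice made for this example.

```python
from functools import wraps


def need_login(func):
    """Call browser.do_login() first unless the current page is logged."""

    @wraps(func)
    def inner(browser, *args, **kwargs):
        page = getattr(browser, "page", None)
        # Not on any page, or the page is not marked as logged: log in first.
        if page is None or not getattr(page, "logged", False):
            browser.do_login()
        return func(browser, *args, **kwargs)

    return inner


class LoggedPage:
    """Stand-in for a page that is only reachable once logged in."""
    logged = True


class SessionBrowser:
    """Hypothetical browser that counts how often it has to log in."""

    def __init__(self):
        self.page = None
        self.login_calls = 0

    def do_login(self):
        self.login_calls += 1
        self.page = LoggedPage()

    @need_login
    def get_accounts(self):
        return ["account"]


b = SessionBrowser()
b.get_accounts()
b.get_accounts()
print(b.login_calls)  # 1
```

The second call finds a page whose logged attribute is True, so do_login() is not called again.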
- class LoginBrowser(*args, **kwargs)[source]¶
Bases:
PagesBrowser
A browser which supports login.
- class StatesMixin[source]¶
Bases:
object
Mixin to store states of browser.
- STATE_DURATION = None¶
Duration in minutes, used to set the expiration datetime of the state.
- class APIBrowser(*args, **kwargs)[source]¶
Bases:
DomainBrowser
A browser for API websites.
- build_request(*args, **kwargs)[source]¶
Does the same job as open(), but returns a Request without submitting it. This allows further customization to the Request.
- class AbstractBrowser[source]¶
Bases:
object
Deprecated since version 3.4: Don’t use this class; import woob_modules.other_module.etc instead.
- class OAuth2Mixin(*args, **kwargs)[source]¶
Bases:
StatesMixin
- AUTHORIZATION_URI = None¶
- ACCESS_TOKEN_URI = None¶
- SCOPE = ''¶
- client_id = None¶
- client_secret = None¶
- redirect_uri = None¶
- access_token = None¶
- access_token_expire = None¶
- auth_uri = None¶
- token_type = None¶
- refresh_token = None¶
- oauth_state = None¶
- authorized_date = None¶
- property logged¶