`woob.browser.pages`¶

pagination(func)[source]¶

This helper decorator makes it easy to handle paginated pages.

When the decorated function raises a NextPage exception, the decorator navigates to the requested page and calls the function again.

The NextPage constructor accepts either a URL or a Request object.

>>> class Page(HTMLPage):
...     @pagination
...     def iter_values(self):
...         for el in self.doc.xpath('//li'):
...             yield el.text
...         for next in self.doc.xpath('//a'):
...             raise NextPage(next.attrib['href'])
...
>>> from .browsers import PagesBrowser
>>> from .url import URL
>>> class Browser(PagesBrowser):
...     BASEURL = 'https://woob.tech'
...     list = URL(r'/tests/list-(?P<pagenum>\d+).html', Page)
...
>>> b = Browser()
>>> b.list.go(pagenum=1) 
<woob.browser.pages.Page object at 0x...>
>>> list(b.page.iter_values())
['One', 'Two', 'Three', 'Four']
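
The retry loop described above can be sketched without woob at all. The following is a minimal standalone illustration, not woob's implementation: `FAKE_SITE`, `FakeBrowser`, `FakePage` and the local `NextPage` and `pagination` definitions are all hypothetical stand-ins that only mimic the behaviour described in this section.

```python
import functools


class NextPage(Exception):
    """Stand-in for woob's NextPage; carries the URL of the next page."""

    def __init__(self, request):
        super().__init__(request)
        self.request = request


# Hypothetical in-memory "site": URL -> (items, next URL or None).
FAKE_SITE = {
    "/list-1": (["One", "Two"], "/list-2"),
    "/list-2": (["Three", "Four"], None),
}


class FakeBrowser:
    """Minimal stand-in for a browser that tracks the current page."""

    def __init__(self):
        self.page = None

    def location(self, url):
        self.page = FakePage(self, url)
        return self.page


def pagination(func):
    """Re-run `func` on each new page until it finishes without NextPage."""

    @functools.wraps(func)
    def wrapper(page, *args, **kwargs):
        while True:
            try:
                yield from func(page, *args, **kwargs)
            except NextPage as exc:
                # Move to the requested page, then call the method again on it.
                page = page.browser.location(exc.request)
            else:
                return

    return wrapper


class FakePage:
    def __init__(self, browser, url):
        self.browser = browser
        self.items, self.next_url = FAKE_SITE[url]

    @pagination
    def iter_values(self):
        yield from self.items
        if self.next_url is not None:
            raise NextPage(self.next_url)


browser = FakeBrowser()
first = browser.location("/list-1")
values = list(first.iter_values())
```

The caller iterates a single generator and never sees the page boundaries, which is the point of the decorator.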

exception NextPage(request)[source]¶

Bases: Exception

Exception raised, for example from a Page method, to tell PagesBrowser.pagination() to move on to the next page.

See PagesBrowser.pagination() or decorator pagination().

class Page(browser, response, params=None, encoding=None)[source]¶

Bases: object

Represents a page.

The encoding can be forced by setting the class-wide ENCODING attribute, or by passing an encoding keyword argument, which overrides ENCODING. It can also be changed manually by assigning a new value to the encoding instance attribute. The response content is available as a str in text, decoded with the specified encoding.

Parameters:

browser (woob.browser.browsers.Browser) – browser used to reach the page
response (Response) – response object
params (dict) – optional dictionary containing parameters given to the page (see woob.browser.url.URL) (default: None)
encoding (str) – optional parameter to force the encoding of the page, overrides ENCODING (default: None)

ENCODING: ClassVar[str | None] = None¶: Force a page encoding. It is recommended to use None for autodetection.
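
The precedence between the keyword argument, the class attribute and a later manual assignment can be illustrated with a small standalone sketch. `SketchPage` and its `detected` parameter are hypothetical and only stand in for the behaviour described above; this is not woob's `Page`.

```python
class SketchPage:
    """Hypothetical stand-in, not woob's Page."""

    ENCODING = None  # class-wide default; None means "detect"

    def __init__(self, content, encoding=None, detected="utf-8"):
        self.content = content
        # `detected` stands in for whatever autodetection would report.
        # The keyword argument wins over the class attribute.
        self.encoding = encoding or self.ENCODING or detected

    @property
    def text(self):
        return self.content.decode(self.encoding)


class Latin1Page(SketchPage):
    ENCODING = "latin-1"


raw = "café".encode("latin-1")
forced_by_class = Latin1Page(raw)
forced_by_kwarg = SketchPage(raw, encoding="latin-1")
reassigned = SketchPage(raw)
reassigned.encoding = "latin-1"  # manual change after construction
```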

is_here: None | bool | _Filter | Callable | str = None¶

The condition to verify that the page corresponds to the response.

This allows having different pages on equivalent or conflicting URL patterns identified using the response’s method, URL, headers, or content, by defining is_here on pages associated with such patterns.

This property can be defined as:

None or True, to signify that the page should be matched regardless of the response.
False, to signify that the page should not be matched regardless of the response.
A filter returning a falsy or non-falsy object, evaluated with the constructed document for the page.
A method returning a falsy or non-falsy object, evaluated with the page object directly.
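
As a rough illustration of how such a condition might be evaluated, here is a standalone sketch covering only the None, bool and callable cases; filters and strings are left out. The names `matches`, `FakePage` and `looks_like_login` are hypothetical and not part of woob.

```python
def matches(is_here, page):
    """Return True when `page` should be kept for the current response."""
    if is_here is None or is_here is True:
        return True
    if is_here is False:
        return False
    if callable(is_here):
        # Method-style condition: evaluated with the page object.
        return bool(is_here(page))
    raise TypeError(f"unsupported is_here value: {is_here!r}")


class FakePage:
    """Hypothetical page holding already-decoded text."""

    def __init__(self, text):
        self.text = text


def looks_like_login(page):
    return "password" in page.text


login = FakePage("<input name='password'>")
home = FakePage("<h1>Welcome</h1>")
```

With two pages bound to the same URL pattern, a condition like `looks_like_login` is what would let the browser tell them apart from the response content.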

logged: bool = False¶: If True, the page is in a restricted area of the website. Useful with LoginBrowser and the need_login() decorator.

property encoding: str | None¶

property content: bytes¶: Raw content from response.

property text: str¶: Content of the response, in str, decoded with encoding.

property data: Any¶: Data passed to build_doc().

on_load()[source]¶: Event called when browser loads this page.

on_leave()[source]¶: Event called when browser leaves this page.

build_doc(content)[source]¶

Abstract method to be implemented by subclasses to build structured data (HTML, JSON, CSV…) from the data property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.

Return type:: Any

detect_encoding()[source]¶

Override this method to implement detection of a document-level encoding declaration, if any (e.g. HTML5’s <meta charset="some-charset">).

Return type:: None | str

normalize_encoding(encoding)[source]¶

Make sure we can easily compare encodings by formatting them the same way.

Return type:: str | None
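
One way to make encoding names comparable is to map them to Python's canonical codec names. The sketch below assumes that approach; it is not necessarily what woob's method does, and the fallback for unknown names is an arbitrary choice.

```python
import codecs


def normalize_encoding(encoding):
    """Return a canonical codec name, or None when there is nothing to normalize."""
    if encoding is None:
        return None
    try:
        return codecs.lookup(encoding.strip()).name
    except LookupError:
        # Unknown codec: fall back to a simple lower-cased form.
        return encoding.strip().lower()
```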

absurl(url)[source]¶

Get an absolute URL from a partial URL, relative to the page URL.

Return type:: str
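
The resolution rules involved are those of the standard library's `urllib.parse.urljoin`, shown here on its own. The page URL and the partial URLs are made-up examples, and whether `absurl()` delegates to `urljoin` is an assumption.

```python
from urllib.parse import urljoin

page_url = "https://woob.tech/tests/list-1.html"

relative = urljoin(page_url, "list-2.html")              # sibling document
rooted = urljoin(page_url, "/about")                     # path from the site root
already_absolute = urljoin(page_url, "https://example.org/x")  # left untouched
```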

exception FormNotFound[source]¶

Bases: Exception

Raised when HTMLPage.get_form() can’t find a form.

exception FormSubmitWarning[source]¶

Bases: UserWarning

A form has more than one submit element selected, and will likely generate an invalid request.

class Form(page, el, submit_el=None)[source]¶

Bases: OrderedDict

Represents a form of an HTML page.

It is used as a dict with pre-filled values from HTML. You can set new values as strings by setting an item value.

It is recommended not to instantiate this class yourself; call HTMLPage.get_form() instead.

Parameters:

page (Page) – the page where the form is located
el (_Element) – the form element on the page
submit_el (_Element | None) – allows you to consider only one submit button (which is what browsers do). If set to None, all of them are taken, and if set to False, none are taken (default: None)

property request: Request¶: Get the Request object from the form.

submit(**kwargs)[source]¶

Submit the form and have the browser navigate to the resulting page.

Parameters:: data_encoding (str) – force encoding used to submit form data (defaults to the current page encoding)
Return type:: Response
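
The dict-like usage can be pictured with a small standalone sketch. `SketchForm` is a hypothetical stand-in built on `OrderedDict`, with made-up field names; it builds a URL-encoded body instead of sending a real request.

```python
from collections import OrderedDict
from urllib.parse import urlencode


class SketchForm(OrderedDict):
    """Hypothetical stand-in: field name -> value, in document order."""

    def __init__(self, method, url, fields):
        super().__init__(fields)
        self.method = method
        self.url = url

    def encoded(self):
        # Body as it would be sent for application/x-www-form-urlencoded.
        return urlencode(self)


form = SketchForm("POST", "/login", [("user", ""), ("remember", "1")])
form["user"] = "alice"        # overwrite a pre-filled value
form["password"] = "s3cret"   # add a field
body = form.encoded()
```

Keeping the fields ordered means the submitted body follows the order of the inputs in the page, with new fields appended at the end.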

class CsvPage(browser, response, params=None, encoding=None)[source]¶

Bases: Page

Page which parses CSV files.

DIALECT: ClassVar[str] = 'excel'¶: Dialect given to the csv module.

FMTPARAMS: ClassVar[dict] = {}¶: Parameters given to the csv module.

ENCODING: ClassVar[str | None] = 'utf-8'¶: Encoding of the file.

NEWLINES_HACK: ClassVar[bool] = True¶: Convert all non-Unix newlines to Unix ones.

HEADER: ClassVar[int | None] = None¶: If not None, the row at this index is treated as a header, and the rows are then also available as dictionaries.

build_doc(content)[source]¶

Abstract method to be implemented by subclasses to build structured data (HTML, JSON, CSV…) from the data property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.

Return type:: list

parse(data, encoding=None)[source]¶

Method called by the constructor of CsvPage to parse the document.

Parameters:

data (BytesIO) – file stream
encoding (str) – if given, use it to decode cell strings (default: None)

Return type:

list

decode_row(row, encoding)[source]¶

Method called by CsvPage.parse() to decode a row using the given encoding.

Return type:: list
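
The roles of the dialect, format parameters, newline conversion and header index can be sketched with the standard `csv` module. `parse_csv` and its parameters are hypothetical and only mirror the class attributes described above; returning dictionaries only is a simplification of the sketch, and it is not woob's implementation.

```python
import csv
import io


def parse_csv(data, encoding="utf-8", dialect="excel", header=None, **fmtparams):
    """Return rows as lists, or as dicts when `header` gives the header row index."""
    text = data.read().decode(encoding)
    # Normalize "\r\n" and lone "\r" to "\n" before handing the text to csv.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    rows = list(csv.reader(io.StringIO(text, newline=""), dialect, **fmtparams))
    if header is None:
        return rows
    names = rows[header]
    return [dict(zip(names, row)) for row in rows[header + 1:]]


# Made-up sample mixing "\r\n", "\r" and "\n" line endings.
raw = io.BytesIO("name;amount\r\ncaf\u00e9;3\rtea;5\n".encode("utf-8"))
rows = parse_csv(raw, header=0, delimiter=";")
```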

class JsonPage(browser, response, params=None, encoding=None)[source]¶

Bases: Page

JSON page.

Notes on the JSON format: JSON must be UTF-8 encoded when used for open systems interchange (https://tools.ietf.org/html/rfc8259), so it is safe to assume that all JSON is UTF-8. One subtlety is that JSON Unicode surrogate escape sequences (used for characters > U+FFFF) are UTF-16 style, but libraries should handle that (some do not, even though JSON is one of the simplest formats around).

ENCODING: ClassVar[str | None] = 'utf-8-sig'¶: Force a page encoding. It is recommended to use None for autodetection.
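
Both points, the `utf-8-sig` default and the UTF-16 style escapes, can be checked with the standard library alone. The payload below is a made-up example.

```python
import json

# Payload with a leading byte order mark and an escaped surrogate pair.
raw = b'\xef\xbb\xbf{"emoji": "\\ud83d\\ude00", "name": "caf\xc3\xa9"}'

text = raw.decode("utf-8-sig")   # strips the BOM if present
doc = json.loads(text)           # joins the surrogate pair into one character
```

Decoding with plain `utf-8` would leave the byte order mark at the start of the text, which `json.loads` rejects; `utf-8-sig` accepts input with or without it.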

property data: str¶: Data passed to build_doc().

get(path, default=None)[source]¶

Return type:: Any

path(path, context=None)[source]¶

Return type:: Iterator

build_doc(text)[source]¶

Abstract method to be implemented by subclasses to build structured data (HTML, JSON, CSV…) from the data property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.

Return type:: dict | list

class XLSPage(browser, response, params=None, encoding=None)[source]¶

Bases: Page

XLS Page.

HEADER = None¶: If not None, the row at this index is treated as a header.

SHEET_INDEX = 0¶: Specify the index of the worksheet to use.

build_doc(content)[source]¶

Abstract method to be implemented by subclasses to build structured data (HTML, JSON, CSV…) from the data property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.

Return type:: list

parse(data)[source]¶

Method called by the constructor of XLSPage to parse the document.

Return type:: list

class XMLPage(browser, response, params=None, encoding=None)[source]¶

Bases: Page

XML Page.

detect_encoding()[source]¶

Override this method to implement detection of a document-level encoding declaration, if any (e.g. HTML5’s <meta charset="some-charset">).

Return type:: str | None

build_doc(content)[source]¶

Abstract method to be implemented by subclasses to build structured data (HTML, JSON, CSV…) from the data property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.

Return type:: _Element

class RawPage(browser, response, params=None, encoding=None)[source]¶

Bases: Page

Raw page where the “doc” attribute is the content string.

build_doc(content)[source]¶

Abstract method to be implemented by subclasses to build structured data (HTML, JSON, CSV…) from the data property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.

Return type:: bytes

class HTMLPage(*args, **kwargs)[source]¶

Bases: Page

HTML page.

Parameters:

browser (woob.browser.browsers.Browser) – browser used to reach the page
response (Response) – response object
params (dict) – optional dictionary containing parameters given to the page (see woob.browser.url.URL)
encoding (str) – optional parameter to force the encoding of the page

FORM_CLASS¶

The class to instantiate when using HTMLPage.get_form(). Defaults to Form.

alias of Form

REFRESH_MAX: ClassVar[int | None] = None¶

When handling a “Refresh” meta header, the page considers it only if the sleep time is less than this value.

The default value is None, which means refreshes aren’t handled.
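
The decision rule can be sketched as a small standalone function. `REFRESH_RE` and `refresh_target` are hypothetical names, and the regular expression only covers the common `delay; url=target` form of the meta content; it is not woob's parser.

```python
import re

# Matches contents such as "5; url=/next" or "0;URL='https://example.org/'".
REFRESH_RE = re.compile(r"^\s*(\d+)\s*(?:;\s*url\s*=\s*['\"]?([^'\"]*)['\"]?)?\s*$", re.I)


def refresh_target(content, refresh_max):
    """Return the URL to follow, or None when the refresh should be ignored."""
    if refresh_max is None:
        return None  # refreshes are not handled at all
    match = REFRESH_RE.match(content)
    if match is None or match.group(2) is None:
        return None
    delay = int(match.group(1))
    # Follow the refresh only when the delay is strictly below the limit.
    return match.group(2) if delay < refresh_max else None
```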

REFRESH_XPATH: ClassVar[str] = '//head//meta[lower-case(@http-equiv)="refresh"]'¶: Default XPath, which is also the most common; override it if needed.

ABSOLUTE_LINKS: ClassVar[bool] = False¶: Make links URLs absolute.

on_load()[source]¶: Event called when browser loads this page.

handle_refresh()[source]¶

classmethod setup_xpath_functions()[source]¶

classmethod define_xpath_functions(ns)[source]¶

Define XPath functions on the given lxml function namespace.

This method is called in the constructor of HTMLPage and can be overridden by child classes to add extra functions.

build_doc(content)[source]¶

Method to build the lxml document from response and given encoding.

Return type:: _ElementTree

detect_encoding()[source]¶

Look for encoding in the document “http-equiv” and “charset” meta nodes.

Return type:: str
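
A lookup of this kind can be sketched with the standard library's `html.parser`. `MetaCharsetFinder` and this `detect_encoding` function are hypothetical; woob works on an lxml document, so the sketch only illustrates which meta nodes carry the declaration.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the first encoding declared by a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.encoding = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.encoding is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            self.encoding = attrs["charset"]
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # e.g. content="text/html; charset=iso-8859-1"
            content = attrs.get("content") or ""
            _, sep, charset = content.lower().partition("charset=")
            if sep:
                self.encoding = charset.strip() or None


def detect_encoding(content):
    finder = MetaCharsetFinder()
    # Declarations are ASCII-compatible, so a lossy decode is enough to find them.
    finder.feed(content.decode("ascii", errors="replace"))
    return finder.encoding
```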

get_form(xpath='//form', name=None, id=None, nr=None, submit=None)[source]¶

Get a Form object from a selector. The form will be analyzed and its parameters extracted. If there is more than one “submit” input, only one of them should be chosen to generate the request.

Parameters:

xpath (str) – xpath string to select forms (default: "//form")
name (str) – if supplied, select a form with the given name (default: None)
nr (int) – if supplied, take the (nr+1)-th selected form (default: None)
submit (str) – if supplied, xpath string to select the submit element from the form (default: None)

Return type:

Form

Raises:

FormNotFound if no form is found
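
The selection logic can be approximated on a well-formed document with the standard library. `find_form`, `form_values`, the local `FormNotFound` and the sample markup are hypothetical; the real method takes XPath strings and works on an lxml tree.

```python
import xml.etree.ElementTree as ET


class FormNotFound(Exception):
    """Stand-in for the exception of the same name."""


def find_form(root, name=None, nr=None):
    """Select a <form> element by name and/or position among the matches."""
    forms = [f for f in root.iter("form") if name is None or f.get("name") == name]
    index = 0 if nr is None else nr
    try:
        return forms[index]
    except IndexError:
        raise FormNotFound(f"no form for name={name!r}, nr={nr!r}") from None


def form_values(form, submit_name=None):
    """Collect input values, keeping at most the chosen submit input."""
    values = {}
    for el in form.iter("input"):
        if el.get("type") == "submit" and el.get("name") != submit_name:
            continue  # only the selected submit button is sent
        if el.get("name"):
            values[el.get("name")] = el.get("value", "")
    return values


DOC = ET.fromstring(
    "<body>"
    "<form name='search'><input name='q' value=''/></form>"
    "<form name='login'>"
    "<input name='user' value='alice'/>"
    "<input type='submit' name='ok' value='Log in'/>"
    "<input type='submit' name='cancel' value='Cancel'/>"
    "</form>"
    "</body>"
)
```

Sending both submit inputs would describe a click on two buttons at once, which is why only the selected one is kept.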

class PartialHTMLPage(*args, **kwargs)[source]¶

Bases: HTMLPage

HTML page for broken pages with multiple roots.

This class should typically be used for requests which return only a part of a full document, to insert in another document. Such a sub-document can have multiple root tags, so this class is required in this case.

build_doc(content)[source]¶

Method to build the lxml document from response and given encoding.

Return type:: _ElementTree
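
Why a multi-root fragment needs special handling can be seen with a strict XML parser from the standard library. Wrapping the fragment in a synthetic root, as below, is one common workaround; the element name `fragment-root` is made up, and this is not necessarily how PartialHTMLPage builds its document.

```python
import xml.etree.ElementTree as ET

fragment = "<li>One</li><li>Two</li>"

# Parsing the fragment directly fails: a document may only have one root.
try:
    ET.fromstring(fragment)
    parsed_directly = True
except ET.ParseError:
    parsed_directly = False

# Wrapping it in a synthetic root (hypothetical name) makes it parseable.
wrapper = ET.fromstring(f"<fragment-root>{fragment}</fragment-root>")
items = [li.text for li in wrapper]
```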

class GWTPage(browser, response, params=None, encoding=None)[source]¶

Bases: Page

GWT page where the “doc” attribute is a list.

More info about the GWT protocol here: https://goo.gl/GP5dv9

build_doc(content)[source]¶

The response starts with “//” followed by “OK” or “EX”. The last two elements in the list are the protocol and the flag. The list needs to be read in reverse order.

Return type:: list
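
The steps described above can be sketched as follows. This is a rough illustration under strong assumptions: `parse_gwt` is a hypothetical function, the sample payload is made up, and it is treated as plain JSON, which real GWT responses may not always be.

```python
import json


def parse_gwt(content):
    """Return (status, values) from a GWT-RPC style response body."""
    text = content.decode("utf-8")
    if not text.startswith("//"):
        raise ValueError("not a GWT response")
    status, payload = text[2:4], text[4:]
    if status not in ("OK", "EX"):
        raise ValueError(f"unexpected status {status!r}")
    items = json.loads(payload)
    # Drop the two trailing elements, then read the rest back to front.
    return status, list(reversed(items[:-2]))


status, values = parse_gwt(b'//OK[3,2,1,["a","b"],0,7]')
```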

get_date(data)[source]¶

Get date from string

Return type:: str

get_elements(type='String')[source]¶

Get elements of specified type

Return type:: list

class PDFPage(browser, response, params=None, encoding=None)[source]¶

Bases: Page

Parse a PDF and write raw data in the “doc” attribute as a string.

build_doc(content)[source]¶

Abstract method to be implemented by subclasses to build structured data (HTML, JSON, CSV…) from the data property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.

Return type:: bytes

class LoggedPage[source]¶

Bases: object

A page that only logged users can reach. If we did not get a redirection for this page, we are sure that the login is still active.

Do not use this class for pages with mixed content (logged/anonymous) or for pages with a login form.

logged: bool = True¶
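
Since LoggedPage derives from object and only sets logged, it reads as a mixin to combine with a concrete page class. The sketch below uses hypothetical stand-in classes to show how the attribute is resolved; listing the mixin first in the bases is an assumption that follows from Python's method resolution order.

```python
class Page:
    """Stand-in base class: pages are anonymous by default."""
    logged = False


class LoggedPage:
    """Stand-in mixin flagging a page as reachable only when logged in."""
    logged = True


class AccountsPage(LoggedPage, Page):
    """Hypothetical page behind the login: the mixin comes first in the bases."""


class LoginFormPage(Page):
    """Hypothetical page with a login form: deliberately not a LoggedPage."""
```

If the bases were reversed, `Page.logged` would be found first and the page would be treated as anonymous.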

exception AbstractPageError[source]¶: Bases: Exception

class MetaPage(name, bases, dct)[source]¶: Bases: type

class AbstractPage[source]¶: Bases: object

Deprecated since version 3.4: Don’t use this class; import woob_modules.other_module.etc instead.

class LoginPage[source]¶

Bases: object

on_load()[source]¶
