woob.browser.pages
- pagination(func)[source]¶
This helper decorator can be used to handle paginated pages easily.
When the called function raises a NextPage exception, it goes to the wanted page and calls the function again.
The NextPage constructor can take a URL or a Request object.
>>> class Page(HTMLPage):
...     @pagination
...     def iter_values(self):
...         for el in self.doc.xpath('//li'):
...             yield el.text
...         for next in self.doc.xpath('//a'):
...             raise NextPage(next.attrib['href'])
...
>>> from .browsers import PagesBrowser
>>> from .url import URL
>>> class Browser(PagesBrowser):
...     BASEURL = 'https://woob.tech'
...     list = URL('/tests/list-(?P<pagenum>\d+).html', Page)
...
>>> b = Browser()
>>> b.list.go(pagenum=1)
<woob.browser.pages.Page object at 0x...>
>>> list(b.page.iter_values())
['One', 'Two', 'Three', 'Four']
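The retry loop behind this decorator can be sketched with the standard library alone. The following is a simplified stand-in, not woob's implementation: `FAKE_SITE`, `SketchPage`, `SketchNextPage`, `sketch_pagination` and `ListPage` are hypothetical names, and "going to a page" is reduced to a dictionary lookup.

```python
import functools

# Hypothetical in-memory "site": URL -> (items on the page, next URL or None).
FAKE_SITE = {
    "/list-1": (["One", "Two"], "/list-2"),
    "/list-2": (["Three", "Four"], None),
}


class SketchNextPage(Exception):
    """Stand-in for NextPage: carries the URL of the page to visit next."""

    def __init__(self, url):
        super().__init__(url)
        self.url = url


class SketchPage:
    """Stand-in page holding the items and next link of one fake URL."""

    def __init__(self, url):
        self.items, self.next_url = FAKE_SITE[url]


def sketch_pagination(func):
    """Re-run a generator method on each following page until none is requested."""

    @functools.wraps(func)
    def wrapper(page, *args, **kwargs):
        while True:
            try:
                yield from func(page, *args, **kwargs)
            except SketchNextPage as exc:
                # "Go" to the requested page, then call the function again.
                page = type(page)(exc.url)
            else:
                # The function finished without asking for another page.
                return

    return wrapper


class ListPage(SketchPage):
    @sketch_pagination
    def iter_values(self):
        yield from self.items
        if self.next_url is not None:
            raise SketchNextPage(self.next_url)


values = list(ListPage("/list-1").iter_values())
print(values)
```

The caller sees a single iterator over all pages; the exception is only a signal between the page method and the decorator.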
- exception NextPage(request)[source]¶
Bases:
Exception
Exception used, for example, in a Page to tell PagesBrowser.pagination to go to the next page.
See PagesBrowser.pagination() or the pagination() decorator.
- class Page(browser, response, params=None, encoding=None)[source]¶
Bases:
object
Represents a page.
Encoding can be forced by setting the ENCODING class-wide attribute, or by passing an encoding keyword argument, which overrides ENCODING. Finally, it can be manually changed by assigning a new value to the encoding instance attribute. A unicode version of the response content is accessible in text, decoded with the specified encoding.
- Parameters:
browser (woob.browser.browsers.Browser) – browser used to go on the page
response (Response) – response object
params (dict) – optional dictionary containing parameters given to the page (see woob.browser.url.URL) (default: None)
encoding (str) – optional parameter to force the encoding of the page, overrides ENCODING (default: None)
- ENCODING: ClassVar[str | None] = None¶
Force a page encoding. It is recommended to use None for autodetection.
- is_here: None | bool | _Filter | Callable | str = None¶
The condition to verify that the page corresponds to the response.
This allows having different pages on equivalent or conflicting URL patterns identified using the response’s method, URL, headers, or content, by defining is_here on pages associated with such patterns.
This property can be defined as:
None or True, to signify that the page should be matched regardless of the response.
False, to signify that the page should not be matched regardless of the response.
A filter returning a falsy or non-falsy object, evaluated with the constructed document for the page.
A method returning a falsy or non-falsy object, evaluated with the page object directly.
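To make the matching rules above concrete, here is a minimal standard-library sketch of how such a condition could be evaluated when several pages share a URL pattern. It is an illustration under assumptions, not woob's code: `FakeResponse`, `SketchPage`, `page_matches` and `pick_page` are hypothetical names, the "document" is just the response text, and only the None/True, False and method forms are covered.

```python
class FakeResponse:
    """Hypothetical response object carrying only a body."""

    def __init__(self, text):
        self.text = text


class SketchPage:
    # None means "always matches", as in the list above.
    is_here = None

    def __init__(self, response):
        self.response = response
        self.doc = response.text  # the "document" is the raw text here


def page_matches(page):
    """Evaluate the is_here condition of an already constructed page."""
    condition = page.is_here
    if condition is None or condition is True:
        return True
    if condition is False:
        return False
    if callable(condition):
        # Method form: evaluated with the page object itself.
        return bool(condition())
    raise TypeError("unsupported is_here value in this sketch: %r" % (condition,))


class LoginPage(SketchPage):
    def is_here(self):
        return "Please log in" in self.doc


class AccountPage(SketchPage):
    def is_here(self):
        return "Your accounts" in self.doc


class FallbackPage(SketchPage):
    pass  # inherits is_here = None, so it matches any response


def pick_page(candidates, response):
    """Return the first candidate page whose condition accepts the response."""
    for cls in candidates:
        page = cls(response)
        if page_matches(page):
            return page
    return None


page = pick_page(
    [LoginPage, AccountPage, FallbackPage],
    FakeResponse("<h1>Your accounts</h1>"),
)
print(type(page).__name__)
```

Ordering matters in this sketch: the unconditional fallback is listed last so that the more specific pages get a chance to match first.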
- logged: bool = False¶
If True, the page is in a restricted area of the website. Useful with LoginBrowser and the need_login() decorator.
- property data: Any¶
Data passed to build_doc().
- build_doc(content)[source]¶
Abstract method to be implemented by subclasses to build structured data (HTML, JSON, CSV…) from the data property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.
- Return type:
- detect_encoding()[source]¶
Override this method to implement detection of a document-level encoding declaration, if any (e.g. HTML5's <meta charset="some-charset">).
- exception FormNotFound[source]¶
Bases:
Exception
Raised when HTMLPage.get_form() can't find a form.
- exception FormSubmitWarning[source]¶
Bases:
UserWarning
A form has more than one submit element selected, and will likely generate an invalid request.
- class Form(page, el, submit_el=None)[source]¶
Bases:
OrderedDict
Represents a form of an HTML page.
It is used as a dict with pre-filled values from HTML. You can set new values as strings by setting an item value.
It is recommended not to use this class directly, but to call HTMLPage.get_form() instead.
- Parameters:
- class CsvPage(browser, response, params=None, encoding=None)[source]¶
Bases:
Page
Page which parses CSV files.
- HEADER: ClassVar[int | None] = None¶
If not None, will consider the line represented by this index as a header. This means the rows will also be available as dictionaries.
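The effect of a header index can be illustrated with the standard `csv` module. This is a sketch of the idea only, not CsvPage's parsing code: `parse_csv` and `SAMPLE` are hypothetical, and returning dictionaries only for the rows that follow the header line is an assumption of this example.

```python
import csv
import io

SAMPLE = (
    "date,label,amount\n"
    "2024-01-02,Coffee,-3.50\n"
    "2024-01-03,Salary,2000.00\n"
)


def parse_csv(text, header=None):
    """Parse CSV text; with a header index, key the following rows by column name."""
    rows = list(csv.reader(io.StringIO(text)))
    if header is None:
        # No header: every line is returned as a plain list of cells.
        return rows
    names = rows[header]
    return [dict(zip(names, row)) for row in rows[header + 1:]]


plain = parse_csv(SAMPLE)
keyed = parse_csv(SAMPLE, header=0)
print(plain[1])
print(keyed[0]["label"])
```

With `header=None` the caller has to index cells by position; with `header=0` the same cells become addressable by column name.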
- build_doc(content)[source]¶
Abstract method to be implemented by subclasses to build structured data (HTML, JSON, CSV…) from the data property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.
- Return type:
- parse(data, encoding=None)[source]¶
Method called by the constructor of CsvPage to parse the document.
- decode_row(row, encoding)[source]¶
Method called by CsvPage.parse() to decode a row using the given encoding.
- Return type:
- class JsonPage(browser, response, params=None, encoding=None)[source]¶
Bases:
Page
Json Page.
Notes on JSON format: JSON must be UTF-8 encoded when used for open systems interchange (https://tools.ietf.org/html/rfc8259), so all JSON can safely be assumed to be UTF-8. A little subtlety is that JSON Unicode surrogate escape sequences (used for characters > U+FFFF) are UTF-16 style, but that should be handled by libraries (some don't… even if JSON is one of the simplest formats around…).
- ENCODING: ClassVar[str | None] = 'utf-8-sig'¶
Force a page encoding. It is recommended to use None for autodetection.
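Both points, the `'utf-8-sig'` default and the UTF-16 style escapes, can be checked with Python's standard `json` module. The payload below is made up for the example; it starts with a UTF-8 byte order mark and contains an escaped surrogate pair.

```python
import json

# Made-up payload: UTF-8 BOM, an escaped surrogate pair (U+1F600),
# and a non-ASCII character encoded directly as UTF-8 bytes.
raw = b'\xef\xbb\xbf{"emoji": "\\ud83d\\ude00", "name": "caf\xc3\xa9"}'

# Plain "utf-8" keeps the BOM as a leading U+FEFF character...
with_bom = raw.decode("utf-8")

# ...whereas "utf-8-sig" strips it when present and otherwise behaves like "utf-8".
text = raw.decode("utf-8-sig")

doc = json.loads(text)
print(len(doc["emoji"]), doc["name"])
```

Here the standard library joins the two `\uXXXX` escapes into a single code point, which is the behaviour the note above expects from a JSON library.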
- property data: str¶
Data passed to build_doc().
- class XLSPage(browser, response, params=None, encoding=None)[source]¶
Bases:
Page
XLS Page.
- HEADER = None¶
If not None, will consider the line represented by this index as a header.
- SHEET_INDEX = 0¶
Specify the index of the worksheet to use.
- class XMLPage(browser, response, params=None, encoding=None)[source]¶
Bases:
Page
XML Page.
- class RawPage(browser, response, params=None, encoding=None)[source]¶
Bases:
Page
Raw page where the “doc” attribute is the content string.
- class HTMLPage(*args, **kwargs)[source]¶
Bases:
Page
HTML page.
- Parameters:
browser (woob.browser.browsers.Browser) – browser used to go on the page
response (Response) – response object
params (dict) – optional dictionary containing parameters given to the page (see woob.browser.url.URL)
encoding (str) – optional parameter to force the encoding of the page
- FORM_CLASS¶
The class to instantiate when using HTMLPage.get_form(). Defaults to Form.
alias of Form
- REFRESH_MAX: ClassVar[int | None] = None¶
When handling a “Refresh” meta header, the page considers it only if the sleep time is less than this value.
The default value is None, which means refreshes aren't handled.
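The rule can be sketched as a small decision function. This is only an illustration of the comparison described above, not HTMLPage's code: `parse_refresh` and `should_follow` are hypothetical helpers, and the `"<seconds>; url=<target>"` form of the `content` attribute is the only one handled.

```python
def parse_refresh(content):
    """Split a meta refresh value such as '5; url=/next' into (delay, url or None)."""
    delay_part, _, rest = content.partition(";")
    delay = float(delay_part.strip())
    rest = rest.strip()
    url = None
    if rest.lower().startswith("url="):
        # Drop the "url=" prefix and any surrounding quotes.
        url = rest[4:].strip().strip("'\"") or None
    return delay, url


def should_follow(content, refresh_max):
    """Follow the refresh only when a maximum is set and the delay is below it."""
    if refresh_max is None:
        return False  # None: refreshes are not handled at all
    delay, _url = parse_refresh(content)
    return delay < refresh_max


print(parse_refresh("5; url=/next"))
print(should_follow("5; url=/next", 10))
print(should_follow("30; url=/next", 10))
print(should_follow("5; url=/next", None))
```

A small maximum lets immediate redirects through while ignoring long timers, such as session-expiry reloads.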
- REFRESH_XPATH: ClassVar[str] = '//head//meta[lower-case(@http-equiv)="refresh"]'¶
Default XPath, which is also the most common; override it if needed.
- classmethod define_xpath_functions(ns)[source]¶
Define XPath functions on the given lxml function namespace.
This method is called in the constructor of HTMLPage and can be overridden by child classes to add extra functions.
- build_doc(content)[source]¶
Method to build the lxml document from response and given encoding.
- Return type:
_ElementTree
- detect_encoding()[source]¶
Look for encoding in the document “http-equiv” and “charset” meta nodes.
- Return type:
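A lookup of this kind can be sketched with the standard `html.parser` module. HTMLPage itself works on an lxml document, so the code below is only an approximation under assumptions: `MetaCharsetFinder` and `sniff_encoding` are hypothetical names, and only the first matching meta node is considered.

```python
import re
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the first encoding declared by a <meta> node."""

    def __init__(self):
        super().__init__()
        self.encoding = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.encoding is not None:
            return
        attrs = dict(attrs)
        # HTML5 form: <meta charset="...">
        if attrs.get("charset"):
            self.encoding = attrs["charset"].strip().lower()
            return
        # Older form: <meta http-equiv="Content-Type" content="...; charset=...">
        if (attrs.get("http-equiv") or "").lower() == "content-type":
            match = re.search(
                r"charset\s*=\s*([\w-]+)", attrs.get("content") or "", re.I
            )
            if match:
                self.encoding = match.group(1).lower()


def sniff_encoding(html_text):
    """Return the declared encoding name in lower case, or None if absent."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.encoding


print(sniff_encoding('<html><head><meta charset="ISO-8859-15"></head></html>'))
print(sniff_encoding(
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">'
))
print(sniff_encoding("<p>no declaration</p>"))
```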
- get_form(xpath='//form', name=None, id=None, nr=None, submit=None)[source]¶
Get a Form object from a selector. The form will be analyzed and its parameters extracted. If there is more than one “submit” input, only one of them should be chosen to generate the request.
- Parameters:
xpath (str) – xpath string to select forms (default: '//form')
name (str) – if supplied, select a form with the given name (default: None)
nr (int) – if supplied, take the (n+1)th selected form (default: None)
submit (str) – if supplied, xpath string to select the submit element from the form (default: None)
- Return type:
- Raises:
FormNotFound – if no form is found
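The idea of a form as an ordered dictionary pre-filled from the HTML can be illustrated without woob. Everything below is a stand-in built on the standard library: `SketchForm`, `_FormCollector` and `sketch_get_form` are hypothetical, forms are selected by name only, a plain `LookupError` stands in for FormNotFound, and submit inputs are simply skipped rather than selected.

```python
from collections import OrderedDict
from html.parser import HTMLParser
from urllib.parse import urlencode

HTML = """
<form name="login" action="/auth">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="user" value="">
  <input type="password" name="password">
  <input type="submit" name="go" value="Sign in">
</form>
"""


class SketchForm(OrderedDict):
    """Stand-in form: ordered mapping of field names to string values."""


class _FormCollector(HTMLParser):
    """Collect the named inputs of the first form with the wanted name."""

    def __init__(self, name):
        super().__init__()
        self.wanted = name
        self.inside = False
        self.form = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.form is None and attrs.get("name") == self.wanted:
            self.inside = True
            self.form = SketchForm()
        elif tag == "input" and self.inside and attrs.get("name"):
            # Simplification: submit inputs are ignored in this sketch.
            if attrs.get("type") != "submit":
                self.form[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.inside = False


def sketch_get_form(html_text, name):
    """Return the pre-filled form with the given name, or raise LookupError."""
    collector = _FormCollector(name)
    collector.feed(html_text)
    if collector.form is None:
        raise LookupError("no form named %r" % name)
    return collector.form


form = sketch_get_form(HTML, "login")
form["user"] = "alice"        # new values are set as strings, like dict items
form["password"] = "s3cret"
query = urlencode(form)
print(query)
```

The hidden `token` field keeps the value found in the HTML, which is the main convenience of starting from a pre-filled mapping instead of building the request by hand.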
- class PartialHTMLPage(*args, **kwargs)[source]¶
Bases:
HTMLPage
HTML page for broken pages with multiple roots.
This class should typically be used for requests which return only a part of a full document, to insert in another document. Such a sub-document can have multiple root tags, so this class is required in this case.
- class GWTPage(browser, response, params=None, encoding=None)[source]¶
Bases:
Page
GWT page where the “doc” attribute is a list.
More info about the GWT protocol here: https://goo.gl/GP5dv9
- class PDFPage(browser, response, params=None, encoding=None)[source]¶
Bases:
Page
Parse a PDF and write raw data in the “doc” attribute as a string.
- class LoggedPage[source]¶
Bases:
object
A page that only logged users can reach. If we did not get a redirection for this page, we are sure that the login is still active.
Do not use this class for pages with mixed content (logged/anonymous) or for pages with a login form.