Cookbook¶

POST an HTML form¶

class LoginPage(HTMLPage):
    def login(self, username, password):
        form = self.get_form(name='something') # will select //form[@name="something"]
        form['input_name_1'] = username        # will select .//input[@name="input_name_1"]
        form['input_name_2'] = password
        form.submit()                          # will submit the form with the browser's location()

POST form data¶

The data argument specifies the form data to post (it can be a string, dict or OrderedDict):

browser.location('/quux', data={'foo': 'bar'})

Will do:

POST /quux
Content-Type: application/x-www-form-urlencoded

foo=bar

It also works with URL instances:

browser.myurl.go(data={'foo': 'bar'})

POST JSON¶

Data must be encoded as a string:

browser.location('/quux', json={'foo': 'bar'})

Will do:

POST /quux
Content-Type: application/json

{"foo": "bar"}

Equivalent to:

browser.location('/quux', data=json.dumps({'foo': 'bar'}), headers={'Content-Type': 'application/json'})

Add custom headers for one request¶

The headers argument specifies additional headers:

browser.location('/foo', headers={'X-Foo': 1, 'X-Bar': 'hello'})

Or:

browser.myurl.go(headers={'X-Foo': 1, 'X-Bar': 'hello'})

Add headers to all requests¶

The build_request method can be reimplemented to prepare requests:

class MyBrowser(DomainBrowser):
    def build_request(self, *args, **kwargs):
        headers = kwargs.setdefault('headers', {})
        headers['X-Foo'] = 'bar'
        return super(MyBrowser, self).build_request(*args, **kwargs)

Add query-string parameters¶

While data keyword urlencodes and passes in POST, params urlencodes and passes in URL:

browser.location('/qux', params={'foo': 'bar&baz'})

Will do:

GET /qux?foo=bar%26baz

It’s better than:

browser.location('/qux?' + urlencode({'foo': 'bar&baz'}))

It also works fine if some query params are already present:

browser.location('/qux?quack=duck', params={'foo': 'bar&baz'})

Will do:

GET /qux?quack=duck&foo=bar%26baz

Have multiple pages on the same URL¶

Define a is_here attribute in all Page class which have the same URL.

The is_here attr must be a string XPath.

If the XPath is found in the document, the Page will be matched, else other Pages will be checked:

# in browser.py

class MyBrowser(PagesBrowser):
    page_a = URL('/foo', PageA)
    page_b = URL('/foo', PageB)

# in pages.py

class PageA(HTMLPage):
    is_here = '//div[text()="List of As"]'

class PageB(HTMLPage):
    is_here = '//div[text()="Here are the Bs"]'

If an XPath is not enough, is_here can be a method returning a bool:

class PageA(HTMLPage):
    def is_here(self):
        return self.doc.xpath('//div[text()="List of As"]')

Have a page which is sometimes a `LoggedPage`, sometimes isn’t¶

LoggedPage just defines logged = True while other pages define logged = False by default.

To make this attribute variable, use a @property method returning a bool:

class HomePage(HTMLPage):
    @property
    def logged(self):
        return self.doc.xpath('//a[contains(@href,"logout")]')

Skip items in a `ListElement`/`ItemElement`¶

There are multiple ways to skip elements:

class MyPage(HTMLPage):
    @method
    class iter_something(ListElement):
        item_xpath = '//ul/li'

        class item(ItemElement):
            klass = SomeClass

            # condition is called before obj_* parsing takes place
            def condition(self):
                return 'foo' not in self.el.attrib['class']
                # this is a basic example, we could have done
                # item_xpath = '//ul/li[not(has-class("foo"))]'

            def obj_foo(self):
                value = CleanText('.')(self)
                if 'forbidden' in value:
                    raise SkipItem()
                return value

            obj_bar = CleanDecimal(Attr('.', 'number'))

            # validate is called after obj_* parsing is done
            def validate(self, obj):
                return obj.bar != 42

Fix invalid HTML that prevents lxml to be parsed¶

When the document must be fixed before being parsed, build_doc() can be overridden:

class MyPage(HTMLPage):
    def build_doc(self, content):
        content = content.replace(b'\x00', b'') # when the doc erroneously contains null bytes
        return super(MyPage, self).build_doc(content)

Follow HTML `<meta>` redirects¶

Some sites do not use HTTP 3xx redirect codes but HTML <meta> refreshing. HTMLPage can handle it if enabled:

class RedirectPage(HTMLPage):
    REFRESH_MAX = 0

Automatically follow a link on a page¶

Some sites do not even do that and may use Javascript to follow a link. The on_load method is called for each location:

class PortalPage(HTMLPage):
    def on_load(self):
        self.browser.location(Link('//a[@id="target"]')(self.doc))

Poll data, i.e. request an endpoint regularly¶

For some occasions, for example waiting on the user to validate an operation on a mobile app, we poll an endpoint regularly to get a status or some additional data. In this case, the following structure should be adopted:

from woob.tools.misc import polling_loop

# ...

for _ in polling_loop(count=10):
    self.polling_status.go()

    if self.page.get_status() != 'WAITING':
        break
else:
    raise AppValidationError('Still waiting after 10 iterations')

You can also use timeouts, and set the delay:

from woob.tools.misc import polling_loop

# ...

for _ in polling_loop(timeout=300, delay=10):
    # Every 10 seconds.
    self.polling_status.go()

    if self.page.get_status() != 'WAITING':
        break
else:
    raise AppValidationError('Still waiting after 300 seconds')

Parse data from an HTML table¶

This example code isn’t very semantic and could fail silently if columns are changed:

class MyPage(HTMLPage):
    @method
    class iter_stuff(ListElement):
        item_xpath = '//table/tr[pos() > 1]' # data rows

        class item(ItemElement):
            klass = Stuff

            obj_id = CleanText('./td[1]')
            obj_foo = CleanText('./td[2]')

It can be improved by using a TableElement and the column labels:

class MyPage(HTMLPage):
    @method
    class iter_stuff(TableElement):
        head_xpath = '//table/tr/th' # where to look for column titles

        # these are the column titles from the site
        col_id = 'Identifier'  # Exact match
        col_foo = re.compile(r'^Foo value for today \(.*\)')  # regexp for finer matching
        col_bar = ['Bar', 'Barr']  # Multiple exact matches

        item_xpath = '//table/tr[pos() > 1]' # data rows

        class item(ItemElement):
            klass = Stuff

            obj_id = CleanText(TableCell('id'))
            obj_foo = CleanText(TableCell('foo'))

Handle multiple tables with similar headers¶

Sometimes, you might encounter a page with multiple tables to parse. The columns are titled the same, but they aren’t at the same column index. So, it’s required to restart TableElement column processing for each table. It’s possible to encapsulate elements in other elements:

class MultiPage(HTMLPage):
    @method
    class iter_stuff(ListElement):
        item_xpath = '//table'

        class one_table(TableElement):
            head_xpath = './thead/tr/th'
            item_xpath = './tbody/tr'

            col_foo = 'Foo'

            class item(ItemElement):
                obj_foo = CleanText(TableCell('foo'))

Handle pagination¶

For a simple case where there’s a “Next” link in the page:

class MyPage(HTMLPage):
    @pagination
    @method
    class iter_stuff(ListElement):
        next_page = AbsoluteLink('//a[text()="Next"]', default=None)
        # when it evaluates to None, pagination stops

        item_xpath = '//ul/li'

        class item(ItemElement):
            klass = SomeClass
            obj_text = CleanText('.')

Handle pagination with POST¶

When going to next page requires making a POST:

class MyPage(JsonPage):
    @pagination
    @method
    class iter_stuff(ListElement):
        def next_page(self):
            if self.page.doc.get('next_page_params'):
                return requests.Request('POST', self.page.url, data=self.page.doc.get('next_page_params'))

        item_xpath = 'items'

        class item(ItemElement):
            klass = SomeClass
            obj_text = Dict('text')

Handle HTTP errors¶

HTTP errors raise exceptions that can be caught:

class MyBrowser(PagesBrowser):
    def do_stuff(self):
        try:
            self.location('/')
        except HTTPNotFound: # in case of 404
            pass
        except ClientError as e: # in case of 4xx
            # for all these exceptions, the response attribute is set
            self.logger.warning('failed with code %d', e.response.status_code)
            pass
        except ServerError: # in case of 5xx
            raise
        except HTTPError: # other cases
            raise

Parse an object from multiple pages¶

Sometimes, object info is spread across multiple pages:

class Page1(HTMLPage):
    @method
    class get_stuff(ItemElement):
        klass = Stuff

        obj_foo = CleanText('//h1')

class Page2(HTMLPage):
    @method
    class fill_stuff(ItemElement):
        klass = Stuff
        # if we're sure we'll always go on Page1 first, we can omit previous line

        obj_bar = CleanText('//a[@class="bar"]')

Since the pages contain different attributes, info can be merged easily:

def get_stuff(self):
    self.page1.go()
    stuff = self.page.get_stuff()  # get a Stuff object with foo set

    self.page2.go()
    self.page.fill_stuff(obj=stuff)  # fill existing Stuff object's bar

    return stuff  # foo and bar are set

This can also be useful for implementing fillobj().

Unfortunately, this doesn’t work when multiple objects are parsed (for example, in a ListElement). In this case, manual merging is required, and linking objects from each page.

Use `ItemElement` with non-scalar attributes¶

Some BaseObject subclasses have fields of other BaseObject types, for example:

class Foo(BaseObject):
    name = StringField('name')

class Bar(BaseObject):
    simple = StringField('simple')
    foo = Field('foo', Foo)
    multiple = Field('multiple foo objects', list)

They may still be parsed with ItemElement:

class item(ItemElement):
    klass = Bar

    obj_simple = Attr('span', 'class')

    class obj_foo(ItemElement):
        klass = Foo

        obj_name = CleanText('span')

This also works for ListElement:

class item(ItemElement):
    klass = Bar

    class obj_multiple(ListElement):
        item_xpath = './/li'

        class item(ItemElement):
            klass = Foo

            obj_name = CleanText('.')

Cookbook¶

POST an HTML form¶

POST form data¶

POST JSON¶

Add custom headers for one request¶

Add headers to all requests¶

Add query-string parameters¶

Have multiple pages on the same URL¶

Have a page which is sometimes a `LoggedPage`, sometimes isn’t¶

Skip items in a `ListElement`/`ItemElement`¶

Fix invalid HTML that prevents lxml to be parsed¶

Follow HTML `<meta>` redirects¶

Automatically follow a link on a page¶

Poll data, i.e. request an endpoint regularly¶

Parse data from an HTML table¶

Handle multiple tables with similar headers¶

Handle HTTP errors¶

Parse an object from multiple pages¶

Use `ItemElement` with non-scalar attributes¶

Navigation

External links

Related Topics

Cookbook¶

POST an HTML form¶

POST form data¶

POST JSON¶

Add custom headers for one request¶

Add headers to all requests¶

Add query-string parameters¶

Have multiple pages on the same URL¶

Have a page which is sometimes a LoggedPage, sometimes isn’t¶

Skip items in a ListElement/ItemElement¶

Fix invalid HTML that prevents lxml to be parsed¶

Follow HTML <meta> redirects¶

Automatically follow a link on a page¶

Poll data, i.e. request an endpoint regularly¶

Parse data from an HTML table¶

Handle multiple tables with similar headers¶

Handle pagination¶

Handle pagination with POST¶

Handle HTTP errors¶

Parse an object from multiple pages¶

Use ItemElement with non-scalar attributes¶

Have a page which is sometimes a `LoggedPage`, sometimes isn’t¶

Skip items in a `ListElement`/`ItemElement`¶

Follow HTML `<meta>` redirects¶

Use `ItemElement` with non-scalar attributes¶