`woob.browser.filters.standard`¶

class Filter(selector=None, default=_NO_DEFAULT)[source]¶

Bases: _Filter

Class used to filter an HTML element, given as the call parameter, and return the matching elements.

Filters can be chained, so the parameter supplied to the constructor can be either an XPath selector string or another filter to be called first.

>>> from lxml.html import etree
>>> f = CleanDecimal(CleanText('//p'), replace_dots=True)
>>> f(etree.fromstring('<html><body><p>blah: <span>229,90</span></p></body></html>'))
Decimal('229.90')
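The chaining idea can be pictured with a small stand-alone sketch. It does not use woob: `SketchFilter`, `Strip` and `ToDecimal` are hypothetical names, and the selector is simplified to either a plain function of the input or another filter.

```python
from decimal import Decimal


class SketchFilter:
    """Hypothetical stand-in for a chainable filter (not woob's Filter)."""

    def __init__(self, selector=None):
        # The selector is either another SketchFilter or a plain function.
        self.selector = selector

    def __call__(self, item):
        # Resolve the selector against the item, then filter the result.
        value = item if self.selector is None else self.selector(item)
        return self.filter(value)

    def filter(self, value):
        raise NotImplementedError


class Strip(SketchFilter):
    def filter(self, value):
        return value.strip()


class ToDecimal(SketchFilter):
    def filter(self, value):
        # Treat ',' as the decimal separator, as in the example above.
        return Decimal(value.replace(',', '.'))


# The inner filter runs first; its output feeds the outer filter.
chained = ToDecimal(Strip(lambda item: item['price']))
result = chained({'price': '  229,90 '})
```

Each subclass only implements `filter()`; the base class handles the lookup of the input value, which is what makes nesting possible.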

select(selector, item)[source]¶

filter(value)[source]¶: This method has to be overridden by children classes.

exception FilterError[source]¶: Bases: ParseError

exception RegexpError[source]¶: Bases: FilterError

exception FormatError[source]¶: Bases: FilterError

class AsyncLoad(selector=None, default=_NO_DEFAULT)[source]¶

Bases: Filter

Load a page asynchronously for later use.

Often used in combination with the Async filter.

class Async(name, selector=None)[source]¶

Bases: Filter

Selector that uses another page fetched earlier.

Often used in combination with the AsyncLoad filter. Requires that the other page’s URL is matched with a Page by the Browser.

Example:

class item(ItemElement):
    load_details = Field('url') & AsyncLoad

    obj_description = Async('details') & CleanText('//h3')

filter(*args)[source]¶: This method has to be overridden by children classes.

loaded_page(item)[source]¶

class Base(base, selector=None, default=_NO_DEFAULT)[source]¶

Bases: Filter

Change the base element used in filters.

>>> Base(Env('header'), CleanText('./h1'))  

class Decode(selector=None, default=_NO_DEFAULT)[source]¶

Bases: Filter

Filter that decodes URL-encoded strings.

>>> Decode(Env('_id'))  
<woob.browser.filters.standard.Decode object at 0x...>
>>> from .html import Link
>>> Decode(Link('./a'))  
<woob.browser.filters.standard.Decode object at 0x...>
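As an illustration of what decoding a URL-encoded string means, here is a minimal sketch using only the standard library. `decode_sketch` is a hypothetical helper, and whether woob relies on the same call internally is an assumption.

```python
from urllib.parse import unquote


def decode_sketch(txt):
    # Turn percent-escapes such as %C3%A9 back into characters (UTF-8).
    return unquote(txt)


decoded = decode_sketch('caf%C3%A9%20cr%C3%A8me')
```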

filter(txt)[source]¶: This method has to be overridden by children classes.

class Env(name, default=_NO_DEFAULT)[source]¶

Bases: _Filter

Filter to get an environment value of the item.

It is used for example to get page parameters, or when there is a parse() method on ItemElement.

class RawText(selector=None, children=False, default=_NO_DEFAULT)[source]¶

Bases: Filter

Get raw text from an element.

Unlike CleanText, whitespace is kept as is.

filter(el)[source]¶: This method has to be overridden by children classes.

class CleanText(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)[source]¶

Bases: Filter

Get a cleaned text from an element.

It first replaces all tabs and runs of spaces (including newlines if newlines is True) with a single space, then strips the resulting string.

The result is coerced into str, and optionally normalized according to the normalize argument.

Then it replaces all symbols given in the symbols argument.

>>> CleanText().filter('coucou ') == u'coucou'
True
>>> CleanText().filter(u'coucou coucou') == u'coucou coucou'
True
>>> CleanText(newlines=True).filter(u'coucou\r\n coucou ') == u'coucou coucou'
True
>>> CleanText(newlines=False).filter(u'coucou\r\n coucou ') == u'coucou\ncoucou'
True
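The cleaning steps described above can be approximated with the standard library. This is a sketch under assumptions, not woob's implementation: `clean_text_sketch` is a hypothetical function, and treating `symbols` as a sequence of substrings to delete is a guess based on the description.

```python
import re
import unicodedata


def clean_text_sketch(txt, newlines=True, normalize='NFC', symbols=''):
    if newlines:
        # Collapse every run of whitespace, line breaks included.
        txt = re.sub(r'\s+', ' ', txt)
    else:
        # Clean each line separately and keep one '\n' between lines.
        txt = '\n'.join(
            re.sub(r'\s+', ' ', line).strip() for line in txt.splitlines()
        )
    txt = txt.strip()
    if normalize:
        # Optional Unicode normalization, e.g. 'NFC'.
        txt = unicodedata.normalize(normalize, txt)
    for symbol in symbols:
        # Assumption: each symbol is simply removed from the text.
        txt = txt.replace(symbol, '')
    return txt.strip()
```

With `newlines=False` the line structure survives while the whitespace inside each line is still collapsed, which matches the last doctest above.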

filter(txt)[source]¶: This method has to be overridden by children classes.

classmethod clean(txt, children=True, newlines=True, normalize='NFC', transliterate=False)[source]¶: Cleans the text. The children argument is ignored with Selenium.

classmethod remove(txt, symbols)[source]¶

classmethod replace(txt, replace)[source]¶

class Lower(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)[source]¶

Bases: CleanText

Extract text with CleanText and convert to lower-case.

filter(txt)[source]¶: This method has to be overridden by children classes.

class Upper(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)[source]¶

Bases: CleanText

Extract text with CleanText and convert to upper-case.

filter(txt)[source]¶: This method has to be overridden by children classes.

class Title(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)[source]¶

Bases: CleanText

Extract text with CleanText and apply title() to it.

filter(txt)[source]¶: This method has to be overridden by children classes.

class Currency(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)[source]¶

Bases: CleanText

filter(txt)[source]¶: This method has to be overridden by children classes.

exception NumberFormatError[source]¶: Bases: FormatError, InvalidOperation

class CleanDecimal(selector=None, replace_dots=False, sign=None, legacy=True, default=_NO_DEFAULT)[source]¶

Bases: CleanText

Get a cleaned Decimal value from an element.

replace_dots is False by default. A dot is interpreted as a decimal separator.

If replace_dots is set to True, all dots are removed and ‘,’ is used as the decimal separator (often useful for French values).

If replace_dots is a tuple, the first element will be used as the thousands separator, and the second as the decimal separator.

See https://en.wikipedia.org/wiki/Thousands_separator#Examples_of_use

For example, for the UK style (as in 1,234,567.89):

>>> CleanDecimal('./td[1]', replace_dots=(',', '.'))  
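The three forms of `replace_dots` can be illustrated with a stand-alone sketch. `parse_decimal_sketch` is a hypothetical helper, not woob code; in particular, discarding every character that is not a digit, a sign or a dot is an assumption of this example.

```python
import re
from decimal import Decimal


def parse_decimal_sketch(text, replace_dots=False):
    if isinstance(replace_dots, tuple):
        # (thousands separator, decimal separator)
        thousands, decimal_sep = replace_dots
    elif replace_dots:
        # True: dots are dropped, ',' is the decimal separator.
        thousands, decimal_sep = '.', ','
    else:
        # False: '.' is already the decimal separator.
        thousands, decimal_sep = '', '.'
    if thousands:
        text = text.replace(thousands, '')
    text = text.replace(decimal_sep, '.')
    # Assumption: anything that is not a digit, '-' or '.' is noise.
    text = re.sub(r'[^\d\-.]', '', text)
    return Decimal(text)
```

For instance, `parse_decimal_sketch('1,234,567.89', replace_dots=(',', '.'))` handles the UK style mentioned above.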

filter(text)[source]¶: This method has to be overridden by children classes.

classmethod US(*args, **kwargs)[source]¶

classmethod French(*args, **kwargs)[source]¶

classmethod SI(*args, **kwargs)[source]¶

classmethod Italian(*args, **kwargs)[source]¶

class Slugify(selector=None, default=_NO_DEFAULT)[source]¶

Bases: Filter

filter(label)[source]¶: This method has to be overridden by children classes.

class Type(selector=None, type=None, minlen=0, default=_NO_DEFAULT)[source]¶

Bases: Filter

Get a cleaned value of any type from an element's text. The type argument can be any callable (class, function, etc.). By default an empty string is not parsed and the default is used instead; passing minlen=False disables this check. Otherwise, minlen sets a minimal length: text whose length does not exceed it is not parsed.

>>> Type(CleanText('./td[1]'), type=int)  

>>> Type(type=int).filter(42)
42
>>> Type(type=int).filter('42')
42
>>> Type(type=int, default='NaN').filter('')
'NaN'
>>> Type(type=list, minlen=False, default=list('ab')).filter('')
[]
>>> Type(type=list, minlen=0, default=list('ab')).filter('')
['a', 'b']
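The interaction between `minlen` and `default` in the doctests above can be mimicked by a short sketch. `type_filter_sketch` and its `_NO_DEFAULT` sentinel are hypothetical; the length rule (`len(value) <= minlen` falls back to the default) is inferred from the examples and is an assumption.

```python
_NO_DEFAULT = object()  # hypothetical sentinel meaning "no default given"


def type_filter_sketch(value, type_func, minlen=0, default=_NO_DEFAULT):
    if isinstance(value, type_func):
        # Already of the requested type: return it unchanged.
        return value
    if minlen is not False and len(value) <= minlen:
        # Too short to parse: use the default, or fail without one.
        if default is _NO_DEFAULT:
            raise ValueError('value is too short to be parsed')
        return default
    try:
        return type_func(value)
    except ValueError:
        if default is _NO_DEFAULT:
            raise
        return default
```

Note that `minlen=False` and `minlen=0` behave differently: the first skips the length check, the second rejects empty input.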

filter(txt)[source]¶: This method has to be overridden by children classes.

class Field(name)[source]¶

Bases: _Filter

Get the value of another attribute of the object.

Example:

obj_foo = CleanText('//h1')
obj_bar = Field('foo')

will make “bar” field equal to “foo” field.

class Regexp(selector=None, pattern=None, template=None, nth=0, flags=0, default=_NO_DEFAULT)[source]¶

Bases: Filter

Apply a regex.

>>> from lxml.html import etree
>>> doc = etree.fromstring('<html><body><p>Date: <span>13/08/1988</span></p></body></html>')
>>> Regexp(CleanText('//p'), r'Date: (\d+)/(\d+)/(\d+)', '\\3-\\2-\\1')(doc) == u'1988-08-13'
True

>>> (Regexp(CleanText('//body'), r'(\d+)', nth=1))(doc) == u'08'
True
>>> (Regexp(CleanText('//body'), r'(\d+)', nth=-1))(doc) == u'1988'
True
>>> (Regexp(CleanText('//body'), r'(\d+)', template='[\\1]', nth='*'))(doc) == [u'[13]', u'[08]', u'[1988]']
True
>>> (Regexp(CleanText('//body'), r'Date:.*'))(doc) == u'Date: 13/08/1988'
True
>>> (Regexp(CleanText('//body'), r'^(?!Date:).*', default=None))(doc)
>>>
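The roles of `template` and `nth` in the doctests above can be reproduced with the standard `re` module. `regexp_sketch` is a hypothetical function; returning the first group when no template is given (or the whole match when the pattern has no group) is an assumption drawn from the examples.

```python
import re


def regexp_sketch(txt, pattern, template=None, nth=0):
    def expand(match):
        if template is not None:
            # Back-references such as \1 are filled in from the match.
            return match.expand(template)
        # Assumption: first group if the pattern has one, else whole match.
        return match.group(1) if match.groups() else match.group(0)

    matches = list(re.finditer(pattern, txt))
    if not matches:
        raise ValueError('pattern not found')
    if nth == '*':
        # Return every occurrence as a list.
        return [expand(m) for m in matches]
    # nth selects one occurrence; negative values count from the end.
    return expand(matches[nth])


text = 'Date: 13/08/1988'
iso = regexp_sketch(text, r'Date: (\d+)/(\d+)/(\d+)', '\\3-\\2-\\1')
```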

expand(m)[source]¶

filter(txt)[source]¶

Raises: RegexpError if pattern was not found

class Map(selector, map_dict, default=_NO_DEFAULT)[source]¶

Bases: Filter

Map selected value to another value using a dict.

Example:

TYPES = {
    'Concert': CATEGORIES.CONCERT,
    'Cinéma': CATEGORIES.CINE,
}

obj_type = Map(CleanText('./li'), TYPES)

filter(txt)[source]¶

Raises: ItemNotFound if key does not exist in dict

class MapIn(selector, map_dict, default=_NO_DEFAULT)[source]¶

Bases: Filter

Map the pattern of a selected value to another value using a dict.

filter(txt)[source]¶

Raises: ItemNotFound if key pattern does not exist in dict
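The difference between Map and MapIn can be sketched as exact lookup versus lookup by containment. Both functions below are hypothetical, and reading "pattern" as "a key that occurs inside the selected text" is an assumption; `KeyError` stands in for ItemNotFound.

```python
def map_sketch(txt, map_dict):
    # Exact match: the selected text must be a key of the dict.
    return map_dict[txt]


def map_in_sketch(txt, map_dict):
    # Assumption: the first key found inside the text wins.
    for key, value in map_dict.items():
        if key in txt:
            return value
    raise KeyError(txt)


TYPES = {'Concert': 'CONCERT', 'Cinéma': 'CINE'}
exact = map_sketch('Concert', TYPES)
partial = map_in_sketch('Grand Concert de jazz', TYPES)
```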

class DateTime(selector=None, default=_NO_DEFAULT, translations=None, parse_func=parse_date, strict=True, tzinfo=None, **kwargs)[source]¶

Bases: Filter

Parse date and time.

filter(txt)[source]¶: This method has to be overridden by children classes.

class FromTimestamp(selector, millis=False, tz=None, default=_NO_DEFAULT)[source]¶

Bases: Filter

Parse a timestamp into a datetime.
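A minimal sketch of the conversion, using only `datetime`. `from_timestamp_sketch` is a hypothetical helper; it assumes `millis=True` means the input is expressed in milliseconds and that `tz` is passed straight to `datetime.fromtimestamp`.

```python
from datetime import datetime, timezone


def from_timestamp_sketch(value, millis=False, tz=None):
    seconds = float(value)
    if millis:
        # Millisecond timestamps are scaled down to seconds.
        seconds /= 1000
    return datetime.fromtimestamp(seconds, tz=tz)


moment = from_timestamp_sketch('1500000000000', millis=True, tz=timezone.utc)
```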

filter(txt)[source]¶: This method has to be overridden by children classes.

class Date(selector=None, default=_NO_DEFAULT, translations=None, parse_func=parse_date, strict=True, **kwargs)[source]¶

Bases: DateTime

Parse date.

filter(txt)[source]¶: This method has to be overridden by children classes.

class DateGuesser(selector, date_guesser, **kwargs)[source]¶: Bases: Filter

class Time(selector=None, default=_NO_DEFAULT)[source]¶

Bases: Filter

Parse time.

klass¶: alias of time

kwargs = {'hour': 'hh', 'minute': 'mm', 'second': 'ss'}¶

filter(txt)[source]¶: This method has to be overridden by children classes.

class Duration(selector=None, default=_NO_DEFAULT)[source]¶

Bases: Time

Parse a duration as timedelta.

klass¶: alias of timedelta

kwargs = {'hours': 'hh', 'minutes': 'mm', 'seconds': 'ss'}¶
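The `klass` and `kwargs` attributes suggest a shared parsing routine in which named regex groups are mapped onto constructor arguments. The sketch below shows that idea with the standard library; the regular expression and `parse_with_sketch` are assumptions made for illustration, not woob's actual pattern.

```python
import re
from datetime import time, timedelta

# Hypothetical pattern: "hh:mm" with optional ":ss".
_PATTERN = re.compile(r'(?P<hh>\d+):(?P<mm>\d+)(?::(?P<ss>\d+))?')

TIME_KWARGS = {'hour': 'hh', 'minute': 'mm', 'second': 'ss'}
DURATION_KWARGS = {'hours': 'hh', 'minutes': 'mm', 'seconds': 'ss'}


def parse_with_sketch(txt, klass, kwargs):
    match = _PATTERN.search(txt)
    if match is None:
        raise ValueError('no time found in %r' % txt)
    # Map each constructor argument to its named group (missing -> 0).
    values = {arg: int(match.group(group) or 0) for arg, group in kwargs.items()}
    return klass(**values)


clock = parse_with_sketch('12:34:56', time, TIME_KWARGS)
span = parse_with_sketch('36:15', timedelta, DURATION_KWARGS)
```

Only the target class and the argument names change between the two cases, which is why a duration can exceed 24 hours while a time of day cannot.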

class MultiFilter(*args, **kwargs)[source]¶

Bases: Filter

filter(values)[source]¶: This method has to be overridden by children classes.

class CombineDate(date, time)[source]¶

Bases: MultiFilter

Combine separate Date and Time filters into a single datetime.
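In plain Python, combining a date and a time is a single standard-library call. The sketch below only illustrates the result type; `combine_sketch` is a hypothetical name and the use of `datetime.combine` inside woob is an assumption.

```python
from datetime import date, datetime, time


def combine_sketch(values):
    # values holds the two already-parsed results: (date, time).
    day, clock = values
    return datetime.combine(day, clock)


combined = combine_sketch([date(1988, 8, 13), time(20, 30)])
```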

filter(values)[source]¶: This method has to be overridden by children classes.

class Format(fmt, *args)[source]¶

Bases: MultiFilter

Combine multiple filters with string-format.

Example:

obj_title = Format('%s (%s)', CleanText('//h1'), CleanText('//h2'))

will concatenate the text from all <h1> and all <h2> (but put the latter between parentheses).
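Once the inner filters have produced their values, the formatting step amounts to printf-style interpolation. A minimal sketch, with `format_sketch` as a hypothetical helper and sample strings invented for the example:

```python
def format_sketch(fmt, values):
    # The values are substituted into the format string in order.
    return fmt % tuple(values)


title = format_sketch('%s (%s)', ['Main heading', 'Subheading'])
```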

filter(values)[source]¶: This method has to be overridden by children classes.

class BrowserURL(url_name, **kwargs)[source]¶

Bases: MultiFilter

Format URL using names in parent Browser

This filter formats a URL using a URL pattern defined in the browser instance of this page.

class MyBrowser:
    mypage = URL(r'(?P<category>\w+)/(?P<id>\w+)')

class OnePage(Page):
    class item(ItemElement):
        obj_myfield = BrowserURL('mypage', id=Dict('id'), category=Dict('category'))

filter(values)[source]¶: This method has to be overridden by children classes.

class Join(pattern, selector=None, textCleaner=CleanText, newline=False, addBefore='', addAfter='', default=_NO_DEFAULT)[source]¶

Bases: Filter

Join multiple results from a selector.

>>> Join(' - ', '//div/p')

>>> Join(pattern=', ').filter([u"Oui", u"bonjour", ""]) == u"Oui, bonjour"
True
>>> Join(pattern='-').filter([u"Au", u"revoir", ""]) == u"Au-revoir"
True
>>> Join(pattern='-').filter([]) == u""
True
>>> Join(pattern='-', default=u'empty').filter([]) == u'empty'
True
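The behaviour shown by these doctests (empty items are dropped, and an empty result falls back to the default) can be sketched as follows. `join_sketch` is a hypothetical function that covers only `pattern` and `default`; the other constructor options are left out.

```python
def join_sketch(values, pattern, default=''):
    # Drop empty items before joining.
    parts = [value for value in values if value]
    if not parts:
        return default
    return pattern.join(parts)
```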

filter(el)[source]¶: This method has to be overridden by children classes.

class MultiJoin(*args, **kwargs)[source]¶

Bases: MultiFilter

Join multiple filters.

>>> MultiJoin(Field('field1'), Field('field2'))

>>> MultiJoin(pattern=u', ').filter([u"Oui", u"bonjour", ""]) == u"Oui, bonjour"
True
>>> MultiJoin(pattern=u'-').filter([u"Au", u"revoir", ""]) == u"Au-revoir"
True
>>> MultiJoin(pattern=u'-').filter([]) == u""
True
>>> MultiJoin(pattern=u'-', default=u'empty').filter([]) == u'empty'
True
>>> MultiJoin(pattern=u'-').filter([1, 2, 3]) == u'1-2-3'
True

filter(values)[source]¶: This method has to be overridden by children classes.

class Eval(func, *args)[source]¶

Bases: MultiFilter

Evaluate a function with given ‘deferred’ arguments.

>>> F = Field; Eval(lambda a, b, c: a * b + c, F('foo'), F('bar'), F('baz')) 
>>> Eval(lambda x, y: x * y + 1).filter([3, 7])
22

Example:

obj_ratio = Eval(lambda x: x / 100, Env('percentage'))
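"Deferred" here means that the arguments are not values yet but things that will yield values once an item is available. A stand-alone sketch of that idea, where `eval_sketch` is hypothetical and plain functions of the item play the role of the argument filters:

```python
def eval_sketch(func, *deferred):
    def run(item):
        # Resolve every deferred argument against the item, then call func.
        return func(*(argument(item) for argument in deferred))
    return run


ratio = eval_sketch(lambda x: x / 100, lambda item: item['percentage'])
value = ratio({'percentage': 25})
```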

filter(values)[source]¶: This method has to be overridden by children classes.

class QueryValue(selector, key, default=_NO_DEFAULT)[source]¶

Bases: Filter

Extract the value of a parameter from a URL with a query string.

>>> from lxml.html import etree
>>> from .html import Link
>>> f = QueryValue(Link('//a'), 'id')
>>> f(etree.fromstring('<html><body><a href="https://example.org/view?id=1234"></a></body></html>')) == u'1234'
True
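The extraction itself can be done with `urllib.parse`. The sketch below is not woob's code: `query_value_sketch` is a hypothetical helper, it returns the first value when a key is repeated (an assumption), and it raises `KeyError` where the real filter would signal a missing item.

```python
from urllib.parse import parse_qs, urlparse

_MISSING = object()  # hypothetical sentinel meaning "no default given"


def query_value_sketch(url, key, default=_MISSING):
    # parse_qs maps each parameter name to a list of values.
    values = parse_qs(urlparse(url).query).get(key)
    if not values:
        if default is _MISSING:
            raise KeyError(key)
        return default
    return values[0]


found = query_value_sketch('https://example.org/view?id=1234', 'id')
```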

filter(url)[source]¶: This method has to be overridden by children classes.

class Coalesce(*args, **kwargs)[source]¶

Bases: MultiFilter

Returns the first value that is not falsy, or default if all values are falsy.
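The selection rule is the same as a chain of `or` operators. A minimal sketch, with `coalesce_sketch` as a hypothetical function operating on already-computed values:

```python
def coalesce_sketch(values, default=None):
    # Return the first truthy value, or the default when there is none.
    return next((value for value in values if value), default)
```

Because the test is truthiness, values such as `0`, `''` and `[]` are skipped just like `None`.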

filter(values)[source]¶: This method has to be overridden by children classes.

class CountryCode(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)[source]¶

Bases: CleanText

Filter to get the country ISO 3166-1 alpha-2 code from the country name

filter(txt)[source]¶

Get the country code from the name of the country

Parameters: txt (str) – Country name
Raises: FormatError – if the Country name is not found
Return type: Any

>>> CountryCode().filter('france')
'fr'
>>> CountryCode(default='ll').filter('Greez')
'll'
