woob.browser.filters.standard
- class Filter(selector=None, default=_NO_DEFAULT)
Bases: _Filter
Class used to filter on an HTML element given as call parameter, returning the matching elements.
Filters can be chained, so the parameter supplied to the constructor can be either an XPath selector string or another filter that is called first.
>>> from lxml.html import etree
>>> f = CleanDecimal(CleanText('//p'), replace_dots=True)
>>> f(etree.fromstring('<html><body><p>blah: <span>229,90</span></p></body></html>'))
Decimal('229.90')
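The chaining idea can be illustrated without woob or lxml. The following is only a toy sketch: MiniFilter, Strip and ToDecimal are hypothetical names, and a dict key stands in for the XPath selector.

```python
from decimal import Decimal


class MiniFilter:
    """Toy stand-in for a chainable filter (hypothetical, not woob's Filter)."""

    def __init__(self, selector=None):
        self.selector = selector

    def select(self, item):
        if isinstance(self.selector, MiniFilter):
            # The selector is another filter: run it first.
            return self.selector(item)
        if isinstance(self.selector, str):
            # Assumption: a dict key plays the role of the XPath string.
            return item[self.selector]
        return item

    def __call__(self, item):
        return self.filter(self.select(item))

    def filter(self, value):
        return value


class Strip(MiniFilter):
    def filter(self, value):
        return value.strip()


class ToDecimal(MiniFilter):
    def filter(self, value):
        return Decimal(value.replace(',', '.'))


# The inner filter selects and cleans, the outer one converts.
chained = ToDecimal(Strip('price'))
result = chained({'price': '  229,90 '})
```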
- exception FilterError
Bases: ParseError
- exception RegexpError
Bases: FilterError
- exception FormatError
Bases: FilterError
- class AsyncLoad(selector=None, default=_NO_DEFAULT)
Bases: Filter
Load a page asynchronously for later use.
Often used in combination with the Async filter.
- class Async(name, selector=None)
Bases: Filter
Selector that uses another page fetched earlier.
Often used in combination with the AsyncLoad filter. Requires that the other page's URL is matched with a Page by the Browser.
Example:
class item(ItemElement):
    load_details = Field('url') & AsyncLoad
    obj_description = Async('details') & CleanText('//h3')
- class Base(base, selector=None, default=_NO_DEFAULT)
Bases: Filter
Change the base element used in filters.
>>> Base(Env('header'), CleanText('./h1'))
- class Decode(selector=None, default=_NO_DEFAULT)
Bases: Filter
Filter that decodes URL-encoded strings.
>>> Decode(Env('_id'))
<woob.browser.filters.standard.Decode object at 0x...>
>>> from .html import Link
>>> Decode(Link('./a'))
<woob.browser.filters.standard.Decode object at 0x...>
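For reference, URL-decoding itself can be done with the standard library. This is a stand-alone sketch of the operation, not woob's code; Decode may handle details such as '+' differently, and the sample string is made up.

```python
from urllib.parse import unquote

# Percent-encoded UTF-8 bytes are turned back into characters.
encoded = 'caf%C3%A9%20au%20lait'
decoded = unquote(encoded)
```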
- class Env(name, default=_NO_DEFAULT)
Bases: _Filter
Filter to get an environment value of the item.
It is used, for example, to get page parameters, or when there is a parse() method on the ItemElement.
- class RawText(selector=None, children=False, default=_NO_DEFAULT)
Bases: Filter
Get raw text from an element.
Unlike CleanText, whitespace is kept as is.
- class CleanText(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)
Bases: Filter
Get a cleaned text from an element.
It first replaces all tabs and multiple spaces (including newlines if newlines is True) with a single space and strips the resulting string.
The result is coerced into str, and optionally normalized according to the normalize argument.
Then it replaces all symbols given in the symbols argument.
>>> CleanText().filter('coucou ') == u'coucou'
True
>>> CleanText().filter(u'coucou coucou') == u'coucou coucou'
True
>>> CleanText(newlines=True).filter(u'coucou\r\n coucou ') == u'coucou coucou'
True
>>> CleanText(newlines=False).filter(u'coucou\r\n coucou ') == u'coucou\ncoucou'
True
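The cleaning steps described for CleanText can be approximated on plain strings. This is a minimal sketch, not woob's implementation: clean_text is a hypothetical helper, and it assumes that symbols are simply removed and that newlines=False cleans each line separately.

```python
import re
import unicodedata


def clean_text(txt, symbols='', newlines=True, normalize='NFC'):
    """Hypothetical helper approximating the documented cleaning steps."""
    if newlines:
        # Collapse every whitespace run, newlines included, to one space.
        txt = re.sub(r'\s+', ' ', txt)
    else:
        # Assumption: keep line breaks, clean each line on its own.
        txt = '\n'.join(clean_text(line) for line in txt.splitlines())
    txt = txt.strip()
    if normalize:
        txt = unicodedata.normalize(normalize, txt)
    for symbol in symbols:
        # Assumption: each symbol is removed outright.
        txt = txt.replace(symbol, '')
    return txt.strip()
```

With these assumptions the helper gives the same results as the CleanText doctests for the inputs shown there.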
- class Lower(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)
Bases: CleanText
Extract text with CleanText and convert it to lower case.
- class Upper(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)
Bases: CleanText
Extract text with CleanText and convert it to upper case.
- class Title(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)
Bases: CleanText
Extract text with CleanText and apply title() to it.
- class Currency(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)
Bases: CleanText
- exception NumberFormatError
Bases: FormatError, InvalidOperation
- class CleanDecimal(selector=None, replace_dots=False, sign=None, legacy=True, default=_NO_DEFAULT)
Bases: CleanText
Get a cleaned Decimal value from an element.
replace_dots is False by default: a dot is interpreted as the decimal separator.
If replace_dots is set to True, all dots are removed and ',' is used as the decimal separator (often useful for French values).
If replace_dots is a tuple, its first element is used as the thousands separator and its second as the decimal separator.
See https://en.wikipedia.org/wiki/Thousands_separator#Examples_of_use
For example, for the UK style (as in 1,234,567.89):
>>> CleanDecimal('./td[1]', replace_dots=(',', '.'))
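The three replace_dots modes can be sketched on plain strings. to_decimal is a hypothetical helper written for illustration; the real filter also takes sign and legacy arguments, which are not modelled here.

```python
import re
from decimal import Decimal


def to_decimal(text, replace_dots=False):
    """Hypothetical helper mirroring the documented replace_dots modes."""
    if isinstance(replace_dots, tuple):
        thousands_sep, decimal_sep = replace_dots
    elif replace_dots:
        thousands_sep, decimal_sep = '.', ','
    else:
        thousands_sep, decimal_sep = None, '.'
    if thousands_sep:
        text = text.replace(thousands_sep, '')
    text = text.replace(decimal_sep, '.')
    # Assumption: anything that is not a digit, dot or minus is dropped.
    text = re.sub(r'[^\d.\-]', '', text)
    return Decimal(text)


uk_style = to_decimal('1,234,567.89', replace_dots=(',', '.'))
french_style = to_decimal('1.234,56 EUR', replace_dots=True)
```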
- class Type(selector=None, type=None, minlen=0, default=_NO_DEFAULT)
Bases: Filter
Get a cleaned value of any type from an element text. The type can be any callable (class, function, etc.). By default an empty string is not parsed; this can be changed by specifying minlen=False. Otherwise, a minimal length can be specified.
>>> Type(CleanText('./td[1]'), type=int)
>>> Type(type=int).filter(42)
42
>>> Type(type=int).filter('42')
42
>>> Type(type=int, default='NaN').filter('')
'NaN'
>>> Type(type=list, minlen=False, default=list('ab')).filter('')
[]
>>> Type(type=list, minlen=0, default=list('ab')).filter('')
['a', 'b']
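The difference between minlen=0 and minlen=False is easy to miss, since 0 and False compare equal in Python. A sketch of one way to get the behaviour shown in the doctests, using a hypothetical convert function and an identity check (an assumption about the approach, not a copy of woob's code):

```python
def convert(value, type_func, minlen=0, default=None):
    """Hypothetical helper reproducing the doctest behaviour of Type."""
    if isinstance(type_func, type) and isinstance(value, type_func):
        # Already of the requested type: nothing to parse.
        return value
    # `is not False` separates minlen=False from minlen=0.
    if minlen is not False and len(value) <= minlen:
        return default
    return type_func(value)
```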
- class Field(name)
Bases: _Filter
Get the attribute of an object.
Example:
obj_foo = CleanText('//h1')
obj_bar = Field('foo')
will make the “bar” field equal to the “foo” field.
- class Regexp(selector=None, pattern=None, template=None, nth=0, flags=0, default=_NO_DEFAULT)
Bases: Filter
Apply a regex.
>>> from lxml.html import etree
>>> doc = etree.fromstring('<html><body><p>Date: <span>13/08/1988</span></p></body></html>')
>>> Regexp(CleanText('//p'), r'Date: (\d+)/(\d+)/(\d+)', '\\3-\\2-\\1')(doc) == u'1988-08-13'
True
>>> (Regexp(CleanText('//body'), r'(\d+)', nth=1))(doc) == u'08'
True
>>> (Regexp(CleanText('//body'), r'(\d+)', nth=-1))(doc) == u'1988'
True
>>> (Regexp(CleanText('//body'), r'(\d+)', template='[\\1]', nth='*'))(doc) == [u'[13]', u'[08]', u'[1988]']
True
>>> (Regexp(CleanText('//body'), r'Date:.*'))(doc) == u'Date: 13/08/1988'
True
>>> (Regexp(CleanText('//body'), r'^(?!Date:).*', default=None))(doc)
>>>
- filter(txt)
- Raises:
RegexpError if pattern was not found
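The template and nth arguments can be understood through the standard re module. The sketch below is a stand-alone approximation on plain strings: apply_regexp is a hypothetical function, it raises ValueError where the filter raises RegexpError, and returning the first group when no template is given is an assumption based on the doctests.

```python
import re


def apply_regexp(text, pattern, template=None, nth=0):
    """Hypothetical helper approximating template/nth on a plain string."""

    def render(match):
        if template is not None:
            # Backreferences such as \1 are expanded from the match.
            return match.expand(template)
        # Assumption: first group if the pattern has groups, else whole match.
        return match.group(1) if match.groups() else match.group(0)

    matches = list(re.finditer(pattern, text))
    if not matches:
        raise ValueError('pattern not found')
    if nth == '*':
        return [render(m) for m in matches]
    return render(matches[nth])


text = 'Date: 13/08/1988'
iso_date = apply_regexp(text, r'Date: (\d+)/(\d+)/(\d+)', r'\3-\2-\1')
```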
- class Map(selector, map_dict, default=_NO_DEFAULT)
Bases: Filter
Map selected value to another value using a dict.
Example:
TYPES = {
    'Concert': CATEGORIES.CONCERT,
    'Cinéma': CATEGORIES.CINE,
}
obj_type = Map(CleanText('./li'), TYPES)
- class MapIn(selector, map_dict, default=_NO_DEFAULT)
Bases: Filter
Map the pattern of a selected value to another value using a dict.
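The description of MapIn is terse. One plausible reading, taken here as an assumption, is that a dict key only has to occur inside the selected value rather than equal it. A stand-alone sketch of that reading, with a hypothetical map_in function and made-up data:

```python
def map_in(value, map_dict, default=None):
    """Hypothetical helper: return the entry whose key occurs in value."""
    for key, mapped in map_dict.items():
        if key in value:
            return mapped
    return default


TYPES = {'Concert': 'CONCERT', 'Cinéma': 'CINE'}
category = map_in('Grand Concert de Noël', TYPES)
```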
- class DateTime(selector=None, default=_NO_DEFAULT, translations=None, parse_func=parse_date, strict=True, tzinfo=None, **kwargs)
Bases: Filter
Parse date and time.
- class FromTimestamp(selector, millis=False, tz=None, default=_NO_DEFAULT)
Bases: Filter
Parse a timestamp into a datetime.
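What the millis and tz parameters of FromTimestamp suggest can be sketched with the standard library. from_timestamp is a hypothetical function, and treating millis=True as "divide by 1000" is an assumption drawn from the parameter name.

```python
from datetime import datetime, timezone


def from_timestamp(value, millis=False, tz=None):
    """Hypothetical helper: Unix timestamp (seconds or ms) to datetime."""
    seconds = float(value)
    if millis:
        # Assumption: millis=True means the value is in milliseconds.
        seconds /= 1000
    return datetime.fromtimestamp(seconds, tz=tz)


moment = from_timestamp('1500', millis=True, tz=timezone.utc)
```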
- class Date(selector=None, default=_NO_DEFAULT, translations=None, parse_func=parse_date, strict=True, **kwargs)
Bases: DateTime
Parse date.
- class Time(selector=None, default=_NO_DEFAULT)
Bases: Filter
Parse time.
- kwargs = {'hour': 'hh', 'minute': 'mm', 'second': 'ss'}
- class Duration(selector=None, default=_NO_DEFAULT)
Bases: Time
Parse a duration as timedelta.
- kwargs = {'hours': 'hh', 'minutes': 'mm', 'seconds': 'ss'}
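The kwargs attributes pair constructor keywords (hours, minutes, seconds) with regex group names (hh, mm, ss). A sketch of how such a mapping can drive the conversion; the regular expression and the parse_duration function are assumptions made for illustration, not the pattern woob uses.

```python
import re
from datetime import timedelta

# Assumed input format: "hh:mm" or "hh:mm:ss".
DURATION_RE = re.compile(r'(?P<hh>\d+):(?P<mm>\d+)(?::(?P<ss>\d+))?')
KWARGS = {'hours': 'hh', 'minutes': 'mm', 'seconds': 'ss'}


def parse_duration(text):
    """Hypothetical helper building a timedelta from named groups."""
    match = DURATION_RE.search(text)
    if match is None:
        raise ValueError('no duration found in %r' % text)
    # Missing optional groups (None) count as zero.
    return timedelta(**{
        name: int(match.group(group) or 0)
        for name, group in KWARGS.items()
    })
```

Swapping the dict for {'hour': 'hh', 'minute': 'mm', 'second': 'ss'} and the constructor for datetime.time would give the Time variant under the same assumptions.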
- class CombineDate(date, time)
Bases: MultiFilter
Combine separate Date and Time filters into a single datetime.
- class Format(fmt, *args)
Bases: MultiFilter
Combine multiple filters with string-format.
Example:
obj_title = Format('%s (%s)', CleanText('//h1'), CleanText('//h2'))
will concatenate the text from all <h1> and all <h2> (but put the latter between parentheses).
- class BrowserURL(url_name, **kwargs)
Bases: MultiFilter
Format a URL using names in the parent Browser.
This filter formats a URL using a URL pattern defined on the browser instance of this page.
class MyBrowser:
    mypage = URL(r'(?P<category>\w+)/(?P<id>\w+)')

class OnePage(Page):
    class item(ItemElement):
        obj_myfield = BrowserURL('mypage', id=Dict('id'), category=Dict('category'))
- class Join(pattern, selector=None, textCleaner=CleanText, newline=False, addBefore='', addAfter='', default=_NO_DEFAULT)
Bases: Filter
Join multiple results from a selector.
>>> Join(' - ', '//div/p') # doctest: +SKIP
>>> Join(pattern=', ').filter([u"Oui", u"bonjour", ""]) == u"Oui, bonjour"
True
>>> Join(pattern='-').filter([u"Au", u"revoir", ""]) == u"Au-revoir"
True
>>> Join(pattern='-').filter([]) == u""
True
>>> Join(pattern='-', default=u'empty').filter([]) == u'empty'
True
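The doctests show that empty items are dropped before joining and that default is used only when nothing is left. A sketch of that behaviour on plain lists; join_values is a hypothetical function, and the textCleaner, newline, addBefore and addAfter options are not modelled.

```python
def join_values(pattern, values, default=None):
    """Hypothetical helper: join non-empty items with a separator."""
    items = [str(value).strip() for value in values if value is not None]
    items = [item for item in items if item]
    if not items and default is not None:
        return default
    return pattern.join(items)
```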
- class MultiJoin(*args, **kwargs)
Bases: MultiFilter
Join multiple filters.
>>> MultiJoin(Field('field1'), Field('field2')) # doctest: +SKIP
>>> MultiJoin(pattern=u', ').filter([u"Oui", u"bonjour", ""]) == u"Oui, bonjour"
True
>>> MultiJoin(pattern=u'-').filter([u"Au", u"revoir", ""]) == u"Au-revoir"
True
>>> MultiJoin(pattern=u'-').filter([]) == u""
True
>>> MultiJoin(pattern=u'-', default=u'empty').filter([]) == u'empty'
True
>>> MultiJoin(pattern=u'-').filter([1, 2, 3]) == u'1-2-3'
True
- class Eval(func, *args)
Bases: MultiFilter
Evaluate a function with given 'deferred' arguments.
>>> F = Field; Eval(lambda a, b, c: a * b + c, F('foo'), F('bar'), F('baz'))
>>> Eval(lambda x, y: x * y + 1).filter([3, 7])
22
Example:
obj_ratio = Eval(lambda x: x / 100, Env('percentage'))
- class QueryValue(selector, key, default=_NO_DEFAULT)
Bases: Filter
Extract the value of a parameter from a URL with a query string.
>>> from lxml.html import etree
>>> from .html import Link
>>> f = QueryValue(Link('//a'), 'id')
>>> f(etree.fromstring('<html><body><a href="https://example.org/view?id=1234"></a></body></html>')) == u'1234'
True
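The extraction step itself needs nothing beyond urllib.parse. A stand-alone sketch on a plain URL string; query_value is a hypothetical function, and returning the first value of a repeated parameter is an assumption.

```python
from urllib.parse import parse_qs, urlparse


def query_value(url, key, default=None):
    """Hypothetical helper: first value of a query-string parameter."""
    params = parse_qs(urlparse(url).query)
    values = params.get(key)
    return values[0] if values else default


item_id = query_value('https://example.org/view?id=1234', 'id')
```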
- class Coalesce(*args, **kwargs)
Bases: MultiFilter
Returns the first value that is not falsy, or default if all values are falsy.
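On plain values the rule reads as follows. coalesce is a hypothetical function used only to illustrate "first non-falsy value, else default"; the real filter evaluates other filters rather than literals.

```python
def coalesce(*values, default=None):
    """Hypothetical helper: first truthy value, else default."""
    for value in values:
        if value:
            return value
    return default


title = coalesce('', None, 'Fallback title', 'ignored')
```

Note that 0 and empty containers are falsy too, so they are skipped like the empty string.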
- class CountryCode(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)
Bases: CleanText
Filter to get the ISO 3166-1 alpha-2 country code from the country name.
- filter(txt)
Get the country code from the name of the country.
- Parameters:
txt (str) – Country name
- Raises:
FormatError – if the country name is not found
>>> CountryCode().filter('france')
'fr'
>>> CountryCode(default='ll').filter('Greez')
'll'