woob.browser.filters.standard
- class Filter(selector=None, default=_NO_DEFAULT)[source]¶
Bases:
_Filter
Class used to filter an HTML element given as call parameter and return the matching elements.
Filters can be chained, so the parameter supplied to the constructor can be either an XPath selector string or another filter that is called first.
>>> from lxml.html import etree
>>> f = CleanDecimal(CleanText('//p'), replace_dots=True)
>>> f(etree.fromstring('<html><body><p>blah: <span>229,90</span></p></body></html>'))
Decimal('229.90')
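The chaining described above can be pictured with a toy model. The sketch below does not use woob: MiniFilter, Strip and ToDecimal are hypothetical stand-ins, and the only point is to show an outer filter running the filter it wraps before applying its own filter() step.

```python
from decimal import Decimal


class MiniFilter:
    """Toy stand-in for a chainable filter (not woob's implementation)."""

    def __init__(self, selector=None):
        # selector may be another filter; this toy model ignores XPath strings
        self.selector = selector

    def __call__(self, item):
        # Run the wrapped filter first, then apply our own step to its result
        value = self.selector(item) if callable(self.selector) else item
        return self.filter(value)

    def filter(self, value):
        return value


class Strip(MiniFilter):
    def filter(self, value):
        return value.strip()


class ToDecimal(MiniFilter):
    def filter(self, value):
        # Treat ',' as the decimal separator, as in the example above
        return Decimal(value.replace(',', '.'))


chain = ToDecimal(Strip())
print(chain('  229,90 '))  # 229.90
```

Each class only implements filter(); the chaining itself lives in __call__, which is why filters can be nested to any depth.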
- exception FilterError[source]¶
Bases:
ParseError
- exception RegexpError[source]¶
Bases:
FilterError
- exception FormatError[source]¶
Bases:
FilterError
- class AsyncLoad(selector=None, default=_NO_DEFAULT)[source]¶
Bases:
Filter
Load a page asynchronously for later use.
Often used in combination with the Async filter.
- class Async(name, selector=None)[source]¶
Bases:
Filter
Selector that uses another page fetched earlier.
Often used in combination with the AsyncLoad filter. Requires that the other page’s URL is matched with a Page by the Browser.
Example:
class item(ItemElement):
    load_details = Field('url') & AsyncLoad
    obj_description = Async('details') & CleanText('//h3')
- class Base(base, selector=None, default=_NO_DEFAULT)[source]¶
Bases:
Filter
Change the base element used in filters.
>>> Base(Env('header'), CleanText('./h1'))
- class Decode(selector=None, default=_NO_DEFAULT)[source]¶
Bases:
Filter
Filter that decodes URL-encoded strings.
>>> Decode(Env('_id'))
<woob.browser.filters.standard.Decode object at 0x...>
>>> from .html import Link
>>> Decode(Link('./a'))
<woob.browser.filters.standard.Decode object at 0x...>
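To illustrate what decoding a URL-encoded string means, here is a standalone sketch built on the standard library. It assumes percent-escapes encode UTF-8 text; decode_urlencoded is a hypothetical helper name, and this page does not state how Decode is implemented internally.

```python
from urllib.parse import unquote


def decode_urlencoded(text):
    """Turn percent-escapes such as %20 or %C3%A9 back into characters."""
    return unquote(text)


print(decode_urlencoded('caf%C3%A9%20noir'))  # café noir
```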
- class Env(name, default=_NO_DEFAULT)[source]¶
Bases:
_Filter
Filter to get a value from the environment of the item.
It is used for example to get page parameters, or when there is a parse() method on ItemElement.
- class RawText(selector=None, children=False, default=_NO_DEFAULT)[source]¶
Bases:
Filter
Get raw text from an element.
Unlike CleanText, whitespace is kept as is.
- class CleanText(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)[source]¶
Bases:
Filter
Get a cleaned text from an element.
It first replaces all tabs and runs of spaces (including newlines if newlines is True) with a single space and strips the resulting string.
The result is coerced into str, and optionally normalized according to the normalize argument.
Then it removes all symbols given in the symbols argument.
>>> CleanText().filter('coucou ') == u'coucou'
True
>>> CleanText().filter(u'coucou coucou') == u'coucou coucou'
True
>>> CleanText(newlines=True).filter(u'coucou\r\n coucou ') == u'coucou coucou'
True
>>> CleanText(newlines=False).filter(u'coucou\r\n coucou ') == u'coucou\ncoucou'
True
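The cleaning steps above can be approximated in plain Python. This is a sketch of the described behaviour only, not the library's code: clean_text is a hypothetical function, element handling (children) and transliteration are left out, and treating each line separately when newlines is False is an assumption chosen to match the doctest results.

```python
import re
import unicodedata


def clean_text(txt, symbols='', newlines=True, normalize='NFC'):
    """Collapse whitespace, strip, normalize, then drop unwanted symbols."""
    if newlines:
        # Any run of whitespace, newlines included, becomes one space
        txt = re.sub(r'\s+', ' ', txt)
    else:
        # Keep line breaks, but clean the content of every line
        txt = '\n'.join(clean_text(line) for line in txt.splitlines())
    txt = txt.strip()
    if normalize:
        txt = unicodedata.normalize(normalize, txt)
    for symbol in symbols:
        txt = txt.replace(symbol, '')
    return txt.strip()


print(repr(clean_text('coucou\r\n  coucou  ')))                  # 'coucou coucou'
print(repr(clean_text('coucou\r\n  coucou  ', newlines=False)))  # 'coucou\ncoucou'
```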
- class Lower(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)[source]¶
Bases:
CleanText
Extract text with CleanText and convert to lower-case.
- class Upper(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)[source]¶
Bases:
CleanText
Extract text with CleanText and convert to upper-case.
- class Title(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)[source]¶
Bases:
CleanText
Extract text with CleanText and apply title() to it.
- class Currency(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)[source]¶
Bases:
CleanText
- exception NumberFormatError[source]¶
Bases:
FormatError, InvalidOperation
- class CleanDecimal(selector=None, replace_dots=False, sign=None, legacy=True, default=_NO_DEFAULT)[source]¶
Bases:
CleanText
Get a cleaned Decimal value from an element.
replace_dots is False by default, in which case a dot is interpreted as the decimal separator.
If replace_dots is True, all dots are removed and ',' is used as the decimal separator (often useful for French values).
If replace_dots is a tuple, the first element will be used as the thousands separator, and the second as the decimal separator.
See https://en.wikipedia.org/wiki/Thousands_separator#Examples_of_use
For example, for the UK style (as in 1,234,567.89):
>>> CleanDecimal('./td[1]', replace_dots=(',', '.'))
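The three forms of replace_dots can be sketched without woob as follows. parse_decimal is a hypothetical helper that mirrors only the separator rules described above; the sign and legacy arguments are not modelled, and dropping every other non-numeric character is an assumption of this sketch.

```python
import re
from decimal import Decimal


def parse_decimal(text, replace_dots=False):
    """Parse a number using the separator rules described for replace_dots."""
    if isinstance(replace_dots, tuple):
        thousands, decimal_sep = replace_dots   # explicit (thousands, decimal)
    elif replace_dots:
        thousands, decimal_sep = '.', ','       # French style: 1.234,56
    else:
        thousands, decimal_sep = None, '.'      # default: dot is the decimal mark
    if thousands:
        text = text.replace(thousands, '')
    text = text.replace(decimal_sep, '.')
    # Assumption: anything that is not a digit, dot or minus sign is noise
    text = re.sub(r'[^\d.\-]', '', text)
    return Decimal(text)


print(parse_decimal('blah: 229,90', replace_dots=True))       # 229.90
print(parse_decimal('1,234,567.89', replace_dots=(',', '.')))  # 1234567.89
```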
- class Type(selector=None, type=None, minlen=0, default=_NO_DEFAULT)[source]¶
Bases:
Filter
Get a cleaned value of any type from an element text. The callable passed as type can be any callable (class, function, etc.). By default an empty string is not parsed; this can be changed by specifying minlen=False. Otherwise, a minimal length can be specified.
>>> Type(CleanText('./td[1]'), type=int)
>>> Type(type=int).filter(42)
42
>>> Type(type=int).filter('42')
42
>>> Type(type=int, default='NaN').filter('')
'NaN'
>>> Type(type=list, minlen=False, default=list('ab')).filter('')
[]
>>> Type(type=list, minlen=0, default=list('ab')).filter('')
['a', 'b']
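The difference between minlen=0 and minlen=False in the doctests is easy to miss, since 0 and False compare equal in Python. The sketch below reproduces those results with a hypothetical coerce helper; the exact comparison (input length less than or equal to minlen) is an assumption inferred from the examples, not taken from the library.

```python
_MISSING = object()  # sentinel meaning "no default supplied"


def coerce(value, type_func, minlen=0, default=_MISSING):
    """Convert value with type_func unless it is too short to parse."""
    if isinstance(value, type_func):
        return value  # already of the requested type
    # 'is not False' matters here: minlen=0 enables the check, False disables it
    too_short = minlen is not False and len(value) <= minlen
    if too_short:
        if default is _MISSING:
            raise ValueError('Unable to parse %r' % (value,))
        return default
    return type_func(value)


print(coerce('42', int))                                     # 42
print(coerce('', list, minlen=False, default=list('ab')))    # []
print(coerce('', list, minlen=0, default=list('ab')))        # ['a', 'b']
```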
- class Field(name)[source]¶
Bases:
_Filter
Get the value of an attribute of the object.
Example:
obj_foo = CleanText('//h1')
obj_bar = Field('foo')
will make the “bar” field equal to the “foo” field.
- class Regexp(selector=None, pattern=None, template=None, nth=0, flags=0, default=_NO_DEFAULT)[source]¶
Bases:
Filter
Apply a regex.
>>> from lxml.html import etree
>>> doc = etree.fromstring('<html><body><p>Date: <span>13/08/1988</span></p></body></html>')
>>> Regexp(CleanText('//p'), r'Date: (\d+)/(\d+)/(\d+)', '\\3-\\2-\\1')(doc) == u'1988-08-13'
True
>>> (Regexp(CleanText('//body'), r'(\d+)', nth=1))(doc) == u'08'
True
>>> (Regexp(CleanText('//body'), r'(\d+)', nth=-1))(doc) == u'1988'
True
>>> (Regexp(CleanText('//body'), r'(\d+)', template='[\\1]', nth='*'))(doc) == [u'[13]', u'[08]', u'[1988]']
True
>>> (Regexp(CleanText('//body'), r'Date:.*'))(doc) == u'Date: 13/08/1988'
True
>>> (Regexp(CleanText('//body'), r'^(?!Date:).*', default=None))(doc)
- filter(txt)[source]¶
- Raises:
RegexpError
if pattern was not found
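The roles of template and nth in the doctests can be reproduced with the standard re module. The regexp function below is a hypothetical standalone sketch, not the class itself: it raises ValueError where the filter raises RegexpError, and returning the whole match when the pattern has no group is an assumption.

```python
import re


def regexp(text, pattern, template=None, nth=0, flags=0):
    """Return the nth match of pattern in text, optionally via a template."""

    def expand(match):
        if template is not None:
            return match.expand(template)  # e.g. r'\3-\2-\1'
        # Assumption: first group if the pattern has one, else the whole match
        return match.group(1) if match.re.groups else match.group(0)

    matches = list(re.finditer(pattern, text, flags))
    if not matches:
        raise ValueError('%r not found in %r' % (pattern, text))
    if nth == '*':
        return [expand(m) for m in matches]  # every match, in order
    try:
        return expand(matches[nth])          # negative nth counts from the end
    except IndexError:
        raise ValueError('no match number %r' % (nth,)) from None


text = 'Date: 13/08/1988'
print(regexp(text, r'Date: (\d+)/(\d+)/(\d+)', r'\3-\2-\1'))  # 1988-08-13
print(regexp(text, r'(\d+)', nth=-1))                          # 1988
print(regexp(text, r'(\d+)', template=r'[\1]', nth='*'))       # ['[13]', '[08]', '[1988]']
```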
- class Map(selector, map_dict, default=_NO_DEFAULT)[source]¶
Bases:
Filter
Map selected value to another value using a dict.
Example:
TYPES = {
    'Concert': CATEGORIES.CONCERT,
    'Cinéma': CATEGORIES.CINE,
}
obj_type = Map(CleanText('./li'), TYPES)
- class MapIn(selector, map_dict, default=_NO_DEFAULT)[source]¶
Bases:
Filter
Map the pattern of a selected value to another value using a dict.
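The contrast between Map and MapIn can be sketched with two small functions. Both are hypothetical helpers, and the reading of MapIn used here (a key matches when it occurs as a substring of the selected value) is an assumption about what "the pattern of a selected value" means; the string values stand in for the CATEGORIES constants of the example.

```python
_MISSING = object()  # sentinel meaning "no default supplied"


def map_value(value, mapping, default=_MISSING):
    """Exact lookup: the selected value must be a key of the dict."""
    if value in mapping:
        return mapping[value]
    if default is _MISSING:
        raise KeyError(value)
    return default


def map_in(value, mapping, default=_MISSING):
    """Assumed MapIn behaviour: the first key contained in value wins."""
    for key, target in mapping.items():
        if key in value:
            return target
    if default is _MISSING:
        raise KeyError(value)
    return default


TYPES = {'Concert': 'CONCERT', 'Cinéma': 'CINE'}
print(map_value('Concert', TYPES))       # CONCERT
print(map_in('Concert de jazz', TYPES))  # CONCERT
```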
- class DateTime(selector=None, default=_NO_DEFAULT, translations=None, parse_func=parse_date, strict=True, tzinfo=None, **kwargs)[source]¶
Bases:
Filter
Parse date and time.
- class FromTimestamp(selector, millis=False, tz=None, default=_NO_DEFAULT)[source]¶
Bases:
Filter
Parse a timestamp into a datetime.
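A minimal sketch of the conversion, using only the standard library. from_timestamp is a hypothetical helper; it assumes millis means the input counts milliseconds since the Unix epoch and that tz is passed straight to datetime.fromtimestamp.

```python
from datetime import datetime, timezone


def from_timestamp(value, millis=False, tz=None):
    """Convert a Unix timestamp (seconds, or milliseconds) to a datetime."""
    seconds = float(value)
    if millis:
        seconds /= 1000
    # With tz=None the result is a naive datetime in local time
    return datetime.fromtimestamp(seconds, tz=tz)


print(from_timestamp('1500', millis=True, tz=timezone.utc))
```

Passing an explicit tz keeps the result independent of the machine's local timezone.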
- class Date(selector=None, default=_NO_DEFAULT, translations=None, parse_func=parse_date, strict=True, **kwargs)[source]¶
Bases:
DateTime
Parse date.
- class Time(selector=None, default=_NO_DEFAULT)[source]¶
Bases:
Filter
Parse time.
- kwargs = {'hour': 'hh', 'minute': 'mm', 'second': 'ss'}¶
- class Duration(selector=None, default=_NO_DEFAULT)[source]¶
Bases:
Time
Parse a duration as timedelta.
- kwargs = {'hours': 'hh', 'minutes': 'mm', 'seconds': 'ss'}¶
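The two kwargs attributes map constructor keywords to named regular-expression groups, which suggests how Time and Duration can share one parsing routine. The sketch below illustrates that idea under stated assumptions: PATTERN and parse are hypothetical, the real pattern is not shown on this page, and only an hh:mm[:ss] layout is handled.

```python
import re
from datetime import time, timedelta

# Assumed input layout: hh:mm with optional :ss
PATTERN = re.compile(r'(?P<hh>\d+):(?P<mm>\d+)(?::(?P<ss>\d+))?')

TIME_KWARGS = {'hour': 'hh', 'minute': 'mm', 'second': 'ss'}
DURATION_KWARGS = {'hours': 'hh', 'minutes': 'mm', 'seconds': 'ss'}


def parse(text, klass, kwargs):
    """Build klass(**values) from the named groups listed in kwargs."""
    match = PATTERN.search(text)
    if match is None:
        raise ValueError('Unable to parse %r' % (text,))
    # A group that did not take part in the match counts as 0
    values = {name: int(match.group(group) or 0) for name, group in kwargs.items()}
    return klass(**values)


print(parse('12:34:56', time, TIME_KWARGS))      # 12:34:56
print(parse('1:30', timedelta, DURATION_KWARGS))  # 1:30:00
```

Only the target class and the keyword names differ between the two calls, which is the role the kwargs attribute plays in the listing above.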
- class CombineDate(date, time)[source]¶
Bases:
MultiFilter
Combine separate Date and Time filters into a single datetime.
- class Format(fmt, *args)[source]¶
Bases:
MultiFilter
Combine multiple filters with string-format.
Example:
obj_title = Format('%s (%s)', CleanText('//h1'), CleanText('//h2'))
will concatenate the text from all <h1> and all <h2> (but put the latter between parentheses).
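Once each sub-filter has produced its value, the combination step amounts to printf-style formatting. A minimal sketch, with format_values as a hypothetical stand-in and plain strings in place of the CleanText results:

```python
def format_values(fmt, *values):
    """Apply a %-style format string to already-extracted values."""
    return fmt % values


print(format_values('%s (%s)', 'Main title', 'Subtitle'))  # Main title (Subtitle)
```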
- class BrowserURL(url_name, **kwargs)[source]¶
Bases:
MultiFilter
Format a URL using names defined in the parent Browser.
This filter builds a URL from a URL pattern defined on the browser instance of this page.
class MyBrowser:
    mypage = URL(r'(?P<category>\w+)/(?P<id>\w+)')

class OnePage(Page):
    class item(ItemElement):
        obj_myfield = BrowserURL('mypage', id=Dict('id'), category=Dict('category'))
- class Join(pattern, selector=None, textCleaner=CleanText, newline=False, addBefore='', addAfter='', default=_NO_DEFAULT)[source]¶
Bases:
Filter
Join multiple results from a selector.
>>> Join(' - ', '//div/p') # doctest: +SKIP
>>> Join(pattern=', ').filter([u"Oui", u"bonjour", ""]) == u"Oui, bonjour"
True
>>> Join(pattern='-').filter([u"Au", u"revoir", ""]) == u"Au-revoir"
True
>>> Join(pattern='-').filter([]) == u""
True
>>> Join(pattern='-', default=u'empty').filter([]) == u'empty'
True
- class MultiJoin(*args, **kwargs)[source]¶
Bases:
MultiFilter
Join multiple filters.
>>> MultiJoin(Field('field1'), Field('field2')) # doctest: +SKIP
>>> MultiJoin(pattern=u', ').filter([u"Oui", u"bonjour", ""]) == u"Oui, bonjour"
True
>>> MultiJoin(pattern=u'-').filter([u"Au", u"revoir", ""]) == u"Au-revoir"
True
>>> MultiJoin(pattern=u'-').filter([]) == u""
True
>>> MultiJoin(pattern=u'-', default=u'empty').filter([]) == u'empty'
True
>>> MultiJoin(pattern=u'-').filter([1, 2, 3]) == u'1-2-3'
True
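The doctest results for Join and MultiJoin share one rule: empty strings are dropped, the remaining values are converted to text, and default is used only when nothing is left. The sketch below reproduces that rule with a hypothetical join_values helper; whether the library also drops other falsy values is not stated here, so only empty strings and None are skipped.

```python
_MISSING = object()  # sentinel meaning "no default supplied"


def join_values(pattern, values, default=_MISSING):
    """Join the non-empty values with pattern, falling back to default."""
    parts = [str(v) for v in values if v is not None and v != '']
    if not parts and default is not _MISSING:
        return default
    return pattern.join(parts)


print(join_values(', ', ['Oui', 'bonjour', '']))  # Oui, bonjour
print(join_values('-', [1, 2, 3]))                # 1-2-3
```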
- class Eval(func, *args)[source]¶
Bases:
MultiFilter
Evaluate a function with given ‘deferred’ arguments.
>>> F = Field; Eval(lambda a, b, c: a * b + c, F('foo'), F('bar'), F('baz'))
>>> Eval(lambda x, y: x * y + 1).filter([3, 7])
22
Example:
obj_ratio = Eval(lambda x: x / 100, Env('percentage'))
- class QueryValue(selector, key, default=_NO_DEFAULT)[source]¶
Bases:
Filter
Extract the value of a parameter from a URL with a query string.
>>> from lxml.html import etree
>>> from .html import Link
>>> f = QueryValue(Link('//a'), 'id')
>>> f(etree.fromstring('<html><body><a href="https://example.org/view?id=1234"></a></body></html>')) == u'1234'
True
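The extraction itself can be sketched with urllib.parse. query_value is a hypothetical standalone helper; returning the first value when a parameter is repeated, and raising KeyError when no default is given, are assumptions of this sketch.

```python
from urllib.parse import parse_qs, urlparse

_MISSING = object()  # sentinel meaning "no default supplied"


def query_value(url, key, default=_MISSING):
    """Return the first value of query parameter key in url."""
    params = parse_qs(urlparse(url).query)
    if key in params:
        return params[key][0]
    if default is _MISSING:
        raise KeyError(key)
    return default


print(query_value('https://example.org/view?id=1234', 'id'))  # 1234
```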
- class Coalesce(*args, **kwargs)[source]¶
Bases:
MultiFilter
Returns the first value that is not falsy, or default if all values are falsy.
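A plain-Python sketch of the selection rule, with coalesce as a hypothetical helper. The filter's behaviour when no default is supplied is not described here, so this sketch simply returns None in that case.

```python
def coalesce(*values, default=None):
    """Return the first truthy value, or default when all are falsy."""
    for value in values:
        if value:
            return value
    return default


print(coalesce('', None, 'fallback title', 'ignored'))  # fallback title
print(coalesce(0, '', default='n/a'))                    # n/a
```

Because the test is truthiness, values such as 0 or an empty list are skipped just like None.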
- class CountryCode(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)[source]¶
Bases:
CleanText
Filter to get the country ISO 3166-1 alpha-2 code from the country name
- filter(txt)[source]¶
Get the country code from the name of the country
- Parameters:
txt (str) – Country name
- Raises:
FormatError – if the Country name is not found
>>> CountryCode().filter('france')
'fr'
>>> CountryCode(default='ll').filter('Greez')
'll'