woob.tools.tokenizer

class ReTokenizer(text, sep, lex)

Bases: object

Simple regex-based tokenizer (also known as a lexer or lexical analyser). Useful for parsing PDF statements.

  1. A lexing table consists of (type, regex) tuples.

  2. The lexer splits the text into chunks using the separator character.

  3. Each text chunk is matched against the regexes in order, and the first successful match defines the type of the token.
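
The three steps above can be sketched roughly as follows. This is a minimal standalone illustration, not woob's actual implementation: the function tokenize(), the table LEX, its regexes, and the sample input are all hypothetical, and the real ReTokenizer may represent tokens differently.

```python
import re

# Hypothetical lexing table: (type name, regex) tuples, tried in order.
# The names and patterns are assumptions made up for this sketch.
LEX = [
    ("date", r"^\d{2}/\d{2}/\d{4}$"),
    ("amount", r"^-?\d+\.\d{2}$"),
    ("text", r"^.+$"),
]


def tokenize(text, sep, lex):
    """Split text on sep and tag each chunk with the first matching type."""
    tokens = []
    for chunk in text.split(sep):
        token_type = None  # stays None if no regex in the table matches
        for name, pattern in lex:
            if re.match(pattern, chunk):
                token_type = name
                break  # first successful match wins
        tokens.append((token_type, chunk))
    return tokens


tokens = tokenize("01/02/2023|COFFEE SHOP|-4.50", "|", LEX)
for token_type, value in tokens:
    print(token_type, value)
```

Because the regexes are tried in order, the catch-all "text" entry is placed last; moving it first would make every non-empty chunk a "text" token.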

See the test() function below for examples.

tok(index)
simple_read(token_type, pos, transform=lambda v: ...)