woob.tools.tokenizer
- class ReTokenizer(text, sep, lex)
- Bases: object
- Simple regex-based tokenizer (also known as a lexer or lexical analyser). Useful for parsing PDF statements.
- There is a lexing table consisting of (type, regex) tuples.
- The lexer splits the text into chunks using the separator character.
- Each text chunk is matched against the regexes in sequence, and the first successful match defines the type of the token.
- See the test() function below for examples.
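
The three steps described above can be illustrated with a minimal standalone sketch. It does not use ReTokenizer itself and is not its implementation: the `LEX` table, the `tokenize` helper, and the sample input are hypothetical names chosen for illustration, and leaving the type as `None` for chunks that match no regex is an assumption of this sketch.

```python
import re

# Hypothetical lexing table: (type, regex) tuples, tried in order.
LEX = [
    ("date", r"^\d{2}/\d{2}/\d{4}$"),
    ("amount", r"^-?\d+\.\d{2}$"),
    ("word", r"^\w+$"),
]


def tokenize(text, sep, lex):
    """Split text on sep and tag each chunk with the first matching type."""
    tokens = []
    for chunk in text.split(sep):
        for name, pattern in lex:
            if re.match(pattern, chunk):
                # The first successful match defines the token type.
                tokens.append((name, chunk))
                break
        else:
            # Assumption of this sketch: unmatched chunks get type None.
            tokens.append((None, chunk))
    return tokens


tokens = tokenize("01/02/2020 coffee -3.50 ???", " ", LEX)
for name, chunk in tokens:
    print(name, chunk)
```

Because the regexes are tried in table order, more specific patterns (here `date` and `amount`) should come before general ones such as `word`; otherwise the general pattern would claim chunks it was not meant for.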
