woob.tools.pdf

decompress_pdf(inpdf)

Takes PDF file contents as a string and returns a decompressed version of the file contents, suitable for text parsing.

External dependencies: MuPDF (https://www.mupdf.com).

Return type:

bytes
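
As a rough illustration of why decompression helps text parsing: PDF content streams are commonly stored deflate-compressed (the FlateDecode filter), so text operators cannot be read until the stream is inflated. The sketch below uses only Python's `zlib` on a made-up content stream; it does not call `decompress_pdf` or MuPDF, and the stream contents are an assumption chosen for the example.

```python
import zlib

# Hypothetical content stream: draws the text "Hello" with the Tj operator.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"

# What a PDF writer would typically store when using the FlateDecode filter.
compressed = zlib.compress(content)

# Inflating the stream brings back the plain operators, which a text
# parser can then search for string operands such as "(Hello)".
restored = zlib.decompress(compressed)

print(restored.decode("ascii"))
```

A real PDF contains many such streams plus cross-reference data, which is why the function relies on an external tool rather than inflating streams by hand.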

get_pdf_rows(data, miner_layout=True)

Takes PDF file content as a string and yields table row data for each page.

For each page in the PDF, the function yields a list of rows. Each row is a list of cells. Each cell is a list of strings present in the cell. Note that the rows may belong to different tables.
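
The nesting described above (pages, then rows, then cells, then strings) can be sketched with hand-written sample data. The values and the `flatten_rows` helper below are hypothetical and only mirror the documented shape; in real use the iterable would presumably come from `get_pdf_rows(data)`.

```python
# Hypothetical sample shaped like the documented output:
# pages -> rows -> cells -> strings.
sample_pages = [
    [  # page 1
        [["Date"], ["Label"], ["Amount"]],
        [["01/02"], ["Card payment", "Shop A"], ["-12.50"]],
    ],
    [  # page 2
        [["02/02"], ["Transfer"], ["100.00"]],
    ],
]


def flatten_rows(pages):
    """Yield (page_number, row), joining each cell's strings with a space."""
    for page_number, rows in enumerate(pages, start=1):
        for row in rows:
            yield page_number, [" ".join(cell) for cell in row]


flat = list(flatten_rows(sample_pages))
for page_number, row in flat:
    print(page_number, row)
```

Because rows from different tables may appear on the same page, callers usually need their own check (for example on the number of cells) to decide which rows belong together.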

The PDF format has no logical tables, so this function parses the PDF drawing instructions, tries to find rectangles and arrange them into rows, and then places the text inside those rectangles.

External dependencies: PDFMiner (https://github.com/euske/pdfminer).
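
The rectangle-and-row idea described above can be illustrated with a deliberately simplified sketch. This is not the library's implementation: the coordinate tuples, the `arrange` helper, and the exact-match row grouping are all assumptions made for the example, and real PDFs would need tolerances for slightly misaligned rectangles.

```python
from collections import defaultdict

# Hypothetical inputs: rectangles as (x0, y0, x1, y1), text as (x, y, string).
rects = [
    (0, 100, 50, 120), (50, 100, 120, 120),  # upper row
    (0, 80, 50, 100), (50, 80, 120, 100),    # lower row
]
texts = [(5, 110, "Date"), (60, 110, "Label"), (5, 90, "01/02"), (60, 90, "Shop A")]


def arrange(rects, texts):
    # Rectangles sharing the same vertical span are treated as one row.
    rows = defaultdict(list)
    for rect in rects:
        rows[(rect[1], rect[3])].append(rect)

    table = []
    # PDF y coordinates grow upward, so descending order reads top to bottom.
    for span in sorted(rows, reverse=True):
        row = []
        for x0, y0, x1, y1 in sorted(rows[span]):  # left to right
            # A cell collects every string whose anchor point lies in the rectangle.
            row.append([s for x, y, s in texts if x0 <= x < x1 and y0 <= y < y1])
        table.append(row)
    return table


table = arrange(rects, texts)
print(table)
```

The result has the same rows-of-cells-of-strings shape that the function yields for each page.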