reading pdf file in python

Python has several libraries for reading PDF files. Popular options include:

  1. pdfminer: A library for extracting information from PDF documents, including text, images, and other data. The actively maintained fork is published as pdfminer.six.

  2. PyPDF2: A library for manipulating PDF documents. It can merge, split, crop, and transform PDF files, and it can also extract text and images. Development has since continued under the name pypdf.

  3. tabula-py: A library for extracting tables from PDF files. It is a wrapper around the Java library Tabula, which performs the actual PDF parsing, so a Java runtime must be installed.

  4. pdfplumber: A library for extracting text, tables, and other data from PDF documents. It is built on top of pdfminer.six.
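As a rough illustration of the table-extraction use case mentioned for pdfplumber above, the sketch below collects every table that pdfplumber detects and converts one of them to CSV text. The function names `extract_tables` and `table_to_csv` are made up for this example, and it assumes pdfplumber is installed (`pip install pdfplumber`). Table detection depends on the PDF layout, so the results may need cleanup.

```python
import csv
import io


def extract_tables(path):
    """Return all tables pdfplumber detects in the PDF, as lists of rows."""
    # Imported here so table_to_csv can be used without pdfplumber installed.
    import pdfplumber

    tables = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # extract_tables() returns one list of rows per detected table.
            tables.extend(page.extract_tables())
    return tables


def table_to_csv(table):
    """Convert one table (a list of rows) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # pdfplumber reports empty cells as None; write them as empty strings.
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()
```

For example, `table_to_csv(extract_tables("report.pdf")[0])` would give the first detected table as CSV, where `report.pdf` is a placeholder path.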

Here is an example using the pdfminer library to extract text from a PDF file:

main.py
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def pdf_to_text(path):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(
        resource_manager,
        fake_file_handle,
        codec='utf-8',
        laparams=LAParams()
    )
    page_interpreter = PDFPageInterpreter(resource_manager, converter)
 
    with open(path, 'rb') as fh:
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            page_interpreter.process_page(page)
 
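    # Collect the extracted text once all pages have been processed
    text = fake_file_handle.getvalue()

    # Close open handles
    converter.close()
    fake_file_handle.close()

    # Always return a string; it is empty if no extractable text was found
    return text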

Call pdf_to_text with the path to a PDF file to get the text extracted from it.
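If the lower-level classes are not needed, pdfminer.six also provides a high-level `extract_text` function that does the same work in one call. The sketch below is a minimal alternative, assuming pdfminer.six is installed (`pip install pdfminer.six`). The names `pdf_to_text_simple` and `tidy_text` are hypothetical helpers for this example, not part of the library.

```python
import re


def pdf_to_text_simple(path, page_numbers=None):
    """Extract text via pdfminer.six's high-level API.

    page_numbers is an optional collection of zero-based page indexes;
    None means all pages.
    """
    # Imported here so tidy_text can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return extract_text(path, page_numbers=page_numbers)


def tidy_text(text):
    """Strip trailing whitespace per line and collapse runs of blank lines."""
    lines = [line.rstrip() for line in text.splitlines()]
    joined = "\n".join(lines)
    # Reduce three or more consecutive newlines to a single blank line.
    return re.sub(r"\n{3,}", "\n\n", joined).strip()
```

A call such as `tidy_text(pdf_to_text_simple("example.pdf", page_numbers=[0]))` would return the cleaned text of the first page, where `example.pdf` is a placeholder path.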
