How to read a PDF file line by line using Python
Python can read PDF files and print their content after extracting the text. To do this, we first have to install the required module, PyPDF2. The command to install the module is below. You should already have pip installed in your Python environment.
- pip install pypdf2
Once the module is installed, we can read PDF files using the methods it provides.
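One way to confirm that the installation worked before going further is to ask Python whether it can locate the module. The sketch below uses the standard library's importlib; the helper name is_installed is a hypothetical name chosen for illustration, not part of PyPDF2.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


# The import name is 'PyPDF2', even though pip accepts the lowercase 'pypdf2'.
print("PyPDF2 available:", is_installed("PyPDF2"))
```

If this prints False, the install command most likely ran against a different Python environment than the one running the script.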
Reading Single Page
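- import PyPDF2
- pdfName = r'path\xyz.pdf'  # raw string, so the backslash is not read as an escape
- read_pdf = PyPDF2.PdfReader(pdfName)  # PdfFileReader was removed in PyPDF2 3.0
- page = read_pdf.pages[0]  # first page (the index is zero-based)
- page_content = page.extract_text()  # replaces the older extractText()
- print(page_content)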
When we run the above program, it prints the text extracted from the first page of the PDF.
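The extracted page content comes back as a single string, so reading it line by line takes one more step: splitting that string on line breaks. The following is a minimal sketch of that step, assuming the PyPDF2 3.x names PdfReader, pages and extract_text(); the helper names iter_lines and read_page_lines are hypothetical. Note that the line breaks in extracted text depend on how the PDF was produced and may not match the lines you see on the page.

```python
def iter_lines(text):
    """Yield the non-empty lines of an extracted text string, stripped of whitespace."""
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line:
            yield line


def read_page_lines(pdf_path, page_index=0):
    """Return the non-empty lines of one page as a list (hypothetical helper)."""
    # Imported here so iter_lines() can be used without PyPDF2 installed.
    import PyPDF2
    reader = PyPDF2.PdfReader(pdf_path)
    # Fall back to an empty string if no text could be extracted.
    text = reader.pages[page_index].extract_text() or ""
    return list(iter_lines(text))


# Example usage (the path is a placeholder):
# for number, line in enumerate(read_page_lines(r'path\xyz.pdf'), start=1):
#     print(number, line)
```

Keeping the splitting logic separate from the file access makes it easy to try on any string before pointing it at a real PDF.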
Reading Multiple Pages
To read a PDF with multiple pages and print each page with its page number, we use a loop together with the reader's page-number lookup (getPageNumber() in older PyPDF2 releases, get_page_number() from version 3.0 onward). In the example below, the PDF file has two pages. The contents are printed under two separate page headings.
- import PyPDF2
- pdfName = r'Path\xyz2.pdf'  # raw string keeps the backslash literal
- read_pdf = PyPDF2.PdfReader(pdfName)
- for i in range(len(read_pdf.pages)):
-     page = read_pdf.pages[i]
-     # get_page_number() is zero-based, so add 1 for display
-     print('Page No - ' + str(1 + read_pdf.get_page_number(page)))
-     page_content = page.extract_text()
-     print(page_content)
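The multi-page loop can also be combined with line splitting, so that every line carries both its page number and its position on the page. This is a rough sketch under the same assumption of the PyPDF2 3.x API; number_lines and read_pdf_lines are hypothetical helper names, and blank lines are skipped by choice.

```python
def number_lines(page_texts):
    """Yield (page_no, line_no, line) for every non-empty line; both counters start at 1."""
    for page_no, text in enumerate(page_texts, start=1):
        line_no = 0
        for raw_line in text.splitlines():
            line = raw_line.strip()
            if not line:
                continue  # skip blank lines left over from extraction
            line_no += 1
            yield page_no, line_no, line


def read_pdf_lines(pdf_path):
    """Extract every page of the PDF and return its numbered lines as a list."""
    # Imported lazily so number_lines() works without PyPDF2 installed.
    import PyPDF2
    reader = PyPDF2.PdfReader(pdf_path)
    page_texts = [page.extract_text() or "" for page in reader.pages]
    return list(number_lines(page_texts))


# Example usage (the path is a placeholder):
# for page_no, line_no, line in read_pdf_lines(r'Path\xyz2.pdf'):
#     print('Page', page_no, 'line', line_no, ':', line)
```

The line counter restarts on each page, which keeps the numbering aligned with the page headings printed by the loop above.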
Thanks for reading, and as always, be sure to reach out with any questions! Follow Jayasimha Kv