How can I extract text from a PDF file?

Question: How can I extract text from a PDF file?

I’m trying to extract the text included in this PDF file using Python.

I’m using the PyPDF2 module, and have the following script:

import PyPDF2
pdf_file = open('sample.pdf')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()
print page_content

When I run the code, I get the following output which is different from that included in the PDF document:

!"#$%#$%&%$&'()*%+,-%./01'*23%4
5'%1$#26%3/%7/))/8%&)/26%8#3"%3"*%313/9#&)
%

How can I extract the text as is in the PDF document?


Answer 0

I was looking for a simple solution to use for Python 3.x and Windows. There doesn't seem to be support from textract, which is unfortunate, but if you are looking for a simple solution for Windows/Python 3, check out the tika package; it is really straightforward for reading PDFs.

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

from tika import parser # pip install tika

raw = parser.from_file('sample.pdf')
print(raw['content'])

Note that Tika is written in Java, so you will need a Java runtime installed.


Answer 1

Use textract.

It supports many types of files, including PDFs.

import textract
text = textract.process("path/to/file.extension")

Answer 2

Look at this code:

import PyPDF2
pdf_file = open('sample.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()
print page_content.encode('utf-8')

The output is:

!"#$%#$%&%$&'()*%+,-%./01'*23%4
5'%1$#26%3/%7/))/8%&)/26%8#3"%3"*%313/9#&)
%

Using the same code to read a PDF from 201308FCR.pdf, the output is normal.

Its documentation explains why:

def extractText(self):
    """
    Locate all text drawing commands, in the order they are provided in the
    content stream, and extract the text.  This works well for some PDF
    files, but poorly for others, depending on the generator used.  This will
    be refined in the future.  Do not rely on the order of text coming out of
    this function, as it will change if this function is made more
    sophisticated.
    :return: a unicode string object.
    """

Answer 3

After trying textract (which seemed to have too many dependencies), pypdf2 (which could not extract text from the pdfs I tested with), and tika (which was too slow), I ended up using pdftotext from xpdf (as already suggested in another answer) and just called the binary from Python directly (you may need to adapt the path to pdftotext):

import os, subprocess
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
args = ["/usr/local/bin/pdftotext",
        '-enc',
        'UTF-8',
        "{}/my-pdf.pdf".format(SCRIPT_DIR),
        '-']
res = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output = res.stdout.decode('utf-8')

There is also a pdftotext Python module which does basically the same thing, but it assumes pdftotext is in /usr/local/bin, whereas I am using this in AWS Lambda and wanted to call it from the current directory.

Btw: to use this on Lambda you need to put the binary and the libstdc++.so dependency into your Lambda function. I personally needed to compile xpdf. As instructions for this would blow up this answer, I put them on my personal blog.


Answer 4

You may want to use the time-proven xPDF and derived tools to extract text instead, as pyPDF2 still seems to have various issues with text extraction.

The long answer is that there is a lot of variation in how text is encoded inside a PDF: extracting it may require decoding the PDF strings themselves, then mapping them through a CMap, then analyzing the distance between words and letters, and so on.

In case the PDF is damaged (i.e. it displays the correct text but copying it gives garbage) and you really need to extract the text, then you may want to consider converting the PDF into an image (using ImageMagick) and then using Tesseract to get the text from the image via OCR.
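
A rough sketch of that OCR fallback, using pdf2image/poppler for the rasterization step rather than ImageMagick, and assuming the pdf2image and pytesseract packages are installed along with the poppler and tesseract binaries they wrap (sample.pdf is just a placeholder file name):

# Hedged sketch: render each PDF page to an image, then OCR it.
# Assumes: pip install pdf2image pytesseract, plus the poppler and tesseract
# system packages. 'sample.pdf' is a placeholder file name.
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path('sample.pdf', dpi=300)  # one PIL image per page
text = '\n'.join(pytesseract.image_to_string(page) for page in pages)
print(text)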


Answer 5

I've tried many Python PDF converters, and I would like to update this review. Tika is one of the best, but PyMuPDF is good news from user @ehsaneha.

I wrote some code to compare them here: https://github.com/erfelipe/PDFtextExtraction I hope it helps you.

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

from tika import parser

raw = parser.from_file("///Users/Documents/Textos/Texto1.pdf")
raw = str(raw)

safe_text = raw.encode('utf-8', errors='ignore')

safe_text = str(safe_text).replace("\n", "").replace("\\", "")
print('--- safe text ---' )
print( safe_text )

Answer 6

The code below is a solution to the question in Python 3. Before running it, make sure the PyPDF2 library is installed in your environment. If it is not, open a command prompt and run the following command:

pip3 install PyPDF2

Solution Code:

import PyPDF2
pdfFileObject = open('sample.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    print(page.extractText())

Answer 7

PyPDF2 in some cases ignores white spaces and makes the resulting text a mess, but I use PyMuPDF and I'm really satisfied with it; you can use this link for more info.
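
A minimal PyMuPDF sketch, assuming a reasonably recent version where the package is imported as fitz and pages expose get_text() (sample.pdf is a placeholder):

# Minimal PyMuPDF sketch (pip install PyMuPDF); 'sample.pdf' is a placeholder.
import fitz  # PyMuPDF is imported under the name fitz

doc = fitz.open('sample.pdf')
text = '\n'.join(page.get_text() for page in doc)  # get_text() in recent versions
doc.close()
print(text)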


Answer 8

A multi-page PDF can be extracted as text in a single stretch, instead of passing individual page numbers as arguments, using the code below:

import PyPDF2
import collections
pdf_file = open('samples.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
c = collections.Counter(range(number_of_pages))
for i in c:
   page = read_pdf.getPage(i)
   page_content = page.extractText()
   print page_content.encode('utf-8')

Answer 9

pdftotext is the best and simplest one! pdftotext also preserves the text structure.

I tried PyPDF2, PDFMiner and a few others but none of them gave a satisfactory result.


Answer 10

You can use pdftotext: https://github.com/jalan/pdftotext

pdftotext keeps the text format indentation; it doesn't matter if you have tables.
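
A minimal sketch with that library, assuming pip install pdftotext succeeded (it builds against poppler) and sample.pdf is a placeholder:

# Minimal sketch for the jalan/pdftotext library; 'sample.pdf' is a placeholder.
import pdftotext

with open('sample.pdf', 'rb') as f:
    pdf = pdftotext.PDF(f)  # behaves like a sequence with one string per page

print('\n\n'.join(pdf))  # join all pages into one string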


Answer 11

Here is the simplest code for extracting text.

Code:

# importing required modules
import PyPDF2

# creating a pdf file object
pdfFileObj = open('filename.pdf', 'rb')

# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# printing number of pages in pdf file
print(pdfReader.numPages)

# creating a page object
pageObj = pdfReader.getPage(5)

# extracting text from page
print(pageObj.extractText())

# closing the pdf file object
pdfFileObj.close()

Answer 12

I found a solution here: PDFLayoutTextStripper.

It’s good because it can keep the layout of the original PDF.

It’s written in Java but I have added a Gateway to support Python.

Sample code:

from py4j.java_gateway import JavaGateway

gw = JavaGateway()
result = gw.entry_point.strip('samples/bus.pdf')

# result is a dict of {
#   'success': 'true' or 'false',
#   'payload': pdf file content if 'success' is 'true'
#   'error': error message if 'success' is 'false'
# }

print result['payload']

Sample output from PDFLayoutTextStripper:

You can see more details here: Stripper with Python.


Answer 13

To extract text from a PDF, use the code below:

import PyPDF2
pdfFileObj = open('mypdf.pdf', 'rb')

pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

print(pdfReader.numPages)

pageObj = pdfReader.getPage(0)

a = pageObj.extractText()

print(a)

Answer 14

I've got a better workaround than OCR, and it maintains the page alignment while extracting the text from a PDF. It should be of help:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()


    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)


    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

text= convert_pdf_to_txt('test.pdf')
print(text)

Answer 15

I am adding code to accomplish this; it is working fine for me:

# This works in python 3
# required python packages
# tabula-py==1.0.0
# PyPDF2==1.26.0
# Pillow==4.0.0
# pdfminer.six==20170720

import os
import shutil
import struct
import warnings
from io import StringIO

import requests
import tabula
from PIL import Image
from PyPDF2 import PdfFileWriter, PdfFileReader
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

warnings.filterwarnings("ignore")


def download_file(url):
    local_filename = url.split('/')[-1]
    local_filename = local_filename.replace("%20", "_")
    r = requests.get(url, stream=True)
    print(r)
    with open(local_filename, 'wb') as f:
        shutil.copyfileobj(r.raw, f)

    return local_filename


class PDFExtractor():
    def __init__(self, url):
        self.url = url

    # Downloading File in local
    def break_pdf(self, filename, start_page=-1, end_page=-1):
        pdf_reader = PdfFileReader(open(filename, "rb"))
        # Reading each pdf one by one
        total_pages = pdf_reader.numPages
        if start_page == -1:
            start_page = 0
        elif start_page < 1 or start_page > total_pages:
            return "Start Page Selection Is Wrong"
        else:
            start_page = start_page - 1

        if end_page == -1:
            end_page = total_pages
        elif end_page < 1 or end_page > total_pages - 1:
            return "End Page Selection Is Wrong"
        else:
            end_page = end_page

        for i in range(start_page, end_page):
            output = PdfFileWriter()
            output.addPage(pdf_reader.getPage(i))
            with open(str(i + 1) + "_" + filename, "wb") as outputStream:
                output.write(outputStream)

    def extract_text_algo_1(self, file):
        pdf_reader = PdfFileReader(open(file, 'rb'))
        # creating a page object
        pageObj = pdf_reader.getPage(0)

        # extracting extract_text from page
        text = pageObj.extractText()
        text = text.replace("\n", "").replace("\t", "")
        return text

    def extract_text_algo_2(self, file):
        pdfResourceManager = PDFResourceManager()
        retstr = StringIO()
        la_params = LAParams()
        device = TextConverter(pdfResourceManager, retstr, codec='utf-8', laparams=la_params)
        fp = open(file, 'rb')
        interpreter = PDFPageInterpreter(pdfResourceManager, device)
        password = ""
        max_pages = 0
        caching = True
        page_num = set()

        for page in PDFPage.get_pages(fp, page_num, maxpages=max_pages, password=password, caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)

        text = retstr.getvalue()
        text = text.replace("\t", "").replace("\n", "")

        fp.close()
        device.close()
        retstr.close()
        return text

    def extract_text(self, file):
        text1 = self.extract_text_algo_1(file)
        text2 = self.extract_text_algo_2(file)

        if len(text2) > len(str(text1)):
            return text2
        else:
            return text1

    def extarct_table(self, file):

        # Read pdf into DataFrame
        try:
            df = tabula.read_pdf(file, output_format="csv")
        except:
            print("Error Reading Table")
            return

        print("\nPrinting Table Content: \n", df)
        print("\nDone Printing Table Content\n")

    def tiff_header_for_CCITT(self, width, height, img_size, CCITT_group=4):
        tiff_header_struct = '<' + '2s' + 'h' + 'l' + 'h' + 'hhll' * 8 + 'h'
        return struct.pack(tiff_header_struct,
                           b'II',  # Byte order indication: little endian
                           42,  # Version number (always 42)
                           8,  # Offset to first IFD
                           8,  # Number of tags in IFD
                           256, 4, 1, width,  # ImageWidth, LONG, 1, width
                           257, 4, 1, height,  # ImageLength, LONG, 1, length
                           258, 3, 1, 1,  # BitsPerSample, SHORT, 1, 1
                           259, 3, 1, CCITT_group,  # Compression, SHORT, 1, 4 = CCITT Group 4 fax encoding
                           262, 3, 1, 0,  # Thresholding, SHORT, 1, 0 = WhiteIsZero
                           273, 4, 1, struct.calcsize(tiff_header_struct),  # StripOffsets, LONG, 1, len of header
                           278, 4, 1, height,  # RowsPerStrip, LONG, 1, length
                           279, 4, 1, img_size,  # StripByteCounts, LONG, 1, size of image data
                           0  # last IFD
                           )

    def extract_image(self, filename):
        number = 1
        pdf_reader = PdfFileReader(open(filename, 'rb'))

        for i in range(0, pdf_reader.numPages):

            page = pdf_reader.getPage(i)

            try:
                xObject = page['/Resources']['/XObject'].getObject()
            except:
                print("No XObject Found")
                return

            for obj in xObject:

                try:

                    if xObject[obj]['/Subtype'] == '/Image':
                        size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
                        data = xObject[obj]._data
                        if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                            mode = "RGB"
                        else:
                            mode = "P"

                        image_name = filename.split(".")[0] + str(number)

                        print(xObject[obj]['/Filter'])

                        if xObject[obj]['/Filter'] == '/FlateDecode':
                            data = xObject[obj].getData()
                            img = Image.frombytes(mode, size, data)
                            img.save(image_name + "_Flate.png")
                            # save_to_s3(imagename + "_Flate.png")
                            print("Image_Saved")

                            number += 1
                        elif xObject[obj]['/Filter'] == '/DCTDecode':
                            img = open(image_name + "_DCT.jpg", "wb")
                            img.write(data)
                            # save_to_s3(imagename + "_DCT.jpg")
                            img.close()
                            number += 1
                        elif xObject[obj]['/Filter'] == '/JPXDecode':
                            img = open(image_name + "_JPX.jp2", "wb")
                            img.write(data)
                            # save_to_s3(imagename + "_JPX.jp2")
                            img.close()
                            number += 1
                        elif xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                            if xObject[obj]['/DecodeParms']['/K'] == -1:
                                CCITT_group = 4
                            else:
                                CCITT_group = 3
                            width = xObject[obj]['/Width']
                            height = xObject[obj]['/Height']
                            data = xObject[obj]._data  # sorry, getData() does not work for CCITTFaxDecode
                            img_size = len(data)
                            tiff_header = self.tiff_header_for_CCITT(width, height, img_size, CCITT_group)
                            img_name = image_name + '_CCITT.tiff'
                            with open(img_name, 'wb') as img_file:
                                img_file.write(tiff_header + data)

                            # save_to_s3(img_name)
                            number += 1
                except:
                    continue

        return number

    def read_pages(self, start_page=-1, end_page=-1):

        # Downloading file locally
        downloaded_file = download_file(self.url)
        print(downloaded_file)

        # breaking PDF into number of pages in diff pdf files
        self.break_pdf(downloaded_file, start_page, end_page)

        # creating a pdf reader object
        pdf_reader = PdfFileReader(open(downloaded_file, 'rb'))

        # Reading each pdf one by one
        total_pages = pdf_reader.numPages

        if start_page == -1:
            start_page = 0
        elif start_page < 1 or start_page > total_pages:
            return "Start Page Selection Is Wrong"
        else:
            start_page = start_page - 1

        if end_page == -1:
            end_page = total_pages
        elif end_page < 1 or end_page > total_pages - 1:
            return "End Page Selection Is Wrong"
        else:
            end_page = end_page

        for i in range(start_page, end_page):
            # creating a page based filename
            file = str(i + 1) + "_" + downloaded_file

            print("\nStarting to Read Page: ", i + 1, "\n -----------===-------------")

            file_text = self.extract_text(file)
            print(file_text)
            self.extract_image(file)

            self.extarct_table(file)
            os.remove(file)
            print("Stopped Reading Page: ", i + 1, "\n -----------===-------------")

        os.remove(downloaded_file)


# I have tested on these 3 pdf files
# url = "http://s3.amazonaws.com/NLP_Project/Original_Documents/Healthcare-January-2017.pdf"
url = "http://s3.amazonaws.com/NLP_Project/Original_Documents/Sample_Test.pdf"
# url = "http://s3.amazonaws.com/NLP_Project/Original_Documents/Sazerac_FS_2017_06_30%20Annual.pdf"
# creating the instance of class
pdf_extractor = PDFExtractor(url)

# Getting desired data out
pdf_extractor.read_pages(15, 23)

Answer 16

You can download tika-app-xxx.jar (latest) from here.

Then put this .jar file in the same folder as your Python script file.

Then insert the following code in the script:

import os
import os.path

tika_dir=os.path.join(os.path.dirname(__file__),'<tika-app-xxx>.jar')

def extract_pdf(source_pdf:str,target_txt:str):
    os.system('java -jar '+tika_dir+' -t {} > {}'.format(source_pdf,target_txt))
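
A hypothetical call to the helper above, where sample.pdf and sample.txt are placeholder file names:

# Hypothetical usage; 'sample.pdf' and 'sample.txt' are placeholder file names.
extract_pdf('sample.pdf', 'sample.txt')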

The advantages of this method:

Fewer dependencies. A single .jar file is easier to manage than a Python package.

Multi-format support. The source_pdf parameter can be the path of any kind of document (.doc, .html, .odt, etc.).

Up to date. tika-app.jar is always released earlier than the corresponding version of the tika Python package.

Stable. It is far more stable and well-maintained (powered by Apache) than PyPDF.

Disadvantage:

A headless JRE is necessary.


Answer 17

If you try it in Anaconda on Windows, PyPDF2 might not handle some PDFs with non-standard structure or Unicode characters. I recommend using the following code if you need to open and read a lot of PDF files: the text of all PDF files in the folder with relative path .//pdfs// will be stored in the list pdf_text_list.

from tika import parser
import glob

def read_pdf(filename):
    text = parser.from_file(filename)
    return(text)


all_files = glob.glob(".\\pdfs\\*.pdf")
pdf_text_list=[]
for i,file in enumerate(all_files):
    text=read_pdf(file)
    pdf_text_list.append(text['content'])

print(pdf_text_list)

Answer 18

PyPDF2 does work, but results may vary. I am seeing quite inconsistent results from its text extraction.

reader=PyPDF2.pdf.PdfFileReader(self._path)
eachPageText=[]
for i in range(0,reader.getNumPages()):
    pageText=reader.getPage(i).extractText()
    print(pageText)
    eachPageText.append(pageText)