从pdf提取页面作为jpeg

Question 1

在python代码中，如何有效地将pdf中的某个页面另存为jpeg文件？（用例：我有一个python flask网络服务器，将在其中上传pdf-s，并存储与每个页面相对应的jpeg-s。）

该解决方案已经结束，但是问题在于它不会将整个页面转换为jpeg。

Question 2

In python code, how to efficiently save a certain page in a pdf as a jpeg file? (Use case: I’ve a python flask web server where pdf-s will be uploaded and jpeg-s corresponding to each page is stores.)

This solution is close, but the problem is that it does not convert the entire page to jpeg.

Question 3

可以使用pdf2image库。

您可以使用以下方法简单地安装它：

pip install pdf2image

安装后，您可以使用以下代码获取图像。

from pdf2image import convert_from_path
pages = convert_from_path('pdf_file', 500)

以jpeg格式保存页面

for page in pages:
    page.save('out.jpg', 'JPEG')

编辑：Github repo pdf2image也提到它使用pdftoppm并且需要其他安装：

pdftoppm是执行实际操作的软件。它作为更大的软件包poppler的一部分分发。Windows用户必须为Windows安装poppler。Mac用户必须为Mac安装poppler。如果发行版未安装，则Linux用户将预先安装pdftoppm（已在Ubuntu和Archlinux上进行了测试），请运行sudo apt install poppler-utils。

您可以通过以下步骤使用anaconda在Windows下安装最新版本：

conda install -c conda-forge poppler

注意：http：//blog.alivate.com.au/poppler-windows/上提供的Windows版本最高为0.67，但请注意，0.68已于2018年8月发布，因此您将无法获得最新功能或错误修复。

Question 4

The pdf2image library can be used.

You can install it simply using,

pip install pdf2image

Once installed you can use following code to get images.

from pdf2image import convert_from_path
pages = convert_from_path('pdf_file', 500)

Saving pages in jpeg format

for page in pages:
    page.save('out.jpg', 'JPEG')

Edit: the Github repo pdf2image also mentions that it uses pdftoppm and that it requires other installations:

pdftoppm is the piece of software that does the actual magic. It is distributed as part of a greater package called poppler. Windows users will have to install poppler for Windows. Mac users will have to install poppler for Mac. Linux users will have pdftoppm pre-installed with the distro (Tested on Ubuntu and Archlinux) if it’s not, run sudo apt install poppler-utils.

You can install the latest version under Windows using anaconda by doing:

conda install -c conda-forge poppler

note: Windows versions upto 0.67 are available at http://blog.alivate.com.au/poppler-windows/ but note that 0.68 was released in Aug 2018 so you’ll not be getting the latest features or bug fixes.

Question 5

我发现这个简单的解决方案PyMuPDF可以输出到png文件。请注意，该库被导入为“ fitz”，这是它使用的渲染引擎的历史名称。

import fitz

pdffile = "infile.pdf"
doc = fitz.open(pdffile)
page = doc.loadPage(0)  # number of page
pix = page.getPixmap()
output = "outfile.png"
pix.writePNG(output)

Question 6

I found this simple solution, PyMuPDF, output to png file. Note the library is imported as “fitz”, a historical name for the rendering engine it uses.

import fitz

pdffile = "infile.pdf"
doc = fitz.open(pdffile)
page = doc.loadPage(0)  # number of page
pix = page.getPixmap()
output = "outfile.png"
pix.writePNG(output)

Question 7

Python库pdf2image其实（在对方的回答中）没有做远不止推出 pdttoppm有subprocess.Popen，所以这里是一个短版，直接做：

PDFTOPPMPATH = r"D:\Documents\software\____PORTABLE\poppler-0.51\bin\pdftoppm.exe"
PDFFILE = "SKM_28718052212190.pdf"

import subprocess
subprocess.Popen('"%s" -png "%s" out' % (PDFTOPPMPATH, PDFFILE))

这是Windows安装链接pdftoppm（包含在名为poppler的软件包中）：http : //blog.alivate.com.au/poppler-windows/

Question 8

The Python library pdf2image (used in the other answer) in fact doesn’t do much more than just launching pdttoppm with subprocess.Popen, so here is a short version doing it directly:

PDFTOPPMPATH = r"D:\Documents\software\____PORTABLE\poppler-0.51\bin\pdftoppm.exe"
PDFFILE = "SKM_28718052212190.pdf"

import subprocess
subprocess.Popen('"%s" -png "%s" out' % (PDFTOPPMPATH, PDFFILE))

Here is the Windows installation link for pdftoppm (contained in a package named poppler): http://blog.alivate.com.au/poppler-windows/

Question 9

无需在操作系统上安装Poppler。这将起作用：

点安装魔杖

from wand.image import Image

f = "somefile.pdf"
with(Image(filename=f, resolution=120)) as source: 
    for i, image in enumerate(source.sequence):
        newfilename = f[:-4] + str(i + 1) + '.jpeg'
        Image(image).save(filename=newfilename)

Question 10

There is no need to install Poppler on your OS. This will work:

pip install Wand

from wand.image import Image

f = "somefile.pdf"
with(Image(filename=f, resolution=120)) as source: 
    for i, image in enumerate(source.sequence):
        newfilename = f[:-4] + str(i + 1) + '.jpeg'
        Image(image).save(filename=newfilename)

Question 11

@gaurwraith，为Windows安装poppler并使用pdftoppm.exe，如下所示：

从http://blog.alivate.com.au/poppler-windows/下载带有Poppler最新二进制文件/ dll的zip文件，然后解压缩到程序文件文件夹中的新文件夹。例如：“ C：\ Program Files（x86）\ Poppler”。
将“ C：\ Program Files（x86）\ Poppler \ poppler-0.68.0 \ bin”添加到您的SYSTEM PATH环境变量。
从cmd行安装pdf2image模块->“ pip install pdf2image”。
或者，如用户Basj所述，使用Python的子进程模块直接从代码中执行pdftoppm.exe。

@vishvAs vAsuki，此代码应通过子处理模块为给定文件夹中一个或多个pdf的所有页面生成所需的jpg：

import os, subprocess

pdf_dir = r"C:\yourPDFfolder"
os.chdir(pdf_dir)

pdftoppm_path = r"C:\Program Files (x86)\Poppler\poppler-0.68.0\bin\pdftoppm.exe"

for pdf_file in os.listdir(pdf_dir):

    if pdf_file.endswith(".pdf"):

        subprocess.Popen('"%s" -jpeg %s out' % (pdftoppm_path, pdf_file))

或使用pdf2image模块：

import os
from pdf2image import convert_from_path

pdf_dir = r"C:\yourPDFfolder"
os.chdir(pdf_dir)

    for pdf_file in os.listdir(pdf_dir):

        if pdf_file.endswith(".pdf"):

            pages = convert_from_path(pdf_file, 300)
            pdf_file = pdf_file[:-4]

            for page in pages:

               page.save("%s-page%d.jpg" % (pdf_file,pages.index(page)), "JPEG")

Question 12

@gaurwraith, install poppler for Windows and use pdftoppm.exe as follows:

Download zip file with Poppler’s latest binaries/dlls from http://blog.alivate.com.au/poppler-windows/ and unzip to a new folder in your program files folder. For example: “C:\Program Files (x86)\Poppler”.
Add “C:\Program Files (x86)\Poppler\poppler-0.68.0\bin” to your SYSTEM PATH environment variable.
From cmd line install pdf2image module -> “pip install pdf2image”.
Or alternatively, directly execute pdftoppm.exe from your code using Python’s subprocess module as explained by user Basj.

@vishvAs vAsuki, this code should generate the jpgs you want through the subprocess module for all pages of one or more pdfs in a given folder:

import os, subprocess

pdf_dir = r"C:\yourPDFfolder"
os.chdir(pdf_dir)

pdftoppm_path = r"C:\Program Files (x86)\Poppler\poppler-0.68.0\bin\pdftoppm.exe"

for pdf_file in os.listdir(pdf_dir):

    if pdf_file.endswith(".pdf"):

        subprocess.Popen('"%s" -jpeg %s out' % (pdftoppm_path, pdf_file))

Or using the pdf2image module:

import os
from pdf2image import convert_from_path

pdf_dir = r"C:\yourPDFfolder"
os.chdir(pdf_dir)

    for pdf_file in os.listdir(pdf_dir):

        if pdf_file.endswith(".pdf"):

            pages = convert_from_path(pdf_file, 300)
            pdf_file = pdf_file[:-4]

            for page in pages:

               page.save("%s-page%d.jpg" % (pdf_file,pages.index(page)), "JPEG")

Question 13

他们是一个名为pdftojpg的实用程序，可用于将pdf转换为img

您可以在这里找到代码https://github.com/pankajr141/pdf2jpg

from pdf2jpg import pdf2jpg
inputpath = r"D:\inputdir\pdf1.pdf"
outputpath = r"D:\outputdir"
# To convert single page
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="1")
print(result)

# To convert multiple pages
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="1,0,3")
print(result)

# to convert all pages
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="ALL")
print(result)

Question 14

Their is a utility called pdftojpg which can be used to convert the pdf to img

You can found the code here https://github.com/pankajr141/pdf2jpg

from pdf2jpg import pdf2jpg
inputpath = r"D:\inputdir\pdf1.pdf"
outputpath = r"D:\outputdir"
# To convert single page
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="1")
print(result)

# To convert multiple pages
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="1,0,3")
print(result)

# to convert all pages
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="ALL")
print(result)

Question 15

对于基于Linux的系统，GhostScript的执行速度比Poppler快得多。

以下是pdf到图像转换的代码。

def get_image_page(pdf_file, out_file, page_num):
    page = str(page_num + 1)
    command = ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m", "-r" + str(RESOLUTION), "-dPDFFitPage",
               "-sOutputFile=" + out_file, "-dFirstPage=" + page, "-dLastPage=" + page,
               pdf_file]
    f_null = open(os.devnull, 'w')
    subprocess.call(command, stdout=f_null, stderr=subprocess.STDOUT)

GhostScript可以使用以下命令安装在macOS上 brew install ghostscript

其他平台的安装信息可在此处找到。如果您的系统上尚未安装它。

Question 16

GhostScript performs much faster than Poppler for a Linux based system.

Following is the code for pdf to image conversion.

def get_image_page(pdf_file, out_file, page_num):
    page = str(page_num + 1)
    command = ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m", "-r" + str(RESOLUTION), "-dPDFFitPage",
               "-sOutputFile=" + out_file, "-dFirstPage=" + page, "-dLastPage=" + page,
               pdf_file]
    f_null = open(os.devnull, 'w')
    subprocess.call(command, stdout=f_null, stderr=subprocess.STDOUT)

GhostScript can be installed on macOS using brew install ghostscript

Installation information for other platforms can be found here. If it is not already installed on your system.

Question 17

我使用pdf2image的（也许）简单得多的选项：

cd $dir
for f in *.pdf
do
  if [ -f "${f}" ]; then
    n=$(echo "$f" | cut -f1 -d'.')
    pdftoppm -scale-to 1440 -png $f $conv/$n
    rm $f
    mv  $conv/*.png $dir
  fi
done

这是循环中bash脚本的一小部分，用于使用狭窄的投射设备。每5秒钟检查一次添加的pdf文件（全部）并进行处理。这是针对演示设备的，最终转换将在远程服务器上完成。现在可以转换为.PNG，但是.JPG也可以。

这种转换以及A4格式的过渡，显示视频，两个平滑滚动文本和徽标（三个版本中有过渡），将Pi3设置为最高4x 100％cpu-load ;-)

Question 18

I use a (maybe) much simpler option of pdf2image:

cd $dir
for f in *.pdf
do
  if [ -f "${f}" ]; then
    n=$(echo "$f" | cut -f1 -d'.')
    pdftoppm -scale-to 1440 -png $f $conv/$n
    rm $f
    mv  $conv/*.png $dir
  fi
done

This is a small part of a bash script in a loop for the use of a narrow casting device. Checks every 5 seconds on added pdf files (all) and processes them. This is for a demo device, at the end converting will be done at a remote server. Converting to .PNG now, but .JPG is possible too.

This converting, together with transitions on A4 format, displaying a video, two smooth scrolling texts and a logo (with transition in three versions) sets the Pi3 to allmost 4x 100% cpu-load ;-)

Question 19

from pdf2image import convert_from_path
import glob

pdf_dir = glob.glob(r'G:\personal\pdf\*')  #your pdf folder path
img_dir = "G:\\personal\\img\\"           #your dest img path

for pdf_ in pdf_dir:
    pages = convert_from_path(pdf_, 500)
    for page in pages:
        page.save(img_dir+pdf_.split("\\")[-1][:-3]+"jpg", 'JPEG')

Question 20

from pdf2image import convert_from_path
import glob

pdf_dir = glob.glob(r'G:\personal\pdf\*')  #your pdf folder path
img_dir = "G:\\personal\\img\\"           #your dest img path

for pdf_ in pdf_dir:
    pages = convert_from_path(pdf_, 500)
    for page in pages:
        page.save(img_dir+pdf_.split("\\")[-1][:-3]+"jpg", 'JPEG')

Question 21

这是一种不需要其他库且速度非常快的解决方案。可以从以下网址找到它：https : //nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html# 我已在函数中添加了代码以使其更加方便。

def convert(filepath):
    with open(filepath, "rb") as file:
        pdf = file.read()

    startmark = b"\xff\xd8"
    startfix = 0
    endmark = b"\xff\xd9"
    endfix = 2
    i = 0

    njpg = 0
    while True:
        istream = pdf.find(b"stream", i)
        if istream < 0:
            break
        istart = pdf.find(startmark, istream, istream + 20)
        if istart < 0:
            i = istream + 20
            continue
        iend = pdf.find(b"endstream", istart)
        if iend < 0:
            raise Exception("Didn't find end of stream!")
        iend = pdf.find(endmark, iend - 20)
        if iend < 0:
            raise Exception("Didn't find end of JPG!")

        istart += startfix
        iend += endfix
        jpg = pdf[istart:iend]
        newfile = "{}jpg".format(filepath[:-3])
        with open(newfile, "wb") as jpgfile:
            jpgfile.write(jpg)

        njpg += 1
        i = iend

        return newfile

以pdf路径作为参数调用convert，该函数将在同一目录中创建一个.jpg文件

Question 22

Here is a solution which requires no additional libraries and is very fast. This was found from: https://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html# I have added the code in a function to make it more convenient.

def convert(filepath):
    with open(filepath, "rb") as file:
        pdf = file.read()

    startmark = b"\xff\xd8"
    startfix = 0
    endmark = b"\xff\xd9"
    endfix = 2
    i = 0

    njpg = 0
    while True:
        istream = pdf.find(b"stream", i)
        if istream < 0:
            break
        istart = pdf.find(startmark, istream, istream + 20)
        if istart < 0:
            i = istream + 20
            continue
        iend = pdf.find(b"endstream", istart)
        if iend < 0:
            raise Exception("Didn't find end of stream!")
        iend = pdf.find(endmark, iend - 20)
        if iend < 0:
            raise Exception("Didn't find end of JPG!")

        istart += startfix
        iend += endfix
        jpg = pdf[istart:iend]
        newfile = "{}jpg".format(filepath[:-3])
        with open(newfile, "wb") as jpgfile:
            jpgfile.write(jpg)

        njpg += 1
        i = iend

        return newfile

Call convert with the pdf path as the argument and the function will create a .jpg file in the same directory

从pdf提取页面作为jpeg

问题：从pdf提取页面作为jpeg

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

排行榜展示

Python 情人节超强技能导出微信聊天记录生成词云

你不得不知道的python超级文献批量搜索下载工具

7行代码 Python热力图可视化分析缺失数据处理

Python 流程图 — 一键转化代码为流程图

Python 优化—算出每条语句执行时间

你的10W块放哪里能赚最多钱？

文章展示

字符串，值=>列表" data-prefetch="true">“太多的价值无法解包”，遍历一个字典。键=>字符串，值=>列表

将多个过滤器应用于pandas DataFrame或Series的有效方法

在Python中，如何将警告视为异常？

为什么非默认参数不能跟随默认参数？

优劣互补! Python+Go结合开发的探讨

检查列表中是否存在值的最快方法

从pdf提取页面作为jpeg

问题：从pdf提取页面作为jpeg

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

相关文章

排行榜展示

文章展示