Question: Merging PDF files
Is it possible, using Python, to merge separate PDF files?
Assuming so, I need to extend this a little further. I am hoping to loop through folders in a directory and repeat this procedure.
And I may be pushing my luck, but is it possible to exclude a page that is contained in each of the PDFs? (My report generation always creates an extra blank page.)
Answer 0
Use pyPdf or its successor PyPDF2:
A Pure-Python library built as a PDF toolkit. It is capable of:
* splitting documents page by page,
* merging documents page by page,
(and much more)
Here’s a sample program that works with both versions.
#!/usr/bin/env python
import sys
try:
    from PyPDF2 import PdfFileReader, PdfFileWriter
except ImportError:
    from pyPdf import PdfFileReader, PdfFileWriter

def pdf_cat(input_files, output_stream):
    input_streams = []
    try:
        # First open all the files, then produce the output file, and
        # finally close the input files. This is necessary because
        # the data isn't read from the input files until the write
        # operation. Thanks to
        # https://stackoverflow.com/questions/6773631/problem-with-closing-python-pypdf-writing-getting-a-valueerror-i-o-operation/6773733#6773733
        for input_file in input_files:
            input_streams.append(open(input_file, 'rb'))
        writer = PdfFileWriter()
        for reader in map(PdfFileReader, input_streams):
            for n in range(reader.getNumPages()):
                writer.addPage(reader.getPage(n))
        writer.write(output_stream)
    finally:
        for f in input_streams:
            f.close()

if __name__ == '__main__':
    if sys.platform == "win32":
        import os, msvcrt
        msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
    # On Python 3 the binary output has to go to sys.stdout.buffer
    pdf_cat(sys.argv[1:], getattr(sys.stdout, 'buffer', sys.stdout))
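For the extra blank page mentioned in the question, one possible variation on the program above is to stop one page short of the end of each input. The sketch below is only that, a sketch: the names `pages_to_keep` and `pdf_cat_skip_last` are made up for illustration, it assumes the blank page is always the last page of every input, and it assumes the legacy PyPDF2 (before 3.0) class and method names used above.

```python
def pages_to_keep(num_pages, skip_last=1):
    """Return the page indices to copy, leaving out the last `skip_last` pages."""
    return range(max(num_pages - skip_last, 0))


def pdf_cat_skip_last(input_files, output_path, skip_last=1):
    """Concatenate PDFs, dropping the trailing page(s) of each input (hypothetical helper)."""
    # Imported lazily so pages_to_keep can be used without PyPDF2 installed.
    # Assumption: legacy PyPDF2 (< 3.0) names, as in the answer above.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    writer = PdfFileWriter()
    streams = []
    try:
        for name in input_files:
            stream = open(name, "rb")
            streams.append(stream)
            reader = PdfFileReader(stream)
            for n in pages_to_keep(reader.getNumPages(), skip_last):
                writer.addPage(reader.getPage(n))
        # Inputs must stay open until the output has been written.
        with open(output_path, "wb") as out:
            writer.write(out)
    finally:
        for stream in streams:
            stream.close()
```

Something like `pdf_cat_skip_last(["a.pdf", "b.pdf"], "merged.pdf")` would then write both inputs without their final pages.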
Answer 1
You can use PyPDF2's PdfFileMerger class (named PdfMerger in later PyPDF2 releases).
File Concatenation
You can simply concatenate files by using the append method.
from PyPDF2 import PdfFileMerger
pdfs = ['file1.pdf', 'file2.pdf', 'file3.pdf', 'file4.pdf']
merger = PdfFileMerger()
for pdf in pdfs:
    merger.append(pdf)
merger.write("result.pdf")
merger.close()
You can pass file handles instead of file paths if you want.
File Merging
If you want more fine-grained control of merging, there is a merge method of PdfFileMerger, which allows you to specify an insertion point in the output file, meaning you can insert the pages anywhere in the file. The append method can be thought of as a merge where the insertion point is the end of the file.
e.g.
merger.merge(2, pdf)
Here we insert the whole pdf into the output but at page 2.
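To make the relationship between merge and append concrete, here is a small model that uses plain Python lists in place of PDF pages. It is not PyPDF2 code: `merge_at` and `append_to` are hypothetical names, and the model assumes the insertion point is a zero-based index, so position 2 means "after the first two pages".

```python
def merge_at(output_pages, position, new_pages):
    """Return a new page list with `new_pages` inserted at index `position`."""
    return output_pages[:position] + list(new_pages) + output_pages[position:]


def append_to(output_pages, new_pages):
    """Appending is merging with the insertion point at the end."""
    return merge_at(output_pages, len(output_pages), new_pages)


output = ["a1", "a2", "a3", "a4"]
print(merge_at(output, 2, ["b1", "b2"]))   # ['a1', 'a2', 'b1', 'b2', 'a3', 'a4']
print(append_to(output, ["b1", "b2"]))     # ['a1', 'a2', 'a3', 'a4', 'b1', 'b2']
```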
Page Ranges
If you wish to control which pages are appended from a particular file, you can use the pages keyword argument of append and merge, passing a tuple in the form (start, stop[, step]) (like the regular range function).
e.g.
merger.append(pdf, pages=(0, 3))     # first 3 pages
merger.append(pdf, pages=(0, 6, 2))  # pages 1, 3, 5
If you specify an invalid range you will get an IndexError.
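Because the tuple follows the same convention as range, the pages it selects can be previewed without touching a PDF. The helper below is a standalone model rather than the library's implementation; `selected_pages` is a hypothetical name, and the bounds check only imitates the IndexError described above.

```python
def selected_pages(pages, num_pages):
    """Zero-based page indices chosen by a (start, stop[, step]) tuple."""
    indices = list(range(*pages))
    for i in indices:
        if not 0 <= i < num_pages:
            raise IndexError("page index %d out of range for %d pages" % (i, num_pages))
    return indices


print(selected_pages((0, 3), 10))     # [0, 1, 2]    -> pages 1, 2, 3
print(selected_pages((0, 6, 2), 10))  # [0, 2, 4]    -> pages 1, 3, 5
```

The indices are zero-based, which is why `(0, 6, 2)` corresponds to the first, third and fifth pages.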
Note also that, to avoid files being left open, the PdfFileMerger's close method should be called once the merged file has been written. This ensures all files (input and output) are closed in a timely manner. It's a shame that PdfFileMerger isn't implemented as a context manager, which would let us use the with keyword, avoid the explicit close call and get some easy exception safety.
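One way to get that exception safety anyway is the standard library's contextlib.closing, which wraps any object that has a close method. The sketch below uses a hypothetical stand-in class (`StandInMerger`) so that it runs without PyPDF2; the assumption is that the same wrapper could be applied to a PdfFileMerger instance, since it exposes close as described above.

```python
from contextlib import closing


class StandInMerger:
    """Hypothetical stand-in with the same append/write/close shape as the merger."""

    def __init__(self):
        self.appended = []
        self.written_to = None
        self.closed = False

    def append(self, name):
        self.appended.append(name)

    def write(self, target):
        self.written_to = target

    def close(self):
        self.closed = True


merger = StandInMerger()
# closing() calls merger.close() when the block exits, even after an exception.
with closing(merger) as m:
    for pdf in ["file1.pdf", "file2.pdf"]:
        m.append(pdf)
    m.write("result.pdf")

print(merger.closed)  # True
```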
You might also want to look at the pdfcat script provided as part of PyPDF2. You can potentially avoid the need to write code altogether.
The PyPDF2 GitHub repository also includes some example code demonstrating merging.
Answer 2
Merge all PDF files present in a directory
Put the PDF files in a directory and launch the program there. You get a single PDF with all of them merged.
import os
from PyPDF2 import PdfFileMerger

# Sort for a predictable order and skip the output of a previous run
x = sorted(a for a in os.listdir() if a.endswith(".pdf") and a != "result.pdf")
merger = PdfFileMerger()
for pdf in x:
    merger.append(pdf)
with open("result.pdf", "wb") as fout:
    merger.write(fout)
merger.close()
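The question also asks about looping through the folders of a directory and repeating the merge in each one. A rough sketch of that outer loop, using only the standard library, might look like the following. The function names and the `merge_func(input_paths, output_path)` callback are assumptions made for illustration; the callback is where a merge such as the one above would be plugged in.

```python
import os


def pdfs_by_folder(root, output_name="result.pdf"):
    """Map each folder under `root` to its sorted PDF paths, skipping `output_name`."""
    groups = {}
    for folder, _dirs, files in os.walk(root):
        pdfs = sorted(
            os.path.join(folder, name)
            for name in files
            if name.lower().endswith(".pdf") and name != output_name
        )
        if pdfs:
            groups[folder] = pdfs
    return groups


def merge_each_folder(root, merge_func, output_name="result.pdf"):
    """Call merge_func(input_paths, output_path) once per folder that contains PDFs."""
    for folder, pdfs in sorted(pdfs_by_folder(root, output_name).items()):
        merge_func(pdfs, os.path.join(folder, output_name))
```

Keeping the folder walk separate from the merging step means the walk can be checked on its own, for example by passing `print` as `merge_func` before wiring in a real merge.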
Answer 3
The pdfrw library can do this quite easily, assuming you don't need to preserve bookmarks and annotations, and your PDFs aren't encrypted. cat.py is an example concatenation script, and subset.py is an example page subsetting script.
The relevant part of the concatenation script, assuming inputs is a list of input filenames and outfn is an output file name:
from pdfrw import PdfReader, PdfWriter
writer = PdfWriter()
for inpfn in inputs:
    writer.addpages(PdfReader(inpfn).pages)
writer.write(outfn)
As you can see from this, it would be pretty easy to leave out the last page, e.g. something like:
writer.addpages(PdfReader(inpfn).pages[:-1])
Disclaimer: I am the primary pdfrw author.
Answer 4
Is it possible, using Python, to merge separate PDF files?
Yes.
The following example merges all files in one folder to a single new PDF file:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from argparse import ArgumentParser
from glob import glob
from pyPdf import PdfFileReader, PdfFileWriter
import os

def merge(path, output_filename):
    output = PdfFileWriter()
    for pdffile in glob(path + os.sep + '*.pdf'):
        # Compare absolute paths so that an existing output file is skipped
        # even when glob returns it with a leading directory
        if os.path.abspath(pdffile) == os.path.abspath(output_filename):
            continue
        print("Parse '%s'" % pdffile)
        document = PdfFileReader(open(pdffile, 'rb'))
        for i in range(document.getNumPages()):
            output.addPage(document.getPage(i))
    print("Start writing '%s'" % output_filename)
    with open(output_filename, "wb") as f:
        output.write(f)

if __name__ == "__main__":
    parser = ArgumentParser()
    # Add more options if you like
    parser.add_argument("-o", "--output",
                        dest="output_filename",
                        default="merged.pdf",
                        help="write merged PDF to FILE",
                        metavar="FILE")
    parser.add_argument("-p", "--path",
                        dest="path",
                        default=".",
                        help="path of source PDF files")
    args = parser.parse_args()
    merge(args.path, args.output_filename)
Answer 5
from PyPDF2 import PdfFileMerger
import webbrowser
import os

dir_path = os.path.dirname(os.path.realpath(__file__))
result_path = os.path.join(dir_path, 'result.pdf')

def list_files(directory, extension):
    return (f for f in os.listdir(directory) if f.endswith('.' + extension))

pdfs = list_files(dir_path, "pdf")
merger = PdfFileMerger()
for pdf in pdfs:
    # os.listdir returns bare names, so join them with the directory;
    # skip the output of a previous run
    if pdf != 'result.pdf':
        merger.append(os.path.join(dir_path, pdf))
with open(result_path, 'wb') as fout:
    merger.write(fout)
merger.close()
webbrowser.open_new('file://' + result_path)
Git Repo: https://github.com/mahaguru24/Python_Merge_PDF.git
Answer 6
A solution is given at http://pieceofpy.com/2009/03/05/concatenating-pdf-with-python/.
Similarly:
from pyPdf import PdfFileWriter, PdfFileReader

def append_pdf(input, output):
    # Copy every page of the input document to the output writer
    for page_num in range(input.numPages):
        output.addPage(input.getPage(page_num))

output = PdfFileWriter()
# open() replaces the Python 2-only file() builtin
append_pdf(PdfFileReader(open("C:\\sample.pdf", "rb")), output)
append_pdf(PdfFileReader(open("c:\\sample1.pdf", "rb")), output)
append_pdf(PdfFileReader(open("c:\\sample2.pdf", "rb")), output)
append_pdf(PdfFileReader(open("c:\\sample3.pdf", "rb")), output)
with open("c:\\combined.pdf", "wb") as f:
    output.write(f)
Answer 7
A slight variation using a dictionary for greater flexibility (e.g. sort, dedup):
import os
from PyPDF2 import PdfFileMerger

# use dict to sort by filepath or filename
file_dict = {}
for subdir, dirs, files in os.walk("<dir>"):
    for file in files:
        filepath = subdir + os.sep + file
        # you can have multiple endswith
        if filepath.endswith((".pdf", ".PDF")):
            file_dict[file] = filepath
# use strict = False to ignore PdfReadError: Illegal character error
merger = PdfFileMerger(strict=False)
for k, v in file_dict.items():
    print(k, v)
    merger.append(v)
merger.write("combined_result.pdf")
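The sorting and de-duplication that the dictionary is meant to enable are not spelled out above, so here is one way they might look, using only the standard library. `collect_pdfs` and `merge_order` are hypothetical names, and the choice to keep the first path seen for a repeated filename is an assumption of this sketch (the snippet above keeps the last one instead).

```python
import os


def collect_pdfs(root):
    """Return {filename: filepath} for PDFs under `root`; a repeated filename keeps the first path seen."""
    found = {}
    for subdir, dirs, files in os.walk(root):
        dirs.sort()  # visit subfolders in a predictable order
        for name in sorted(files):
            if name.lower().endswith(".pdf"):
                found.setdefault(name, os.path.join(subdir, name))
    return found


def merge_order(found, by="name"):
    """Paths sorted by filename (by='name') or by full path (any other value)."""
    if by == "name":
        return [found[name] for name in sorted(found)]
    return sorted(found.values())
```

The list returned by `merge_order` could then be fed to the merger's append loop in place of `file_dict.items()`.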
Answer 8
I used pdfunite from the Linux terminal by calling it through subprocess. This assumes one.pdf and two.pdf exist in the current directory; the aim is to merge them into three.pdf.
import subprocess
# Pass the arguments as a list so no shell is needed
subprocess.call(['pdfunite', 'one.pdf', 'two.pdf', 'three.pdf'])
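For more than two inputs, the argument list can be built programmatically. The sketch below is one possible wrapper, not an established API: `pdfunite_command` and `run_pdfunite` are hypothetical names, and it assumes the pdfunite executable is installed and on the PATH, taking its inputs first and the output file last as in the call above.

```python
import shutil
import subprocess


def pdfunite_command(inputs, output):
    """Build the argument list: pdfunite <inputs...> <output>."""
    if not inputs:
        raise ValueError("need at least one input PDF")
    return ["pdfunite"] + list(inputs) + [output]


def run_pdfunite(inputs, output):
    """Run pdfunite, raising if it is missing or exits with an error."""
    if shutil.which("pdfunite") is None:
        raise RuntimeError("pdfunite not found on PATH")
    subprocess.run(pdfunite_command(inputs, output), check=True)


print(pdfunite_command(["one.pdf", "two.pdf"], "three.pdf"))
# ['pdfunite', 'one.pdf', 'two.pdf', 'three.pdf']
```

Passing a list without `shell=True` also means filenames containing spaces need no extra quoting.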