CAM::PDF - PDF manipulation library
Copyright 2002-2006 Clotho Advanced Media, Inc., http://www.clotho.com/
Copyright 2007-2008 Chris Dolan
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
    use CAM::PDF;

    my $pdf = CAM::PDF->new('test1.pdf');

    my $page1 = $pdf->getPageContent(1);
    [ ... mess with page ... ]
    $pdf->setPageContent(1, $page1);
    [ ... create some new content ... ]
    $pdf->appendPageContent(1, $newcontent);

    my $anotherpdf = CAM::PDF->new('test2.pdf');
    $pdf->appendPDF($anotherpdf);

    my @prefs = $pdf->getPrefs();
    $prefs[$CAM::PDF::PREF_OPASS] = 'mypassword';
    $pdf->setPrefs(@prefs);

    $pdf->cleanoutput('out1.pdf');
    print $pdf->toPDF();
Many example programs are included in this distribution to do useful tasks. See the bin subdirectory.
This package reads and writes any document that conforms to the PDF specification generously provided by Adobe at http://partners.adobe.com/public/developer/pdf/index_reference.html (link last checked Oct 2005).
The file format is well-supported, with the exception of the ``linearized'' or ``optimized'' output format, which this module can read but not write. Many specific aspects of the document model are not manipulable with this package (like fonts), but if the input document is correctly written, then this module will preserve the model integrity.
This library grants you some power over the PDF security model. Note that applications editing PDF documents via this library MUST respect the security preferences of the document. Any violation of this respect is contrary to Adobe's intellectual property position, as stated in the reference manual at the above URL.
Technical detail regarding corrupt PDFs: This library adheres strictly to the PDF specification. Adobe's Acrobat Reader is more lenient, allowing some corrupted PDFs to be viewable. Therefore, some PDFs that Acrobat can read may be unreadable by this library. In particular, files which have had line endings converted to or from DOS/Windows style (i.e. CR-LF) may be rendered unusable even though Acrobat does not complain. Future library versions may relax the parser, but not yet.
    $self = CAM::PDF->new(content | filename | '-')
    $self->toPDF()
    $self->needsSave()
    $self->save()
    $self->cleansave()
    $self->output(filename | '-')
    $self->cleanoutput(filename | '-')
    $self->previousRevision()
    $self->allRevisions()
    $self->preserveOrder()
    $self->appendObject(olddoc, oldnum, [follow=(1|0)])
    $self->replaceObject(newnum, olddoc, oldnum, [follow=(1|0)])
       (olddoc can be undef in the above for adding new objects)
    $self->numPages()
    $self->getPageText(pagenum)
    $self->getPageDimensions(pagenum)
    $self->getPageContent(pagenum)
    $self->setPageContent(pagenum, content)
    $self->appendPageContent(pagenum, content)
    $self->deletePage(pagenum)
    $self->deletePages(pagenum, pagenum, ...)
    $self->extractPages(pagenum, pagenum, ...)
    $self->appendPDF(CAM::PDF object)
    $self->prependPDF(CAM::PDF object)
    $self->wrapString(string, width, fontsize, page, fontlabel)
    $self->getFontNames(pagenum)
    $self->addFont(page, fontname, fontlabel, [fontmetrics])
    $self->deEmbedFont(page, fontname, [newfontname])
    $self->deEmbedFontByBaseName(page, basename, [newfont])
    $self->getPrefs()
    $self->setPrefs()
    $self->canPrint()
    $self->canModify()
    $self->canCopy()
    $self->canAdd()
    $self->getFormFieldList()
    $self->fillFormFields(fieldname, value, [fieldname, value, ...])
       or $self->fillFormFields(%values)
    $self->clearFormFieldTriggers(fieldname, fieldname, ...)
Note: 'clean' as in cleansave() and cleanoutput() means write a fresh PDF document. The alternative (e.g. save()) reuses the existing doc and just appends to it. Also note that 'clean' functions sort the objects numerically. If you prefer that the new PDF docs more closely resemble the old ones, call preserveOrder() before cleansave() or cleanoutput().
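As a minimal sketch of the note above, the helper below writes a fresh copy of a document and optionally keeps the original object order. The name write_fresh_copy and its $keep_order flag are illustrative assumptions, not part of CAM::PDF; the only library calls used are preserveOrder() and cleanoutput() from the list above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): write a fresh document,
# optionally keeping the object order close to the original.
sub write_fresh_copy {
    my ($pdf, $outfile, $keep_order) = @_;

    # Must happen before the 'clean' step, per the note above.
    $pdf->preserveOrder() if $keep_order;

    # Rewrites the whole document instead of appending to the old one.
    $pdf->cleanoutput($outfile);
    return;
}

# Typical use, assuming CAM::PDF is installed and in.pdf exists:
#   use CAM::PDF;
#   my $pdf = CAM::PDF->new('in.pdf');
#   write_fresh_copy($pdf, 'out.pdf', 1);
```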
    $self->toString()
    $self->getPage(pagenum)
    $self->getFont(pagenum, fontname)
    $self->getFonts(pagenum)
    $self->getStringWidth(fontdict, string)
    $self->getFormField(fieldname)
    $self->getFormFieldDict(object)
    $self->isLinearized()
    $self->decodeObject(objectnum)
    $self->decodeAll(any-node)
    $self->decodeOne(dict-node)
    $self->encodeObject(objectnum, filter)
    $self->encodeOne(any-node, filter)
    $self->changeString(obj-node, hashref)
    $self->pageAddName(pagenum, name, objectnum)
    $self->getPageObjnum(pagenum)
    $self->getPropertyNames(pagenum)
    $self->getProperty(pagenum, propname)
    $self->getValue(any-node)
    $self->dereference(objectnum)
       or $self->dereference(name,pagenum)
    $self->deleteObject(objectnum)
    $self->copyObject(obj-node)
    $self->cacheObjects()
    $self->setObjNum(obj-node, num)
    $self->getRefList(obj-node)
    $self->changeRefKeys(obj-node, hashref)
$self->getObjValue(objectnum)
    $self->_startdoc()
    $self->delinearize()
    $self->build*()
    $self->parse*()
    $self->write*()
    $self->*CB()
    $self->traverse()
    $self->fixDecode()
    $self->abbrevInlineImage()
    $self->unabbrevInlineImage()
    $self->cleanse()
    $self->clean()
    $self->createID()
$content can be a document in a string, a filename, or '-'. The latter indicates that the document should be read from standard input. If the document is password protected, the passwords should be passed as additional arguments. If they are not known, a boolean $prompt argument allows the programmer to suggest that the constructor prompt the user for a password. This is rudimentary prompting: passwords are in the clear on the console.
This constructor takes an optional final argument which is a hash reference. This hash can contain any of the following optional parameters:
$prompt
argument described above.
toPDF()
toString()
(all of these functions are intended for internal only)
getRootDict()
getPagesDict()
parseObj($string)
Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return an object Node. This can be called as a class method in most circumstances, but is intended as an instance method.
parseInlineImage($string)
writeInlineImage($objectnode)
Given a fragment of PDF page content, parse it and return a stream Node. This can be called as a class method in most circumstances, but is intended as an instance method.
The dictionary Node argument is typically the body of the object Node that precedes this stream.
parseDict($string)
Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a dictionary Node. This can be called as a class method in most circumstances, but is intended as an instance method.
parseArray($string)
Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return an array Node. This can be called as a class or instance method.
parseLabel($string)
Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a label Node. This can be called as a class or instance method.
parseRef($string)
Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a reference Node. This can be called as a class or instance method.
parseNum($string)
Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a number Node. This can be called as a class or instance method.
parseString($string)
Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a string Node. This can be called as a class or instance method.
parseHexString($string)
Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a hex string Node. This can be called as a class or instance method.
parseBoolean($string)
Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a boolean Node. This can be called as a class or instance method.
parseNull($string)
Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a null Node. This can be called as a class or instance method.
parseAny($string)
getValue($object)
Dereference a data object and return a value. Given a node object of any kind, returns a raw scalar value: hashref, arrayref, string or number. This function follows all references and descends into all objects.
getObjValue($objectnum)
Dereference a data object, and return a value. Behaves just like the getValue() function, but used when all you know is the object number.
dereference($objectnum)
Dereference a data object and return a PDF object as a node. This function makes heavy use of the internal object cache. Most (if not all) object requests should go through this function.
$name should look something like '/R12'.
getPropertyNames($pagenum)
getPropertyNames() returns an array of the names of those resources. getProperty() returns a node representing a named property (most likely a reference node).
Returns a dictionary for a given font identified by its label, referenced by page.
getFontNames($pagenum)
Returns a list of fonts for a given page.
getFonts($pagenum)
Returns an array of font objects for a given page.
Returns a dictionary for a given font, referenced by page and the name of the base font.
Returns a data structure representing the font metrics for the named font. The property list is the results of something like the following:
    $self->_buildNameTable($pagenum);
    my $properties = $self->{Names}->{$pagenum};
Alternatively, if you know the page number, it might be easier to do:
    my $font = $self->dereference($fontlabel, $pagenum);
    my $fontmetrics = $font->{value}->{value};
where the $fontlabel is something like '/Helv'. The getFontMetrics() method is useful in the cases where you've forgotten which page number you are working on (e.g. in CAM::PDF::GS), or if your property list isn't part of any page (e.g. working with form field annotation objects).
If a font metrics hash is supplied (it is required for a font other than the 14 core fonts), then it is cloned and inserted into the new font structure. Note that if those font metrics contain references (e.g. to the FontDescriptor), the referred objects are not copied -- you must do that part yourself.
For Type1 fonts, the font metrics must minimally contain the following fields: Subtype, FirstChar, LastChar, Widths, FontDescriptor.
The optional $basefont parameter allows you to change the font. This is useful when some applications embed a standard font (see below) and give it a funny name, like SYLXNP+Helvetica. In this example, it's important to change the basename back to the standard Helvetica when de-embedding.
De-embedding the font does NOT remove it from the PDF document, it just removes references to it. To get a size reduction by throwing away unused font data, you should use the following code sometime after this method.
$self->cleanse();
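The de-embedding steps described above can be sketched as one helper. This is only an illustration under stated assumptions: the helper name deembed_and_shrink and its argument order are made up for this example, and it uses deEmbedFont(), cleanse() and cleanoutput() with the argument lists given in the API summary.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): de-embed one font on one
# page, discard the font data that is no longer referenced, and write a
# fresh copy of the document.
sub deembed_and_shrink {
    my ($pdf, $pagenum, $fontname, $basefont, $outfile) = @_;

    # $basefont is optional, e.g. 'Helvetica' for 'SYLXNP+Helvetica'.
    if (defined $basefont) {
        $pdf->deEmbedFont($pagenum, $fontname, $basefont);
    }
    else {
        $pdf->deEmbedFont($pagenum, $fontname);
    }

    $pdf->cleanse();                # throw away unused objects
    $pdf->cleanoutput($outfile);    # write a fresh document
    return;
}
```

Calling cleanse() only after all fonts of interest are de-embedded avoids repeating that pass for every font.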
For reference, the standard fonts are Times-Roman, Helvetica, and Courier (and their bold, italic and bold-italic forms) plus Symbol and ZapfDingbats. (Adobe PDF Reference v1.4, p.319)
Returns the width of the string, using the font metrics if possible.
numPages()
getPage($pagenum)
Returns a dictionary for a given numbered page.
getPageObjnum($pagenum)
Return the number of the PDF object in which the specified page occurs.
getPageText($pagenum)
getPageContentTree($pagenum)
getPageContent($pagenum)
getPageDimensions($pagenum)
x, y, width and height numbers that define the dimensions of the specified page in points (1/72 inches). Technically, these are the MediaBox dimensions, which explains why it's possible for x and y to be non-zero, but that's a rare case.
For example, given a simple 8.5 by 11 inch page, this method will return (0,0,612,792).
This method will die() if the specified page number does not exist.
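Because the dimensions are reported in points, a small conversion is often handy. The sketch below assumes only what is stated above, namely that getPageDimensions() returns x, y, width and height; the helper name page_size_inches is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): page size in inches.
sub page_size_inches {
    my ($pdf, $pagenum) = @_;

    # x and y are usually 0 and are ignored here.
    my ($x, $y, $width, $height) = $pdf->getPageDimensions($pagenum);

    # One point is 1/72 inch.
    return ($width / 72, $height / 72);
}

# Typical use, assuming CAM::PDF is installed and in.pdf exists:
#   use CAM::PDF;
#   my $pdf = CAM::PDF->new('in.pdf');
#   my ($w, $h) = page_size_inches($pdf, 1);
```

For the 612 by 792 point page in the example above, this yields 8.5 by 11.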
getName($object)
Given a PDF object reference, return its name, if it has one. This is useful for indirect references to images in particular.
getPrefs()
    owner password
    user password
    print boolean
    modify boolean
    copy boolean
    add boolean
See the PDF reference for the intended use of the latter four booleans.
This module publishes the array indices of these values for your convenience:
    $CAM::PDF::PREF_OPASS
    $CAM::PDF::PREF_UPASS
    $CAM::PDF::PREF_PRINT
    $CAM::PDF::PREF_MODIFY
    $CAM::PDF::PREF_COPY
    $CAM::PDF::PREF_ADD
So, you can retrieve the value of the Copy boolean via:
my ($canCopy) = ($self->getPrefs())[$CAM::PDF::PREF_COPY];
canPrint()
canModify()
canCopy()
canAdd()
getFormFieldList()
fillFormFields()
function.
getFormField($name)
Return the object containing the form field definition for the specified field name. $name can be either the full name or the ``short/alternate'' name.
getFormFieldDict($formfieldobject)
Return a hash reference representing the accumulated property list for a form field, including all of its inherited properties. This should be treated as a read-only hash! It ONLY retrieves the properties it knows about.
Important Note: Most PDF readers (Acrobat, Preview.app) only offer one password field for opening documents. So, if the $ownerpass and $userpass are different, those applications cannot read the documents. (Perhaps this is a bug in CAM::PDF?)
Note: any omitted booleans default to false. So, these two are equivalent:
    $doc->setPrefs('password', 'password');
    $doc->setPrefs('password', 'password', 0, 0, 0, 0);
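Given the note above about readers offering a single password field, one cautious pattern is to set both passwords to the same value and spell out each permission by name. The wrapper below is a sketch: protect_with_one_password and its named flags are assumptions made for this example, and the argument order passed to setPrefs() follows the list shown for getPrefs().

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): use one password for both
# owner and user, and make each permission explicit.
sub protect_with_one_password {
    my ($pdf, $password, %allow) = @_;

    $pdf->setPrefs(
        $password,                # owner password
        $password,                # user password (kept identical)
        $allow{print}  ? 1 : 0,   # print boolean
        $allow{modify} ? 1 : 0,   # modify boolean
        $allow{copy}   ? 1 : 0,   # copy boolean
        $allow{add}    ? 1 : 0,   # add boolean
    );
    return;
}

# Typical use: allow printing only.
#   protect_with_one_password($pdf, 'secret', print => 1);
```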
Change the name of a PDF object structure.
removeName($object)
Delete the name of a PDF object structure.
Append a named object to the metadata for a given page.
getPageContent() function and some manipulation of the returned string from that function.
extractPages($pages...)
deletePages($pages...)
deletePage($pagenum)
appendPDF($pdf)
Note that this can break documents with annotations. See the appendpdf.pl script for a workaround.
prependPDF($pdf)
appendPDF() except the new document is inserted on page 1 instead of at the end.
duplicatePage($pagenum)
$pagenum + 1.
If $leaveblank is true, the new page does not get any content. Thus, the document is broken until you subsequently call setPageContent().
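To avoid leaving the document in that broken state, the two calls can be paired. The sketch below assumes that $leaveblank is passed as the second argument of duplicatePage() and that the copy lands at $pagenum + 1, as described above; the helper name duplicate_with_content is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): duplicate a page as a
# blank page and immediately give the copy its own content.
sub duplicate_with_content {
    my ($pdf, $pagenum, $content) = @_;

    # Assumption: a true second argument is the $leaveblank flag.
    $pdf->duplicatePage($pagenum, 1);

    # The new page follows the original, so fill it right away.
    my $newpage = $pagenum + 1;
    $pdf->setPageContent($newpage, $content);
    return $newpage;
}
```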
createStreamObject($content)
Create a new Stream object. This object is NOT added to the document.
Use the appendObject() function to do that after calling this function.
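The two-step pattern described above might look like the following sketch. It assumes that passing undef as the source document marks the node as an anonymous object (as the API summary notes) and that appendObject() returns the new object number; the helper name add_stream is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): create a stream and add
# it to the document in one call.
sub add_stream {
    my ($pdf, $content) = @_;

    # Step 1: build the stream; it is not yet part of the document.
    my $stream = $pdf->createStreamObject($content);

    # Step 2: append it. undef means "not from another document";
    # 0 means do not follow references.
    # Assumption: the return value is the new object number.
    my $objnum = $pdf->appendObject(undef, $stream, 0);
    return $objnum;
}
```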
uninlineImages()
uninlineImages($pagenum)
BI and ID a lot.
Like replaceObject(), the second form allows you to append a newly-created block to the PDF.
If the other document is undefined, then the object to copy is taken to be an anonymous object that is not part of any other document. This is useful when you've just created that anonymous object.
deleteObject($objectnum)
cleanse()
createID()
Generate a new document ID. Contrary to the Adobe recommendation, this is a random number.
getFormFieldList() function. The argument list can be a hash if you like. A simple way to use this function is something like this:
    my %fields = (fname => 'John', lname => 'Smith', state => 'WI');
    $fields{zip} = 53703;
    $self->fillFormFields(%fields);
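When the field names come from outside the program, it can help to filter them against the names the document actually defines. The following is a sketch under that assumption; fill_known_fields is a hypothetical helper built on getFormFieldList() and fillFormFields() as listed in the API summary.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): fill only those fields
# that exist in the document and report which ones were filled.
sub fill_known_fields {
    my ($pdf, %values) = @_;

    # Names of all form fields defined in the document.
    my %known = map { ($_ => 1) } $pdf->getFormFieldList();

    # Keep only the requested values whose field name is known.
    my %usable = map { ($_ => $values{$_}) }
                 grep { $known{$_} } keys %values;

    $pdf->fillFormFields(%usable) if %usable;

    my @filled = sort keys %usable;
    return @filled;
}
```

Comparing the returned list with the keys that were passed in shows which names did not match any field.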
If the first argument is a hash reference, it is interpreted as options for how to render the filled data:
clearAnnotations()
previousRevision()
clean() was not invoked on it), return a new instance representing that previous version. Otherwise return void. If this is an encrypted PDF, this method assumes that previous revisions were encrypted with the same password, which may be an incorrect assumption.
allRevisions()
previousRevision until there are no more previous revisions. Returns a list of instances from newest to oldest including this instance as the newest.
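As a small illustration of working with that list, the sketch below reports the page count of every revision, newest first. It assumes only the behavior described above and the numPages() method from the API summary; the helper name pages_per_revision is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): page counts for each
# revision, in the order returned by allRevisions() (newest first).
sub pages_per_revision {
    my ($pdf) = @_;
    my @counts = map { $_->numPages() } $pdf->allRevisions();
    return @counts;
}

# Typical use, assuming CAM::PDF is installed and in.pdf exists:
#   use CAM::PDF;
#   my $pdf = CAM::PDF->new('in.pdf');
#   print join(', ', pages_per_revision($pdf)), "\n";
```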
preserveOrder()
isLinearized()
delinearize()
Undo the tweaks used to make the document 'optimized'. This function is automatically called on every save or output since this library does not yet support linearized documents.
clean()
cleansave() and cleanoutput().
needsSave()
save() method needs to be called. Like save(), this has nothing to do with whether the document has been saved to disk, but whether the in-memory representation of the document has been serialized.
save()
This function operates solely in memory. It DOES NOT write the document to a file. See the output() function for that.
cleansave()
clean() function, then call the save() function.
output($filename)
output()
save() function is called first to serialize the data structure. If no filename is specified, or if the filename is '-', the document is written to standard output.
Note: it is the responsibility of the application to ensure that the PDF document has either the Modify or Add permission. You can do this like the following:
    if ($self->canModify()) {
       $self->output($outfile);
    } else {
       die "The PDF file denies permission to make modifications\n";
    }
cleanoutput($file)
cleanoutput()
clean() function, then call the output() function to write a fresh copy of the document to a file.
writeObject($objnum)
writeString($string)
writeAny($node)
In many cases, it's useful to apply one action to every node in an object tree. The routines below all use this traverse() function.
One of the most important parameters is the first: the $dereference boolean. If true, the traversal follows reference Nodes. If false, it does not descend into reference Nodes.
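To make the effect of the $dereference flag concrete, here is a simplified, stand-alone sketch of such a traversal. It is not the library's traverse() implementation: walk_nodes, the $resolve callback (which maps an object number to a Node) and the $seen cycle guard are all assumptions made for this illustration, using the type/value Node layout described in the internals section of this document.

```perl
use strict;
use warnings;

# Illustrative only: apply $callback to every node below $node.
# $resolve is a hypothetical callback: object number -> Node.
sub walk_nodes {
    my ($dereference, $node, $callback, $resolve, $seen) = @_;
    $seen ||= {};
    $callback->($node);

    my $type = $node->{type};
    if ($type eq 'dictionary') {
        walk_nodes($dereference, $_, $callback, $resolve, $seen)
            for values %{ $node->{value} };
    }
    elsif ($type eq 'array') {
        walk_nodes($dereference, $_, $callback, $resolve, $seen)
            for @{ $node->{value} };
    }
    elsif ($type eq 'object') {
        walk_nodes($dereference, $node->{value}, $callback, $resolve, $seen);
    }
    elsif ($type eq 'reference' && $dereference) {
        my $objnum = $node->{value};
        return if $seen->{$objnum}++;    # do not loop on cyclic references
        walk_nodes($dereference, $resolve->($objnum), $callback, $resolve, $seen);
    }
    return;
}
```

With $dereference false the walk stays inside one object; with it true the walk can reach every object that is referenced, which is why a guard against cycles is included in this sketch.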
decodeObject($objectnum)
Remove any filters (like compression, etc) from a data stream indicated by the object number.
decodeAll($object)
Remove any filters from any data stream in this object or any object referenced by it.
decodeOne($object)
Remove any filters from an object. The boolean flag $save (defaults to false) indicates whether this removal should be permanent or just this once. If true, the function returns success or failure. If false, the function returns the defiltered content.
getRefList($object)
Return an array of all objects referred to in this object.
Renumber all references in an object.
abbrevInlineImage($object)
unabbrevInlineImage($object)
regex(...) then it is interpreted as a Perl regular expression and is eval'ed. Otherwise the search-and-replace is literal.
    CAM::PDF->rangeToArray(1, 15, '1,3-5,12,9', '14-', '8 - 6, -2');

becomes

    (1,3,4,5,12,9,14,15,8,7,6,1,2)
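The expansion rules implied by that example (open-ended ranges, descending ranges, ignored whitespace) can be sketched in plain Perl. This is an illustrative re-implementation, not the code behind rangeToArray(): the function name range_to_list is made up, and details such as clamping values outside the minimum and maximum are not handled here.

```perl
use strict;
use warnings;

# Illustrative sketch: expand range specs into a flat list of numbers.
sub range_to_list {
    my ($min, $max, @specs) = @_;
    my @out;

    for my $part (map { split /,/, $_ } @specs) {
        $part =~ s/\s+//g;    # '8 - 6' becomes '8-6'

        if ($part =~ /\A(\d*)-(\d*)\z/) {
            # A missing start means $min, a missing end means $max.
            my $from = length($1) ? $1 + 0 : $min;
            my $to   = length($2) ? $2 + 0 : $max;
            push @out, $from <= $to ? ($from .. $to)
                                    : reverse($to .. $from);
        }
        elsif ($part =~ /\A\d+\z/) {
            push @out, $part + 0;
        }
    }
    return @out;
}

my @pages = range_to_list(1, 15, '1,3-5,12,9', '14-', '8 - 6, -2');
print join(',', @pages), "\n";
```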
trimstr($string)
copyObject($node)
cacheObjects()
asciify($string)
This library was primarily developed against the 3rd edition of the reference (PDF v1.4) with a few updates from the fourth edition. This library focuses on PDF v1.2 features. Nonetheless, it should be forward and backward compatible in the majority of cases.
This module is written with good speed and flexibility in mind, often at the expense of memory consumption. Entire PDF documents are typically slurped into RAM. As an example, simply calling new('PDFReference15_v15.pdf') (the 14 MB Adobe PDF Reference V1.5 document) pushes Perl to consume 84 MB of RAM on my development machine.
There are several other PDF modules on CPAN. Below is a brief description of a few of them.
This is the leading PDF library, in my opinion.
Excellent text and font support. This is the highest level library of the bunch, and is the most complete implementation of the Adobe PDF spec. The author is amazingly responsive and patient.
Excellent compression support (CAM::PDF cribs off this Text::PDF feature). This has not been developed since 2003.
This library is not object oriented, so it can only process one PDF at a time, while storing all data in global variables.
CAM::PDF is the only one of these that has regression tests.
Currently, CAM::PDF has test coverage of about 50%, as reported by Build testcover.
Additionally, PDFLib is a commercial package not on CPAN (www.pdflib.com). It is a C-based library with a Perl interface. It is designed for PDF creation, not for reuse.
The data structure used to represent the PDF document is composed primarily of a hierarchy of Node objects. Every node in the document tree has this structure:
    type => <type>
    value => <value>
    objnum => <object number>
    gennum => <generation number>
where the <value> depends on the <type>, and <type> is one of
    Type        Value
    ----        -----
    object      Node
    stream      byte string
    string      byte string
    hexstring   byte string
    number      number
    reference   integer (object number)
    boolean     "true" | "false"
    label       string
    array       arrayref of Nodes
    dictionary  hashref of (string => Node)
    null        undef
All of these except ``stream'' are directly related to the PDF data types of the same name. Streams are treated as special cases in this library since they have a non-general syntax and placement in the document body. Internally, streams are very much like strings, except that they have filters applied to them.
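To make the Node layout concrete, the sketch below builds a small dictionary Node by hand and flattens it into plain Perl data. The sample values and the helper plain_value are assumptions for illustration only; unlike the library's getValue(), this helper does not follow reference Nodes.

```perl
use strict;
use warnings;

# A hand-built dictionary Node, as it might appear inside object 12.
my $node = {
    type   => 'dictionary',
    objnum => 12,
    gennum => 0,
    value  => {
        Type   => { type => 'label',  value => 'Page', objnum => 12, gennum => 0 },
        Rotate => { type => 'number', value => 90,     objnum => 12, gennum => 0 },
        MediaBox => {
            type   => 'array',
            objnum => 12,
            gennum => 0,
            value  => [
                map { +{ type => 'number', value => $_, objnum => 12, gennum => 0 } }
                    (0, 0, 612, 792)
            ],
        },
    },
};

# Hypothetical helper: collapse Nodes into hashrefs, arrayrefs and scalars.
sub plain_value {
    my ($n) = @_;
    my ($type, $value) = @{$n}{qw(type value)};

    if ($type eq 'dictionary') {
        return +{ map { ($_ => plain_value($value->{$_})) } keys %{$value} };
    }
    if ($type eq 'array') {
        return [ map { plain_value($_) } @{$value} ];
    }
    return plain_value($value) if $type eq 'object';
    return $value;    # strings, numbers, labels, booleans, null, references
}

my $plain = plain_value($node);
print "Rotate: $plain->{Rotate}\n";
```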
All objects are referenced indirectly by their numbers, as defined in the PDF document. In all cases, the dereference() function should be used to deserialize objects into their internal representation. This function is also useful for looking up named objects in the page model metadata. Every node in the hierarchy contains its object and generation number. You can think of this as a sort of a pointer back to the root of each node tree. This serves in place of a ``parent'' link for every node, which would be harder to maintain.
The PDF document itself is represented internally as a hash reference with many components, including the document content, the document metadata (index, trailer and root node), the object cache, and several other caches, in addition to a few assorted bookkeeping structures.
The core of the document is represented in the object cache, which is only populated as needed, thus avoiding the overhead of parsing the whole document at read time.
Chris Dolan
This module was originally developed by me at Clotho Advanced Media Inc. Now I maintain it in my spare time.
Thanks to all the people who have submitted bug reports over the years! I've belatedly started crediting people in the CHANGES file. Apologies to contributors I've overlooked...