Apache PDFBox Library for manipulating PDF documents in Java


In this post we will discuss the Apache PDFBox library for manipulating PDF documents. Apache PDFBox is an open source Java library for working with PDF documents. It allows you to create new PDF documents, manipulate existing documents and extract content from documents. Apache PDFBox also includes several command line utilities.

Dependencies

PDFBox consists of three related components and depends on a few external libraries. The three PDFBox components are named pdfbox, fontbox and jempbox (in the PDFBox 2.0 line, jempbox is superseded by xmpbox).

Required libraries

Download all the required libraries and add them to the classpath of your application. Alternatively, if you have a Maven project, add a Maven dependency as described below.

To add the pdfbox, fontbox, jempbox and commons-logging jars to your application, the easiest thing is to declare the Maven dependency shown below. This gives you the main pdfbox library directly and the other required jars as transitive dependencies.
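As a minimal sketch, the dependency declaration in pom.xml could look like the following. The groupId and artifactId are the ones PDFBox is published under; the version number is only an example and should be replaced as described below.

```xml
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <!-- Example version only; replace with the latest stable release -->
    <version>2.0.0</version>
</dependency>
```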

Set the version field to the latest stable PDFBox version.

PDF document creation example

Below is a program for creating a sample PDF document using PDFBox.
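The following is a minimal sketch of such a program, assuming the PDFBox 2.0.x API. The class name, the output file name sample.pdf, the text position and the font size are illustrative choices, not values taken from the original listing.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

class CreatePdfExample {

    // Creates a one-page PDF containing a single line of text.
    static void createPdf(File target, String text) throws IOException {
        try (PDDocument document = new PDDocument()) {
            PDPage page = new PDPage();
            document.addPage(page);

            // The content stream must be closed before the document is saved.
            try (PDPageContentStream content = new PDPageContentStream(document, page)) {
                content.beginText();
                content.setFont(PDType1Font.TIMES_ROMAN, 14);
                // Coordinates are in points, measured from the lower-left corner of the page.
                content.newLineAtOffset(50, 700);
                content.showText(text);
                content.endText();
            }

            document.save(target);
        }
    }

    public static void main(String[] args) throws IOException {
        // "sample.pdf" is a hypothetical output file name.
        createPdf(new File("sample.pdf"), "Hello from Apache PDFBox");
    }
}
```

The try-with-resources blocks make sure that both the content stream and the document are closed even if an exception is thrown.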

Below is the screenshot of the PDF document created by the above program.

[Screenshot: the PDF document created by the program]

Extracting text from a PDF file

To extract text from a PDF file, we need to use the PDFTextStripper class. Here is a simple program to extract the text from the PDF file we have created in the previous example.
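A minimal sketch of the extraction, again assuming the PDFBox 2.0.x API (where PDFTextStripper lives in the org.apache.pdfbox.text package and documents are opened with PDDocument.load). The class name and the input file name sample.pdf are assumptions for illustration.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

class ExtractTextExample {

    // Returns the text of all pages of the given PDF file.
    static String extractText(File source) throws IOException {
        try (PDDocument document = PDDocument.load(source)) {
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(document);
        }
    }

    public static void main(String[] args) throws IOException {
        // "sample.pdf" is a hypothetical input file name.
        System.out.println(extractText(new File("sample.pdf")));
    }
}
```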

Limiting the extracted text

There are several ways that we can limit the text that is extracted during the extraction process. The simplest is to specify the range of pages that you want to be extracted. For example, to only extract text from the second and third pages of the PDF document you could do this:
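A page range restriction could be sketched as follows, assuming the PDFBox 2.0.x API. The class name, the helper method and the file name multi-page.pdf are hypothetical; setStartPage and setEndPage are the PDFTextStripper setters for the page range.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

class ExtractPageRangeExample {

    // Returns the text of pages startPage..endPage of the given PDF file.
    static String extractPages(File source, int startPage, int endPage) throws IOException {
        try (PDDocument document = PDDocument.load(source)) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(startPage);
            stripper.setEndPage(endPage);
            return stripper.getText(document);
        }
    }

    public static void main(String[] args) throws IOException {
        // "multi-page.pdf" is a hypothetical input file with at least three pages.
        System.out.println(extractPages(new File("multi-page.pdf"), 2, 3));
    }
}
```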

NOTE: The startPage and endPage properties of PDFTextStripper are 1-based and inclusive.

If you want to start on page 2 and extract to the end of the document, just set the startPage property. By default, all pages in the PDF document are extracted.

Working with Fonts

In our first example we used Times Roman as the font. The PDType1Font class supports only the 14 standard PDF fonts. To use a different font, we need to use the PDType0Font class and load the font from a font file. Below is the document creation example using the Arial font.
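A minimal sketch of loading a TrueType font, assuming the PDFBox 2.0.x API. The class name and the font path arial.ttf are assumptions; point the path at a font file that actually exists on your system.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;

class CustomFontExample {

    // Creates a one-page PDF whose text is rendered with the font loaded from fontFile.
    static void createPdf(File fontFile, File target, String text) throws IOException {
        try (PDDocument document = new PDDocument()) {
            PDPage page = new PDPage();
            document.addPage(page);

            // Load the font from the file and embed it in this document.
            PDFont font = PDType0Font.load(document, fontFile);

            try (PDPageContentStream content = new PDPageContentStream(document, page)) {
                content.beginText();
                content.setFont(font, 14);
                content.newLineAtOffset(50, 700);
                content.showText(text);
                content.endText();
            }

            document.save(target);
        }
    }

    public static void main(String[] args) throws IOException {
        // "arial.ttf" and "arial-sample.pdf" are hypothetical paths.
        createPdf(new File("arial.ttf"), new File("arial-sample.pdf"), "Hello in Arial");
    }
}
```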

Adding an image to PDF document

To add an image to a PDF document, we need to use the PDImageXObject class. Here is a sample program.
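A minimal sketch, assuming the PDFBox 2.0.x API. The class name, the image file name logo.png and the drawing position are hypothetical. PDImageXObject.createFromFile picks the image type from the file extension.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

class AddImageExample {

    // Creates a one-page PDF and draws the given image on it.
    static void createPdfWithImage(File imageFile, File target) throws IOException {
        try (PDDocument document = new PDDocument()) {
            PDPage page = new PDPage();
            document.addPage(page);

            PDImageXObject image = PDImageXObject.createFromFile(imageFile.getPath(), document);

            try (PDPageContentStream content = new PDPageContentStream(document, page)) {
                // Draws the image at its native size with its lower-left corner at (50, 500).
                // A large image may extend beyond the page at this position.
                content.drawImage(image, 50, 500);
            }

            document.save(target);
        }
    }

    public static void main(String[] args) throws IOException {
        // "logo.png" and "image-sample.pdf" are hypothetical file names.
        createPdfWithImage(new File("logo.png"), new File("image-sample.pdf"));
    }
}
```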

Here is the screenshot of the PDF file created.

[Screenshot: the PDF document with the added image]

Note: Throughout the article I have used the latest API – PDFBox 2.0. In older versions you may need to use different classes / methods for writing / reading. Those classes / methods are deprecated in version 2.0.

We have seen some of the basic functionality of PDFBox. A lot more can be done with the library. For further study, you can refer to the links below:

Apache PDFBox – A Java PDF Library

Apache PDFBox API

Working as a Java developer since 2010. Passionate about programming in Java. I am a part time blogger.
3 comments
  1. Hi! I’m trying to use the “Extracting text from a PDF” file example, but i’m getting the error below:

    Exception in thread “main” java.lang.NoSuchMethodError: org.apache.fontbox.afm.AFMParser.parse(Z)Lorg/apache/fontbox/afm/FontMetrics;

    Any suggestion?
    Regards
    Fernando

    • Have you included all the necessary jars in the classpath?

      You need to include fontbox-xxx.jar in the class path apart from Apache pdfbox jar as PDFBox internally uses fontbox-xxx. This will fix your issue.
