org.faceless.pdf2
Class PageExtractor.Text

java.lang.Object
  extended by org.faceless.pdf2.PageExtractor.Text
All Implemented Interfaces:
Comparable
Enclosing class:
PageExtractor

public abstract class PageExtractor.Text
extends Object
implements Comparable

A class representing a piece of text which is extracted from the PageExtractor. Each text object has a location on the page, font-size, font-name, color and text.

Since:
2.6.2

Constructor Summary
PageExtractor.Text()
           
 
Method Summary
abstract  int compareTo(Object o)
           
 AnnotationMarkup createAnnotationMarkup(String type)
          Create a new AnnotationMarkup of the specified type to cover this text.
 float getAngle()
          Return the angle of rotation of this text on the page, in degrees clockwise from 12 o'clock.
abstract  float getBaseline()
          Return the baseline of the text item, as a fraction between 0 and 1. 0 would indicate the baseline is at the top of the text, 1 at the absolute bottom.
abstract  int getByteLength()
          Get the length of the original text in bytes.
abstract  int getByteToCharOffset(int byteoffset)
          Given a byte offset into the original String, return the Character offset it refers to.
abstract  Paint getColor()
          Return the color of this text
 float[] getCorners()
          Return the four corners (x1,y1) (x2,y2) (x3,y3) (x4,y4) of the quadrilateral that encompasses the text, specified clockwise from bottom left.
abstract  Reader getFontMetaData()
          Return any XMP MetaData that has been set on the Font, or null if none exists.
abstract  String getFontName()
          Return the font name of this text
abstract  float getFontSize()
          Return the font size of this text in points
 float getLength()
          Return the length of this Text in points.
abstract  float getOffset(int pos)
          Given an offset into the text, return the start position of that letter.
 PDFPage getPage()
          Return the PDFPage this text was found on - simply the page the parent PageExtractor was created from.
 PageExtractor getPageExtractor()
          Return the PageExtractor this text was created from
abstract  PageExtractor.Text getPrimaryText()
          If this text is a subtext or a collection of Text objects, return the primary text it starts with.
abstract  int getPrimaryTextOffset()
          If this text is a subtext or a collection of Text objects, return the offset into the primary text where it starts.
abstract  PageExtractor.Text getRowNext()
          Return the next Text item in this row, or null if there are none
abstract  PageExtractor.Text getRowPrevious()
          Return the previous Text item in this row, or null if there are none
abstract  PageExtractor.Text getSubText(int off, int len)
          Return a substring of this Text object as another Text object
abstract  String getText()
          Return the text content of this text
abstract  int getTextLength()
          Return the length of the String returned by getText()
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

PageExtractor.Text

public PageExtractor.Text()
Method Detail

getLength

public float getLength()
Return the length of this Text in points. This method measures the baseline of the text, so for rotated text the value will always be positive regardless of the angle.

Returns:
the length of the text in points at its baseline

getCorners

public final float[] getCorners()
Return the four corners (x1,y1) (x2,y2) (x3,y3) (x4,y4) of the quadrilateral that encompasses the text, specified clockwise from bottom left. The text baseline runs from (x1,y1) to (x4,y4).


createAnnotationMarkup

public AnnotationMarkup createAnnotationMarkup(String type)
Create a new AnnotationMarkup of the specified type to cover this text. The annotation is not added to the page; the caller must add it.

Parameters:
type - the type of markup - "Highlight", "Underline" etc.
Since:
2.8

getAngle

public final float getAngle()
Return the angle of rotation of this text on the page, in degrees clockwise from 12 o'clock. Most text is not rotated and so will return 0.

Returns:
the angle of the text

getFontSize

public abstract float getFontSize()
Return the font size of this text in points


getBaseline

public abstract float getBaseline()
Return the baseline of the text item, as a fraction between 0 and 1. 0 would indicate the baseline is at the top of the text, 1 at the absolute bottom. The value will normally be 0.8.

Since:
2.11.7

getOffset

public abstract float getOffset(int pos)
Given an offset into the text, return the start position of that letter. Because text may not be on a horizontal line, this value is returned as a float in the range 0 to 1 (0 being the start of the text, 1 being the end). For the common case where text is horizontal, you can calculate its start position like so:
 float left = text.getCorners()[0] + (text.getOffset(pos) * text.getLength());
 

Parameters:
pos - the position of the letter in the Text to retrieve the position for, in the range 0 to getText().length() - 1
Since:
2.6.12
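
For text that is not horizontal, the same fraction can be applied along the baseline, which getCorners() describes as running from (x1,y1) to (x4,y4). The following is a minimal sketch of that interpolation, not part of the library: it uses plain float arrays in place of a real Text object, and it assumes the corners array is laid out as x1,y1,x2,y2,x3,y3,x4,y4. The class and method names are hypothetical.

```java
// Sketch only: plain floats stand in for the values that getCorners()
// and getOffset(pos) would return. No BFO classes are used here.
final class BaselinePosition {

    /**
     * Interpolate along the baseline from (x1,y1) to (x4,y4).
     *
     * @param corners  eight floats, assumed order x1,y1,x2,y2,x3,y3,x4,y4
     * @param fraction value in the range 0 to 1 (0 = start of text, 1 = end)
     * @return a two-element array {x, y}
     */
    static float[] pointOnBaseline(float[] corners, float fraction) {
        float x1 = corners[0], y1 = corners[1];
        float x4 = corners[6], y4 = corners[7];
        return new float[] {
            x1 + fraction * (x4 - x1),
            y1 + fraction * (y4 - y1)
        };
    }

    public static void main(String[] args) {
        // Horizontal text: baseline from (100,200) to (160,200).
        float[] horizontal = { 100f, 200f, 100f, 212f, 160f, 212f, 160f, 200f };
        float[] p = pointOnBaseline(horizontal, 0.5f);
        System.out.println(p[0] + ", " + p[1]);

        // Text running straight up the page: baseline from (0,0) to (0,50).
        float[] vertical = { 0f, 0f, -12f, 0f, -12f, 50f, 0f, 50f };
        float[] q = pointOnBaseline(vertical, 0.2f);
        System.out.println(q[0] + ", " + q[1]);
    }
}
```

For horizontal text the x value reduces to the one-line formula shown above, because the baseline length equals x4 - x1.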

getPage

public PDFPage getPage()
Return the PDFPage this text was found on - simply the page the parent PageExtractor was created from.

Since:
2.6.12

getPageExtractor

public PageExtractor getPageExtractor()
Return the PageExtractor this text was created from

Since:
2.10.3

getColor

public abstract Paint getColor()
Return the color of this text

Returns:
the color

getFontName

public abstract String getFontName()
Return the font name of this text

Returns:
the name of the font

getText

public abstract String getText()
Return the text content of this text

Returns:
the text

getTextLength

public abstract int getTextLength()
Return the length of the String returned by getText()

Since:
2.11.7

compareTo

public abstract int compareTo(Object o)
Specified by:
compareTo in interface Comparable

getRowNext

public abstract PageExtractor.Text getRowNext()
Return the next Text item in this row, or null if there are none

Since:
2.10.3

getRowPrevious

public abstract PageExtractor.Text getRowPrevious()
Return the previous Text item in this row, or null if there are none

Since:
2.10.3
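
Because getRowPrevious() and getRowNext() both return null at the ends of a row, a whole row can be collected by walking back to the first item and then forward. The sketch below illustrates that traversal only. RowItem and Node are hypothetical stand-ins that mirror the three method names used, so that the example runs without the library; with the real API the same loops would operate on PageExtractor.Text.

```java
import java.util.ArrayList;
import java.util.List;

final class RowWalk {

    /** Hypothetical stand-in exposing only the methods this sketch needs. */
    interface RowItem {
        RowItem getRowNext();
        RowItem getRowPrevious();
        String getText();
    }

    /** Collect the text of every item in the row containing the given item. */
    static List<String> rowTexts(RowItem item) {
        // Walk left until there is no previous item.
        RowItem first = item;
        while (first.getRowPrevious() != null) {
            first = first.getRowPrevious();
        }
        // Walk right from the first item, collecting text as we go.
        List<String> out = new ArrayList<>();
        for (RowItem t = first; t != null; t = t.getRowNext()) {
            out.add(t.getText());
        }
        return out;
    }

    /** Simple in-memory doubly linked item, for demonstration only. */
    static final class Node implements RowItem {
        final String text;
        Node prev, next;
        Node(String text) { this.text = text; }
        public RowItem getRowNext() { return next; }
        public RowItem getRowPrevious() { return prev; }
        public String getText() { return text; }
    }

    /** Build a linked row from the given strings. */
    static Node[] link(String... texts) {
        Node[] nodes = new Node[texts.length];
        for (int i = 0; i < texts.length; i++) {
            nodes[i] = new Node(texts[i]);
            if (i > 0) {
                nodes[i - 1].next = nodes[i];
                nodes[i].prev = nodes[i - 1];
            }
        }
        return nodes;
    }

    public static void main(String[] args) {
        Node[] row = link("Hello", "world", "!");
        // Start from the middle item; the whole row is still returned.
        System.out.println(rowTexts(row[1]));
    }
}
```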

getFontMetaData

public abstract Reader getFontMetaData()
                                throws IOException
Return any XMP MetaData that has been set on the Font, or null if none exists.

Throws:
IOException
Since:
2.11.6
See Also:
PDF.getMetaData()

getSubText

public abstract PageExtractor.Text getSubText(int off,
                                              int len)
Return a substring of this Text object as another Text object

Parameters:
off - the offset into the text
len - the number of characters to return
Since:
2.11.7

getPrimaryText

public abstract PageExtractor.Text getPrimaryText()
If this text is a subtext or a collection of Text objects, return the primary text it starts with. If not, returns null.

Since:
2.11.7

getPrimaryTextOffset

public abstract int getPrimaryTextOffset()
If this text is a subtext or a collection of Text objects, return the offset into the primary text where it starts. If not, returns 0.

Since:
2.11.7

getByteLength

public abstract int getByteLength()
Get the length of the original text in bytes. This method is required because the Highlight File Format refers to byte offsets into the string, not the character offsets that its specification states.

Since:
2.11.12

getByteToCharOffset

public abstract int getByteToCharOffset(int byteoffset)
Given a byte offset into the original String, return the Character offset it refers to.

Since:
2.11.12
See Also:
getByteLength()
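
To illustrate why byte and character offsets differ, here is a minimal sketch of a byte-to-character mapping. It is not the library's implementation: the encoding of the original text is not stated in this documentation, so UTF-8 is assumed purely for illustration, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class ByteOffsets {

    /**
     * Return the index of the char whose encoded bytes contain the given
     * byte offset, or s.length() if the offset is at or past the end.
     * ASSUMPTION: the text is encoded as UTF-8.
     */
    static int byteToCharOffset(String s, int byteOffset) {
        int bytesSoFar = 0;
        int i = 0;
        while (i < s.length()) {
            int codePoint = s.codePointAt(i);
            int encodedLength = new String(Character.toChars(codePoint))
                    .getBytes(StandardCharsets.UTF_8).length;
            if (bytesSoFar + encodedLength > byteOffset) {
                return i; // the requested byte lies inside this character
            }
            bytesSoFar += encodedLength;
            i += Character.charCount(codePoint);
        }
        return s.length();
    }

    /** Length of the string in bytes under the same UTF-8 assumption. */
    static int byteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "caf\u00e9 au"; // the accented letter takes two bytes in UTF-8
        System.out.println(s.length() + " chars, " + byteLength(s) + " bytes");
        System.out.println(byteToCharOffset(s, 5));
    }
}
```

In this example the string has 7 characters but 8 bytes, so any byte offset after the accented letter is one greater than the matching character offset.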


Copyright © 2001-2013 Big Faceless Organization