From grdetil@scrc.umanitoba.ca Thu Feb 25 13:49:00 1999
Date: Thu, 25 Feb 1999 14:13:46 -0600 (CST)
From: Gilles Detillieux
To: htdig@htdig.org
Subject: [htdig] Using pdftotext to index PDF documents

OK, folks, I'm doing some major back-pedaling! I've decided to give up on
acroread and htdig/PDF.cc after all, and I've switched to pdftotext in an
external parser. Here are my reasons:

1) acroread isn't open source, xpdf is.

2) Parsing PDFs is not straightforward, nor is parsing acroread's PS
   output. Sylvain made a valiant attempt at it, but I think there are
   too many exceptions that don't fit the cases his code handles.

3) Derek Noonburg really did his homework when he developed xpdf and
   pdftotext. Patrick reported that it worked well with all his PDFs.
   I think if we want good, open source support for PDFs, this is the
   way to go.

4) Derek also fixed my problem with pdftotext concatenating words in
   some of my PDFs. There are still a few quirks, where some words are
   concatenated, but it's MUCH better now. Also, pdftotext doesn't
   misplace the large caps like the various PostScript-based solutions
   did. So, with this latest fix, this is the package I want to use!

So, after reconsidering, I think htdig/PDF.cc probably ought to be
scrapped. (Sorry, Sylvain.) I don't know about integrating the xpdf code
right into htdig, but I think as an external parser this is the package
to use. I think Patrick was right that pdftotext does a better job of
extracting text from a PDF than any other tool around.

There's still a bit more work to be done. Patrick mentioned that pdftotext
changed hyphens to spaces. Not so, but parse_doc.pl does. In fact, it
converts all punctuation to spaces, to separate out the words. The problem
is that, right now, that same word list is also what it spits out for the
"h" record. So there's no punctuation at all in the excerpts! I'm sure
this would be fairly easy to fix, and I hope to get to it later today.
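
To make the word-list versus excerpt distinction concrete, here is a
minimal sketch in C++. It is not parse_doc.pl's code (which is Perl and
isn't shown in this message) and not how htdig does it; the function names
extract_words and make_excerpt are hypothetical, and treating every
non-alphanumeric character as a word separator is an assumption made for
illustration.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: split text into index words, treating every
// non-alphanumeric character as a separator (roughly the effect of
// converting all punctuation to spaces before splitting).
std::vector<std::string> extract_words(const std::string &text) {
    std::vector<std::string> words;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            current += static_cast<char>(ch);
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}

// Hypothetical helper: build excerpt text from the same input by
// collapsing runs of whitespace into single spaces and leaving the
// punctuation untouched.
std::string make_excerpt(const std::string &text) {
    std::string out;
    bool pending_space = false;
    for (unsigned char ch : text) {
        if (std::isspace(ch)) {
            // Remember the gap, but never emit a leading space.
            pending_space = !out.empty();
        } else {
            if (pending_space) out += ' ';
            out += static_cast<char>(ch);
            pending_space = false;
        }
    }
    return out;
}
```

The point of keeping two functions is that the words and the excerpt are
derived separately from the original text, so stripping punctuation for
the word list no longer affects what shows up in the excerpt.
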
I want to make its text parsing similar to the parsing done by
htdig/Plaintext.cc. (Which raises the question: "why can't an external
parser just pass plain text or HTML to htdig for further parsing?")

Some users may also want to extract the titles from their PDFs, as
Sylvain's code did. parse_doc.pl doesn't do that right now, but with a bit
more coding, using the pdfinfo utility in xpdf, it would be an easy
addition. I haven't done it because my PDFs didn't have reasonable titles
anyway, so I'd just as soon use the file name.

Anyway, here's Derek's fix for my concatenation problem:

--- xpdf/TextOutputDev.cc.deltax  Fri Nov 27 21:42:16 1998
+++ xpdf/TextOutputDev.cc  Thu Feb 25 09:55:28 1999
@@ -217,6 +217,7 @@ void TextPage::addChar(GfxState *state,
   double x1, y1, w1, h1;
 
   state->transform(x, y, &x1, &y1);
+  dx -= state->getCharSpace();
   state->transformDelta(dx, dy, &w1, &h1);
   curStr->addChar(state, x1, y1, w1, h1, c, useASCII7);
 }

And to bring you up to speed, here is my dialogue with Derek:

> From: Gilles Detillieux
> Subject: Re: bug in pdftotext in xpdf 0.80
> To: derekn@foolabs.com (Derek B. Noonburg)
> Date: Thu, 25 Feb 1999 10:04:23 -0600 (CST)
>
> Hi again, Derek. Thanks for the prompt response, and the bug fix too!
>
> According to Derek B. Noonburg:
> > > I have some strange PDF files, though, which come from Corel DRAW documents,
> > > and these seem to confuse pdftotext. For example, if you try it out on:
> > >
> > >   http://www.scrc.umanitoba.ca/SCRC/profile/profile_rob_98.pdf
> > >
> > > You'll see that most of the words are concatenated. However, when I
> > > view it in xpdf, it looks fine, and when I pass it through pdftops, and
> > > pass the PS file through ps2ascii (from gs 3.33), it also comes out OK.
> > > I'd appreciate it if you can solve this little problem. The file seems
> > > to crank the character spacing way up with a Tc command, and uses this
> > > as a word spacing, rather than using actual space characters or motion
> > > commands.
> >
> > You're right about the cause of the problem. Pdftotext was using the
> > "delta-x" for the character (width + char spacing) instead of just the
> > width.
> >
> > The fix is simple, if you don't mind recompiling. In
> > xpdf/TextOutputDev.cc, insert a line in TextPage::addChar():
> >
> > void TextPage::addChar(GfxState *state, double x, double y,
> >                        double dx, double dy, Guchar c) {
> >   double x1, y1, w1, h1;
> >
> >   state->transform(x, y, &x1, &y1);
> >   dx -= state->getCharSpace();          // insert this line
> >   state->transformDelta(dx, dy, &w1, &h1);
> >   curStr->addChar(state, x1, y1, w1, h1, c, useASCII7);
> > }
>
> I don't mind recompiling at all. I'll post a patch to the ht://Dig mailing
> list, as we've been discussing using this tool as a PDF parser for indexing
> PDF documents on a web site. Right now, htdig uses acroread to spit out
> PS, and does some rudimentary parsing on the PS output. It sort of works,
> but there have been problems with it. Also, acroread isn't open source, but
> your tools are, so a lot of users are very interested in switching over.
>
> > > Don't worry about the misplaced Ns -- these happen because Corel DRAW
> > > outputs the large caps before the rest of the text.
> >
> > I just tried pdftotext, and these aren't misplaced... I'm not sure what
> > you mean.
> >
> > Thanks for the bug report.
>
> Thanks for the bug fix! You're right, the Ns aren't misplaced at all. I
> was confusing this tool with the "pdftops ... | ps2ascii" pipeline, which
> did misplace the Ns. All the more reason to use pdftotext for indexing!
>
> Thanks again for the help.

-- 
Gilles R. Detillieux              E-mail:
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

------------------------------------
To unsubscribe from the htdig mailing list, send a message to
htdig@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
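
For anyone wondering why that one-line change matters, here is a toy model
in C++ of the effect Derek describes. It is only a sketch under stated
assumptions: it is not xpdf's actual word-break logic (which isn't shown
in this thread), and the Glyph struct, join_glyphs, sample_glyphs, and the
minGap threshold are all hypothetical names and values made up for
illustration. The idea is that if a character's box covers width plus
character spacing, a large Tc value used as a word gap gets swallowed by
the box, so no gap is left to detect.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical glyph record: where the glyph starts, how wide the glyph
// itself is, and the extra character spacing (the Tc value) after it.
struct Glyph {
    char c;
    double x;
    double width;
    double charSpace;
};

// Join glyphs into text, inserting a space wherever the gap between the
// end of the previous glyph's box and the start of the next one exceeds
// minGap. With includeCharSpace set, the box covers width + charSpace
// (the behaviour described as the bug); otherwise it covers width only.
std::string join_glyphs(const std::vector<Glyph> &glyphs,
                        bool includeCharSpace, double minGap) {
    std::string out;
    double prevEnd = 0.0;
    for (std::size_t i = 0; i < glyphs.size(); ++i) {
        const Glyph &g = glyphs[i];
        if (i > 0 && g.x - prevEnd > minGap) out += ' ';
        out += g.c;
        prevEnd = g.x + g.width + (includeCharSpace ? g.charSpace : 0.0);
    }
    return out;
}

// Made-up layout for "ab cd": the 'b' carries a large charSpace that
// acts as the word gap, the way the Corel DRAW files are described.
std::vector<Glyph> sample_glyphs() {
    return {
        {'a', 0.0, 5.0, 0.0},
        {'b', 5.0, 5.0, 4.0},
        {'c', 14.0, 5.0, 0.0},
        {'d', 19.0, 5.0, 0.0},
    };
}
```

In this model, calling join_glyphs(sample_glyphs(), true, 2.0) runs the
two words together, while passing false leaves a gap of 4.0 after the 'b',
which is above the threshold and so produces a space.
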