How can I read a string from a PDF file in C?

I want to create a program that computes the edit distance between two files. My code works with strings read from a txt file, but now I want to read strings from PDF, DOC, etc. How can I read strings from these files? I tried the fread function, but it does not work. This is the code that I wrote:

    void method()
    {
        FILE *file;
        char *str;
        if ((file = fopen("C:/Users/latin/Desktop/prova.pdf", "rb")) == NULL) {
            printf("Error!\n");
        }
        fread(&str, 18, 1, file);
        printf("%s", str);
    }

prova.pdf is a PDF file that contains this string: ciaoCiao merendina.

asked Sep 1, 2020 at 16:34 by user13465503

Analyzing PDF files may be tricky. If you create a word processor document with your example text and convert it to PDF, the resulting PDF file does not necessarily contain the unchanged string. It may get split into parts or even converted into graphical elements. You probably need either a library that can extract text from a PDF file or a library/program to convert PDF to text.

Commented Sep 1, 2020 at 16:51

PDF is not at all a text format. It is an open standard though, so you could write something that could read from it. Better to grab a free lib to do it for you though if possible. resources.infosecinstitute.com/pdf-file-format-basic-structure/…

Commented Sep 1, 2020 at 18:42

I read somewhere that I can get the strings out of a PDF by reading its binary content. Do you know how to do that?

– user13465503 Commented Sep 1, 2020 at 18:49

You read wrongly. There is no generic way to simply read text from a PDF file without understanding and using the format of the PDF file structure.

Commented Sep 1, 2020 at 21:02

1 Answer

It is possible to do this in plain C. Adobe did it. Artifex did it. Others have done it. But as commented, it is a ton of work. Still, I can outline the steps to give you a feel for what's involved.

First you could read the "Magic Number" at the start and check that it is actually a PDF. It should start with %PDF- followed by a version number. But apparently many PDF producers don't conform to this requirement.
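As a rough illustration of that first check, here is a minimal sketch. The function name `looks_like_pdf` is made up for this example, and the check is deliberately strict: it only accepts files whose very first five bytes are `%PDF-`, so files from the non-conforming producers mentioned above would be rejected. The file is assumed to be opened in binary mode (`"rb"`).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the stream begins with "%PDF-",
 * 0 otherwise. The stream must be seekable and opened in binary mode. */
int looks_like_pdf(FILE *fp)
{
    char header[5];

    /* Always test the very first bytes, wherever the position was. */
    if (fseek(fp, 0L, SEEK_SET) != 0)
        return 0;
    if (fread(header, 1, sizeof header, fp) != sizeof header)
        return 0;
    return memcmp(header, "%PDF-", sizeof header) == 0;
}
```

A more tolerant reader might instead search the first kilobyte or so for the marker, but that is a design choice beyond this sketch.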

Next, you need to skip to the very end of the file and read backwards, looking for something like:

    startxref
    1581
    %%EOF

That number is the byte-offset of the start of the X-Reference table which lists the binary offsets of all the "objects" in the file. An object can be a Page or a Font or a Content Stream or many other things.
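The "read backwards from the end" step could be sketched roughly as follows. The name `find_startxref` and the 1024-byte tail size are assumptions made for this example, not values taken from any particular reader. The idea is to load the last chunk of the file and scan it from the back for the `startxref` keyword.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TAIL_SIZE 1024  /* assumed to be enough to cover the trailer */

/* Hypothetical helper: returns the number that follows the last
 * "startxref" keyword, or -1 if it cannot be found in the final
 * TAIL_SIZE bytes. The stream must be opened in binary mode. */
long find_startxref(FILE *fp)
{
    static const char keyword[] = "startxref";
    const size_t kwlen = sizeof keyword - 1;
    char tail[TAIL_SIZE + 1];
    char *num, *end;
    long file_size, start, value;
    size_t n, i;

    if (fseek(fp, 0L, SEEK_END) != 0)
        return -1;
    file_size = ftell(fp);
    if (file_size < 0)
        return -1;
    start = file_size > TAIL_SIZE ? file_size - TAIL_SIZE : 0;
    if (fseek(fp, start, SEEK_SET) != 0)
        return -1;

    n = fread(tail, 1, TAIL_SIZE, fp);
    tail[n] = '\0';               /* lets strtol stop safely */
    if (n < kwlen)
        return -1;

    /* Scan backwards so that the last occurrence wins. */
    for (i = n - kwlen + 1; i-- > 0; ) {
        if (memcmp(tail + i, keyword, kwlen) == 0) {
            num = tail + i + kwlen;
            value = strtol(num, &end, 10);  /* skips the line break */
            return end == num ? -1 : value;
        }
    }
    return -1;
}
```

The returned value can then be passed to `fseek` to jump to the start of the table.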

Looking at the X-Reference table, you'll see something like this:

    xref
    0 4
    0000000000 65535 f
    0000000010 00000 n
    0000000063 00000 n
    0000000127 00000 n
    0000000234 00000 n
    trailer
    << /Root 1 0 R ... >>
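Each entry in a table like this is a fixed-width record: a 10-digit number (the byte offset, for objects that are in use), a 5-digit generation number, and a flag that is `n` for an in-use object or `f` for a free one. A minimal sketch of parsing one such line, with the struct and function names invented for this example:

```c
#include <stdio.h>

/* Hypothetical record type for one cross-reference entry. */
struct xref_entry {
    long offset;      /* byte offset of the object (in-use entries) */
    int  generation;  /* generation number */
    char type;        /* 'n' = in use, 'f' = free */
};

/* Parses one entry such as "0000000063 00000 n".
 * Returns 1 on success, 0 if the line is not an xref entry. */
int parse_xref_entry(const char *line, struct xref_entry *out)
{
    long offset;
    int generation;
    char type;

    /* %ld and %d read decimal, so the leading zeros are harmless. */
    if (sscanf(line, "%10ld %5d %c", &offset, &generation, &type) != 3)
        return 0;
    if (type != 'n' && type != 'f')
        return 0;

    out->offset = offset;
    out->generation = generation;
    out->type = type;
    return 1;
}
```

The position of an entry within its subsection gives the object number, so a full reader would also track the two numbers on the subsection header line; that part is left out here.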

The line /Root 1 0 R tells you which object is the root of the document tree. You'll need to examine this object to find the top-level Pages object which looks like this:

    2 0 obj
    << /Type /Pages /Kids [ 3 0 R ] ... >>
    endobj

The Kids element here contains a reference to the first Page object which looks like this:

    3 0 obj
    << /Type /Page /Contents 5 0 R ... >>
    endobj

Then you'll need to find the Contents object referenced here. A Content stream, if it's not encrypted or compressed, will show you the drawing commands and text commands being drawn to the page.

    5 0 obj
    << ... >>
    stream
    BT
    /F1 10.0 Tf
    30.0 750.0 Td
    (

Text commands will always be bracketed by BT ... ET. In here, you can finally see the strings wrapped in parens. But you'll have to pay attention to the coordinates (30.0 750.0 Td) of each string to figure out which ones are part of the same logical line.
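To give a feel for that last step, here is a simplified sketch that pulls the parenthesized strings out of an uncompressed, unencrypted content stream already loaded into memory. The function name `extract_literal_strings` is invented for this example, and it makes several assumptions: it ignores the BT/ET markers and the coordinates entirely, does not handle hex strings, and treats a backslash escape by simply keeping the character that follows it (so octal escapes and `\n` are not decoded).

```c
#include <stddef.h>

/* Hypothetical helper: copies the contents of every (...) literal
 * string in `stream` into `out`, one string per line, and returns
 * the number of strings found. Output is truncated to fit `out`. */
size_t extract_literal_strings(const char *stream, size_t len,
                               char *out, size_t out_size)
{
    size_t i, used = 0, count = 0;
    int depth = 0;  /* parenthesis nesting level; 0 = outside a string */

    if (out_size == 0)
        return 0;

    for (i = 0; i < len; i++) {
        char c = stream[i];

        if (depth == 0) {
            if (c == '(')
                depth = 1;      /* a string starts here */
            continue;           /* skip operators and numbers */
        }

        if (c == '\\' && i + 1 < len) {
            c = stream[++i];    /* simplified: keep the escaped char */
        } else if (c == '(') {
            depth++;            /* balanced parens may nest */
        } else if (c == ')') {
            if (--depth == 0) {
                c = '\n';       /* end of string: emit a separator */
                count++;
            }
        }

        if (used + 1 < out_size)
            out[used++] = c;
    }
    out[used] = '\0';
    return count;
}
```

Given a stream such as `BT /F1 10.0 Tf 30.0 750.0 Td (ciaoCiao merendina) Tj ET`, this yields the single line `ciaoCiao merendina`. Whether those bytes are readable text still depends on the font encoding caveats described below.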

If the PDF was created from a word processor, it is likely to contain text in this form but with lots of caveats. It might have re-encoded fonts and the text strings will no longer represent ASCII characters but just positions in the font's encoding vector. If the PDF was created from a scanned document, it may just contain images of the pages with no text content at all unless it has gone through a conversion involving OCR.