Tuesday, 23 September 2014

Recovering the formatting of an old document badly converted to Microsoft Word

I have just received an old document with a .doc extension.
It is education related: I am not sharing any details, not even the file name, just some screenshots with minimal detail.
Its formatting is awful when it is imported into LibreOfficeDev 4.3.



It seems to have been written with QuarkXPress (it has different margins for even and odd pages, something typical of that tool) and then exported to Microsoft Word. Somehow all the formatting was lost in the conversion, and it is impossible to recover it. For example, hyphenation is now "hard coded" in the document, so it has to be fixed by hand ... or by a smart macro.
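As a rough idea of what such a "smart macro" could do once the text has been extracted to plain text, here is a minimal Python sketch. It assumes the hard-coded hyphens sit at the end of a line, directly after a letter, with the rest of the word starting the next line. The helper name undo_hard_hyphenation is hypothetical, not something taken from the document or from LibreOffice.

```python
import re

# Assumption: a hard-coded hyphenation looks like "for-\nmato", that is,
# a word character, a hyphen, a line break, then another word character.
_HYPHEN_BREAK = re.compile(r"(\w)-[ \t]*\n[ \t]*(\w)")


def undo_hard_hyphenation(text):
    """Join words that were split by a hyphen at the end of a line.

    Hypothetical helper: it cannot tell a hyphenation break from a real
    compound word that happens to break at its hyphen, so the result
    still needs a manual review.
    """
    return _HYPHEN_BREAK.sub(r"\1\2", text)


if __name__ == "__main__":
    sample = "El for-\nmato se ha perdido"
    print(undo_hard_hyphenation(sample))
```

A hyphen surrounded by spaces (a dash used as punctuation) is left alone, because the pattern requires a word character immediately before the hyphen.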


Some info from "file" on Ubuntu. The file really is a Microsoft Word document (it could have been another format just renamed as .doc):

XXX.doc: Composite Document File V2 Document, Little Endian, Os: Windows, Version 6.1, Code page: 1252, Title: IES Calatalifa, Author: USUARIO, Template: Normal.dotm, Last Saved By: DIRECCION, Revision Number: 2, Name of Creating Application: Microsoft Office Word, Total Editing Time: 03:00, Last Printed: Wed Jul  2 09:48:00 2014, Create Time/Date: Wed Sep 10 09:07:00 2014, Last Saved Time/Date: Wed Sep 10 09:07:00 2014, Number of Pages: 103, Number of Words: 24344, Number of Characters: 133897, Security: 0
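The same "is it really a .doc?" check can be sketched in Python without the "file" tool. A Composite Document File (the OLE2 container used by legacy .doc files) starts with the 8 bytes D0 CF 11 E0 A1 B1 1A E1. The function names below are hypothetical, and the check only confirms the container: other legacy Office formats such as .xls use the same one, so it does not prove the content is a Word document.

```python
# Signature at the start of every Composite Document File (OLE2) container.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def has_ole2_signature(header):
    """Return True if the given bytes start with the OLE2 signature."""
    return header[:8] == OLE2_MAGIC


def file_is_ole2(path):
    """Read the first 8 bytes of a file and test them (hypothetical helper)."""
    with open(path, "rb") as handle:
        return has_ole2_signature(handle.read(8))


if __name__ == "__main__":
    import sys

    # Usage: python check_doc.py XXX.doc
    for name in sys.argv[1:]:
        print(name, "OLE2 container" if file_is_ole2(name) else "not OLE2")
```

A file that was just renamed to .doc (for example a .docx, which is a ZIP archive starting with "PK") would fail this test.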

The options for changing the formatting are minimal:
-The whole document has the same style, just one style applied
-Each page has its own "frame", so trying to select the whole document with "Ctrl+A" selects just one page....

There are two landscape pages where the situation is even worse: in the Microsoft Word conversion, EACH CHARACTER! is in its own frame ...


So, it has to be done manually.
The document has just 69 pages, no images, and not much formatting (just a vertical table): there is no other option but to recover all the text, create a new document, undo the manual hyphenation, and start formatting from scratch.
1. How to recover all text?
-Portrait "normal" pages: select page by page in LibreOffice ... or export to PDF and copy from the PDF, where "Ctrl+A" allows full selection
-Landscape pages: selection in LibreOffice is not possible (just one character at a time), and in the PDF the table format is lost and the characters have spaces between them...

Trying
pdftotext -layout -f 67 -l 68 XXX.pdf
there is a somewhat cleaner output, but still with spaces between the characters
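The leftover spaces could probably be cleaned up with a small script. The Python sketch below rests on an assumption I have not verified against the real output: inside a word the characters are separated by one space, and words are separated by two or more. Function names are hypothetical. Lines that do not look letter-spaced are left untouched, so normal text is not glued together.

```python
import re


def looks_letter_spaced(line):
    """Guess whether a line is letter-spaced: most tokens are one character."""
    tokens = line.split()
    if len(tokens) < 3:
        return False
    single = sum(1 for token in tokens if len(token) == 1)
    return single / len(tokens) >= 0.8


def collapse_letter_spacing(line):
    """Remove single spaces inside words.

    Assumption: runs of two or more spaces mark the word boundaries.
    """
    words = re.split(r" {2,}", line.strip())
    return " ".join(word.replace(" ", "") for word in words if word)


def fix_line(line):
    """Collapse a line only when it looks letter-spaced."""
    if looks_letter_spaced(line):
        return collapse_letter_spacing(line)
    return line


if __name__ == "__main__":
    print(fix_line("H o r a r i o   d e   c l a s e"))
```

If the real output uses a single space everywhere, word boundaries are lost and this approach does not work; a dictionary-based split or retyping the two pages by hand would be needed instead.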


