Ten months after its 2.0 release comes version 3 of Docvert. It builds upon OpenOffice.org or Abiword and converts any word processing document to HTML, DocBook, RSS, or any other XML format. People can migrate away from Word with this tool, or integrate it into their tool-chain with its REST interface.
I can’t really access the website before I get RTFM’d by the masses, so I only have the article description to go by for my comment. It is a nice idea, but will it really help people migrate from MS Word? If it is built on OpenOffice.org then it will rely on its Word filters, and we all know that they are not suitable for reading all Microsoft documents.
To help people move away from MS Office solutions, a robust tool is needed to ensure, or at least give the user confidence, that there will be minimal mangling of the document’s formatting. There are many situations where huge documents make it impractical to search visually for formatting errors.
If this tool were built on the Windows side, where most MS users are, and integrated nicely into MS Word, where most MS users will obviously be, then I can see it giving businesses an option to start migration. This more or less applies to the academic case, and to the home user case to a lesser degree.
Now that I have reached this far, and the web page has finally loaded, I realise that I have somewhat missed the mark with my commentary. :) I guess I should rightfully say: ignore the above.
Another simpler option is called antiword. It basically dumps out a word doc as plain text, which is nice for piping around and whatever else. Not as robust as going to an XML solution or anything the tool in the article mentions, but hey it’s portable, small, and has no dependencies.
http://www.winfield.demon.nl/
I’ll have to try out the article’s tool too, although it looks like once everything is set up, I could have just opened the Word doc in OO.org and resaved it in a different format…
Antiword can do PDF, PS, and XML output as well. In my experience (admittedly with rather simple documents), Antiword does an excellent job.
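For anyone who wants to script the antiword modes mentioned above, here is a minimal Python sketch. The flags are the ones I recall from the antiword man page (plain text by default, `-p` for PostScript, `-a` for PDF, `-x db` for DocBook XML), so check them against your installed version. The mode labels, the function names and the `a4` paper size are just choices made for this example.

```python
import shutil
import subprocess

# Flags as recalled from the antiword man page; the dictionary keys are
# made-up labels for this sketch, and "a4" is an arbitrary paper size.
ANTIWORD_MODES = {
    "text": [],                  # plain text is antiword's default output
    "postscript": ["-p", "a4"],
    "pdf": ["-a", "a4"],
    "docbook": ["-x", "db"],
}


def antiword_command(doc_path, mode="text"):
    """Build the antiword argument list for one of the modes above."""
    return ["antiword", *ANTIWORD_MODES[mode], doc_path]


def convert(doc_path, mode="text"):
    """Run antiword and return whatever it writes to standard output."""
    if shutil.which("antiword") is None:
        raise RuntimeError("antiword is not installed or not on PATH")
    result = subprocess.run(
        antiword_command(doc_path, mode), capture_output=True, check=True
    )
    return result.stdout
```

Because everything goes to standard output, the result of `convert("report.doc")` can be piped or post-processed like any other byte string.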
Besides, Abiword uses wv2 for import of MS Word documents–and wv2 is available as a standalone console application. (KWord also uses wv2.)
If you have MS Office installed and you need to convert a ‘doc’ file into an ‘odt’ file, then you’d need an application that can open ‘odt’ files anyway. Each of the applications Abiword, OO Writer and KWord can deal with both MS Office and open standard formats, which means you can simply do ‘Save as...’ and save the ‘doc’ file as ‘odt’, HTML or plain text (among many more formats).
What is so spectacular about this tool?
maybe because they just released version 3 :/ !
Could be why Office 2007 announced they are changing the file format (snuck it in while everyone was blinded by the new tool bar): the old one is just too easily converted to something usable.
The new format is probably easier to convert than the old actually.
//The new format is probably easier to convert than the old actually.//
What feature of the new format makes it easier to convert than the old one?
//What feature of the new format makes it easier to convert than the old one?//
It’s publicly documented XML rather than internally documented binary.
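As a rough illustration of what “it’s XML” buys a converter: the main text of a WordprocessingML document sits in an XML part inside a Zip package, so the Python standard library is enough to pull out the paragraphs. This is only a sketch. The sample package below is a stripped-down stand-in built in memory (a real file carries more parts, and Word would not open this one), and the namespace is the one I believe the published spec uses.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (assumed to match the published spec).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_sample_package():
    """Create a tiny stand-in for a .docx package in memory (illustrative)."""
    document_xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>from </w:t></w:r><w:r><w:t>XML</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", document_xml)
    return buffer.getvalue()


def paragraphs(package_bytes):
    """Return the text of each paragraph in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    result = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join the text nodes.
        result.append("".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t")))
    return result
```

Getting the text out is the easy part, of course. Whether the formatting semantics are specified well enough to convert faithfully is a separate question.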
//What feature of the new format makes it easier to convert than the old one?
It’s publicly documented XML rather than internally documented binary.//
I beg to differ.
http://www.consortiuminfo.org/standardsblog/article.php?story=20070…
http://www.groklaw.net/article.php?story=2007011720521698
http://www.grokdoc.net/index.php/EOOXML_objections
http://www.grokdoc.net/index.php/EOOXML_at_JTC-1
The new version of Microsoft Office formats (known as OOXML) is publicly documented, obscured, internal (and unspecified), Microsoft-dependent, reinvent-the-wheel-at-every-turn-in-order-to-avoid-open-standards, locked-in XML.
The best advice is to avoid OOXML like the plague.
DO NOT use Office Open XML format to save your documents in. If you save documents in OOXML, you will be up for an absolute fortune going forward.
Edited 2007-01-19 05:09
The point stands, regardless of what point you’re trying to make, about this new format being easier to convert than the previous. The previous was all binary. This is mostly XML.
The best advice is to avoid ODF fanatics.
//The previous was all binary. This is mostly XML. //
Correction: it is mostly an XML wrapper around the previous all binary blobs.
//this new format being easier to convert than the previous.//
I don’t think so.
Some existing applications do a fair job of handling previous all-binary MS Office formats, including some existing versions of MS Office, up to and including Office 2003.
Hence, OpenOffice.org or any existing version of MS Office in conjunction with a proper ODF plug-in (not the MS-sponsored one, but a proper one) is an immeasurably better option than Office 2007 and OOXML.
//The best advice is to avoid ODF fanatics.//
http://www.consortiuminfo.org/standardsblog/article.php?story=20070…
Andy Updegrove:
http://www.gesmer.com/attorneys/updegrove.php
http://www.gesmer.com/practice_areas/consortium.php
is hardly a fanatic.
sappyvcv OTOH clearly is a fanatic.
Take the best advice folks, and DO NOT USE the OOXML formats for your documents.
You will save an absolute fortune down the track if you heed that sage advice now.
//Correction: it is mostly an XML wrapper around the previous all binary blobs.//
Correction: Unless you embed images or other media, it is XML. This is fact.
Do not spread FUD.
I was telling people to avoid *you*, not this Andy character.
Hey, if you think I’m a fanatic for stating facts, well that’s your own problem.
//Correction: Unless you embed images or other media, it is XML. This is fact.
Do not spread FUD.
I was telling people to avoid *you*, not this Andy character.//
Here are the facts:
Refer here:
http://www.groklaw.net/article.php?story=2007011720521698
(section – comments by Marbux):
“File formats with no specification
…
However, the specifications for those legacy Microsoft file formats, the sole justification offered for duplicating the functionality of the OpenDocument standard, appear nowhere in the EOOXML specification and are unavailable to other developers. Yet those formats’ implementation is mandatory for conformance with the specification.
…
The unavailability of the specifications for virtually all of the legacy file formats also clashes irreconcilably with the verifiability requirements of section A.4 of Annex A to IEC Directives, Rules for the structure and drafting of International Standards, Part 2, (“[w]hatever the aims of a product standard, only such requirements shall be included as can be verified”). If compatibility with and implementations of the specifications for those legacy formats are mandatory for conformance with the proposed standard, disclosure of the specifications for the legacy file formats is necessary even to consider whether EOOXML achieves Ecma’s stated goal of compatibility with those formats.
…
Vendor-specific application dependencies
The EOOXML specification is inappropriately replete with dependencies on a single vendor’s software. As an example, “autoSpaceLikeWord95” (page 2161) merely defines semantics in reference to a legacy application whose specific behavior is nowhere specified. Instead, vendors are repeatedly urged to study the referenced applications to determine appropriate behavior. But no relevant specification is available for other developers to use and Microsoft’s Open Specification Promise grants no right to decompile and reverse engineer the company’s legacy applications.
…
The EOOXML specification also creates barriers to interoperability where such barriers seem gratuitous. For example, the “Workbook Protection” section (page 2698) defines an encryption algorithm by including several pages of C-language source code that appears to have byte-ordering dependencies that will produce different results on different machine architectures. Likewise, the “Clipboard Data” section (page 5905) defines a schema type that can encode clipboard format values for Windows and the Macintosh, but appears not to allow for use by other operating systems. Yet another is “Conditional Formatting Bitmask” (page 2478), which mandates the use of bitmasks. Some of the standard XML processing tools like XSLT lack bitwise operators, making the use of such data impossible when converting to other XML formats.
For a multitude of reasons such as those summarized above, one must question Ecma’s undisclosed reasons for selectively including a mass of required behaviors of implementing applications for “legacy reasons” whilst simultaneously disclaiming any responsibility to specify critical application functionality needed by other developers to fully implement the specification (page 13):
Existing files and applications exercise a broad range of formats and functionality that, if required by the conformance definition, would add an impractical amount of bulk to This Standard and could inadvertently obligate new applications to implement a prohibitive amount of functionality. This issue is caused by the breadth of currently available functionality and is compounded by the existence of legacy formats.
That is a somewhat less-than-compelling argument for withholding specifications required to be implemented by the repeated usage of the mandatory “shall” and “shall not” that appears throughout the specification.”
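To illustrate the bitmask point from the quoted text: a tool without bitwise operators has to pull each flag out arithmetically. Below is a sketch of that workaround in Python, using only integer division and modulo (the counterparts of `div`, `mod` and `floor()` in XPath 1.0). The flag names are hypothetical and not taken from the specification.

```python
def bit_is_set(value, bit):
    """Test one bit using only integer division and modulo.

    This mirrors what a stylesheet without a bitwise AND would have to
    do: shift by dividing by a power of two, then look at the remainder.
    """
    return (value // 2 ** bit) % 2 == 1


def decode_flags(value, names):
    """Map bit positions to names; names[0] belongs to the lowest bit."""
    return [name for bit, name in enumerate(names) if bit_is_set(value, bit)]


# Hypothetical flag names, purely for illustration.
HYPOTHETICAL_FLAGS = ["bold", "italic", "underline", "strike"]
```

For example, `decode_flags(0b0101, HYPOTHETICAL_FLAGS)` picks out the first and third flags. Doing the same for every bit of every mask in a stylesheet is possible but clumsy, which is the kind of friction the passage complains about.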
I was telling people not to listen to *your* astroturfing. Clear enough?
//Hey, if you think I’m a fanatic for stating facts, well that’s your own problem.//
The problem with that claim is, you simply are not stating the facts.
I can’t phrase it any more delicately than that.
What part of me saying “embed” did you not understand?
You’re clearly not getting it. I’m sorry.
//What part of me saying “embed” did you not understand? //
What part of my disguised “what a load of crock” did you not understand? OK, so I tried to be polite, but clearly you are not getting it.
Here is someone who does get it:
http://community.zdnet.co.uk/blog/0,1000000567,10004805o-2000331777…
Thankfully, more and more people every day are getting it.
Don’t use OOXML. If you do, you will be soooooooo sorry later on (and much poorer besides) after you find that you have locked yourself in.
Edited 2007-01-20 06:21
So apparently you didn’t understand any of it. Ok, thanks for playing kiddo.
//So apparently you didn’t understand any of it. Ok, thanks for playing kiddo.//
Nice try, but no cigar.
Here, read this analysis:
http://www.robweir.com/blog/labels/OOXML.html
“According to the schema, these alternate formats may be the main content of the document, or specifically applied to comments, endnotes, footer, footnotes or headers.
Let’s parse the original more closely, starting by defining some terms:
* The term “part” in OOXML refers to the individual items (XML documents, images, scripts, other binary blobs, etc.) contained in the OOXML Zip file, which they call a “package”. So a package is made up of one or more parts.
* HTML should be self-evident. But does this also include the HTML-like output from earlier versions of Word, which wasn’t always well-formed?
* MHTML is what you get when you save a “complete web page” within Internet Explorer. It is a MIME-encoded version of the HTML page plus the embedded images. MHTML is listed as having a status of “Proposed Standard” in the IETF, but it appears to have been held at that state since 1999. (Does anyone know why it never advanced to the Standard status?)
* RTF – Rich Text Format is a proprietary document format occasionally updated by Microsoft. As one wag quipped, “RTF is defined as whatever Microsoft Word exports when it exports to RTF”.
* WordProcessingML – I’ve seen this term used to refer to the XML format of Word 2003 as well as Word 2007. Presumably the 2003 version is intended here?
As you can see, we have several problems here from a specification standpoint.
First, no versions are specified for HTML, MHTML, RTF or WordProcessingML. Are we supposed to support all versions of these? Only some? Does this include WordProcessingML from beta versions of Office 2007 as well?
Second, the specification provides no normative references for MHTML, RTF or “earlier versions of WordProcessingML”.
Third, this is a closed list of formats that seems biased toward Microsoft’s legacy formats. Why not XHTML? Why not DocBook? Why not TeX or troff? Why not ODF? Is there a legitimate reason to restrict the set of supported formats in this way?
Fourth, “plain text” is not a phrase I like to see in a file format specification, since it is undefined. No encoding is mentioned. What is meant here? ASCII, Latin-1, UTF-8, UTF-16, EBCDIC? Some of the above? All of the above? What encodings are included under the name “plain text”?
Reading further we have:
A WordprocessingML consumer shall treat the contents of such legacy text files as if they were formatted using equivalent WordprocessingML, and if that consumer is also a WordprocessingML producer, it shall emit the legacy text in WordprocessingML format.
Three words should raise an eyebrow. The first is the use of the word “equivalent” and the other two are the instances of the word “shall”. “Shall” is spec talk for a requirement, something a conformant application must do. According to Annex H of ISO Directives Part 2, “Rules for the Structure and Drafting of International Standards”, the word “shall” is used, “to indicate requirements strictly to be followed in order to conform to the document and from which no deviation is permitted.”
So, compliant consumers are required to take input from a variety of formats and convert them in the “equivalent” WordProcessingML. Putting aside the question as to what version or versions of HTML are intended, there is nothing here that defines the mapping between any version of HTML and WordProcessingML. So the conversion is application-defined. Considering that this is indicated to be a required feature of a conformant application, I find the lack of specificity here disturbing. How can there ever be interoperable processing of OOXML documents if this is not defined?
Reading the OOXML specification a little further down:
This Standard does not specify how one might create a WordprocessingML package that contains Alternative Format Import relationships and altChunk elements.
However, a conforming producer shall not create a WordprocessingML package that contains Alternative Format Import relationships and elements.
“Shall not” is another one of the special specification words. So, essentially, we’re not allowed, in a conforming application, to create a document with Alternative Format Input Parts, but if we read a document that has one, then we are required to process it, transforming it into equivalent WordProcessingML.
Further, we get this informative note:
Note: The Alternative Format Import machinery provides a one time conversion facility. A producer could have an extension that allows it to generate a package containing these relationships and elements, yet when run in conforming mode, does not do so.
Putting on my tinfoil hat for a moment, I find this all rather fishy. The OOXML specification, at 6,000+ pages has now just sucked in the complexity of one or more versions of HTML, MHTML, RTF and WordProcessingML. It requires that a conformant application understand these formats, but forbids a conformant application from producing them.
This is another example of how you never know what you’re getting when you get an OOXML file. To support OOXML is not to support a single format, or even a single family of formats. To fully support OOXML requires that you support OOXML plus a motley hodgepodge of various other formats, deprecated, abandoned and proprietary. The cost of compatibility with billions of legacy Microsoft documents is that you must support their legacy of years of false starts and restarts in the file format arena.
When you get an OOXML document, you don’t know what is inside. It might use the deprecated VML specification for vector graphics, or it might use DrawingML. It might use the line spacing defined in WordProcessingML, or it might have undefined legacy compatibility overrides for Word 95. It might have all of its content in XML, or it might have it mostly in RTF, HTML, MHTML, or “plain text”. Or it may have any mix of the above. Even the most basic application that reads OOXML will also need to be conversant in RTF, HTML and MHTML.”
There you go. Binary blobs in OOXML documents, without there necessarily having been any graphics or multimedia content embedded.
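To make the altChunk mechanism from the quoted analysis concrete, here is a small Python sketch that checks whether a package’s main document pulls in content from non-WordprocessingML parts. It assumes the element and relationship names from the published spec as I remember them (`w:altChunk` with an `r:id` attribute, resolved through `word/_rels/document.xml.rels`). The sample package is a made-up minimal one built in memory.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces assumed to match the published specification.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def build_sample_with_alt_chunk():
    """Build a minimal, made-up package whose body is one altChunk."""
    document_xml = (
        f'<w:document xmlns:w="{W_NS}" xmlns:r="{R_NS}"><w:body>'
        '<w:altChunk r:id="rId1"/>'
        "</w:body></w:document>"
    )
    rels_xml = (
        f'<Relationships xmlns="{PKG_REL_NS}">'
        f'<Relationship Id="rId1" Type="{R_NS}/aFChunk" Target="legacy.rtf"/>'
        "</Relationships>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", document_xml)
        package.writestr("word/_rels/document.xml.rels", rels_xml)
        package.writestr("word/legacy.rtf", r"{\rtf1 legacy text}")
    return buffer.getvalue()


def alt_chunk_targets(package_bytes):
    """Return the targets that word/document.xml imports via altChunk."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        document = ET.fromstring(package.read("word/document.xml"))
        rels = ET.fromstring(package.read("word/_rels/document.xml.rels"))
    # Relationship Id -> Target, as declared in the .rels part.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter(f"{{{PKG_REL_NS}}}Relationship")
    }
    return [
        targets.get(chunk.get(f"{{{R_NS}}}id"))
        for chunk in document.iter(f"{{{W_NS}}}altChunk")
    ]
```

A reader that gets a non-empty list back from `alt_chunk_targets` would, going by the quoted text, be expected to understand whatever format those targets are in.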
So it turns out that you are 100% dead straight out wrong. Utterly wrong. Wrong in every respect. As wrong as wrong can be.
Now, finally, do YOU understand?
Thumbs up to this one for an interesting implementation, and smart use of XSLT. What would be really nice is if they could build the necessary OO libs into a PHP extension – imagine seeing some of the millions of PHP frameworks/CMSs/Blogs/etc start using something like this.
Does it handle the Pages native format?
All these converters are useless while people keep doing stupid things:
Aunt Joan Q. Average wants to send a video clip to Joe Q. Sixpack. She opens “Word”, copies and pastes the video, and clicks on “send via e-mail”, which produces some MICROS~1 memory garbage stuff (not RFC-conformant).
Student Timmybob Dumb is writing a paper for his exam. He uses the bold, italics, and underline functions along with font faces and font sizes to structure his document. Of course, it does not contain a title page or a table of contents. But most of the text is in “Comic MS” because it looks funny. Imagine his joy if he ever wants to restructure his text! Wow, how big the files get, the contents must be good! Furthermore, he draws his formulas with “Paint” and needs a DVD to take his document to his professor.
Chief analyst Chester R. T. Fullbrain got the mentioned video clip from Aunt Joan. He includes the DOC file in a PPT presentation, but because he’s clever, he’s making a RAR archive out of it and then he embeds the RAR archive in an “Excel” file which he sends to his deskmate.
Don’t tell me anything, I’ve seen it all. :)
Most people use “Word” as a better typewriter, no matter if they’re employed at the ministry of finance or if they’re just doing homework for school. They don’t know of (and don’t care about) simple functionalities like document templates and paragraph formatting. And why? Because they have never heard about them. They got a pirated copy of some older “Word” version (“Word ’97” or “Word 2000”) and are happy with it. (At least, that’s a common fact in Germany.)
So you better use catdoc and typeset it properly with LaTeX. :)
BTW, as was mentioned before, working converters are a good way to help people migrate to open standard formats. It’s not very interesting for home PCs because the documents created there do not need to exist for long; but in corporate settings it might be the right approach. Because if you have migrated to a standard, you can go anywhere with your documents. That’s what companies should be interested in, because the IT infrastructure as they know it will not be available forever, so they should decide now where they want to be in the future – with their data.