How to clean up "strange" Microsoft Word characters with a JEdit macro
Today we're going to see how to "clean" documents produced by Microsoft Word, the popular word processor.
The problem
Word documents often contain "strange" characters that cause problems for other applications. A typical example is writing articles for a blog or a website. Many people are tempted to write them on their PC using Word, applying formatting and styles (a very bad habit, I would say...), and then copy and paste the text into their CMS editor. This workflow is flawed in many ways: besides creating problems of its own (inconsistent formatting, invalid HTML code, non-standard tags, etc.), it highlights a lesser-known problem of this program (and others): the presence of unusual characters, such as various types of hyphens and dashes (the so-called "Em Dash"), suspension dots condensed into a single character, etc.
Here is a sample of these fuc... ehm, annoying characters, chosen among the most common:
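- – (en dash, U+2013)
- — (em dash, U+2014)
- ‘ (left single quotation mark, U+2018)
- ’ (right single quotation mark, U+2019)
- “ (left double quotation mark, U+201C)
- ” (right double quotation mark, U+201D)
- • (bullet, U+2022)
- … (horizontal ellipsis, U+2026)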
The problem is not their validity or their visual appearance: they are in fact beautiful and professional-looking. Unfortunately, they are missing from many character sets (plain ASCII and ISO-8859-1, for instance) and from some fonts, and they can get mangled when text is converted between different encodings. An annoyance, not a tragedy for sure.
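To make the encoding issue concrete, here is a minimal Java sketch that checks whether the eight characters listed above fit into a few standard charsets, and what happens when they are encoded and decoded again. The class and method names (`WordCharEncodingCheck`, `fitsIn`, `roundTrip`) are made up for this illustration and are not part of the macro; only standard JDK calls are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WordCharEncodingCheck {

    // The eight characters listed above, written as Unicode escapes so the
    // source file itself stays plain ASCII.
    static final String WORD_CHARS =
        "\u2013\u2014\u2018\u2019\u201C\u201D\u2022\u2026";

    // True if every character of the text can be represented in the charset.
    static boolean fitsIn(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    // Encodes the text and decodes it again, to show what survives the trip.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // for every character it cannot map.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        Charset[] charsets = {
            StandardCharsets.US_ASCII,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8
        };
        for (Charset cs : charsets) {
            boolean intact = roundTrip(WORD_CHARS, cs).equals(WORD_CHARS);
            System.out.println(cs.name()
                + ": fits=" + fitsIn(WORD_CHARS, cs)
                + ", round trip intact=" + intact);
        }
    }
}
```

With US-ASCII and ISO-8859-1 the check reports that the characters do not fit, and the round trip turns each of them into a question mark; with UTF-8 they survive untouched.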
A macro to rule them all...
To solve this problem, at least for simple text files, I decided to create a JEdit macro. This incredible editor can literally be "programmed" using BeanShell (a scripting language with a "relaxed" Java syntax). I have already talked about JEdit in some popular articles.
Now, follow these steps:
1) Create or download my macro "CleanWordChars.bsh" and place it in the "macros" directory under the JEdit installation directory (important: it must be saved with the .bsh extension!). Here's the source:
/*
 * CleanWordChars.bsh - a BeanShell macro that cleans weird characters produced
 * by Microsoft Word, applying some substitutions.
 *
 * Copyright (C) 2009 De Franciscis Dimitri, http://www.megadix.it/
 *
 */

import java.util.regex.*;

megadix_cleanWordChars() {
  try {
    buffer.beginCompoundEdit();

    String source = textArea.getText();
    Pattern p = Pattern.compile(
      "([\u2013\u2014\u2018\u2019\u201C\u201D\u2022\u2026]){1,1}",
      Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(source);
    sb = new StringBuffer();

    while(m.find()) {
      g = m.group(1);
      if (g.equals("\u2013") || g.equals("\u2014")) {
        m.appendReplacement(sb, "-");
      } else if (g.equals("\u2018") || g.equals("\u2019")) {
        m.appendReplacement(sb, "'");
      } else if (g.equals("\u201C") || g.equals("\u201D")) {
        m.appendReplacement(sb, "\"");
      } else if (g.equals("\u2022")) {
        m.appendReplacement(sb, "*");
      } else if (g.equals("\u2026")) {
        m.appendReplacement(sb, "...");
      }
    }
    m.appendTail(sb);

    textArea.setText(sb.toString());
  } finally {
    buffer.endCompoundEdit();
  }
}

megadix_cleanWordChars();
2) Now click on Macros / Rescan Macros: you'll see the new macro in the list, so you can start using it immediately.
Now you can experiment freely:
- take a Word file with those damned characters;
- copy the text into a new JEdit file;
- execute the macro!
Here's the magic: the "strange" characters have been replaced with simpler ones: a plain "minus" for the dashes, straight quotes for the curly ones, an asterisk for the bullet, and three separate dots instead of the single "suspension dots" character.
Attached you can find a sample file for your experiments, and macro source code.
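If you want to try the same substitutions outside JEdit (for example on a string in your own program), the idea can be sketched in plain Java. This is only an illustration under some assumptions: the class name `WordCharCleaner` and its `clean` method are hypothetical, and the regular expression used by the macro is swapped for a simple lookup table, which should give the same result for these eight characters.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class WordCharCleaner {

    // Same substitutions as the macro: Word character -> plain ASCII text.
    static final Map<Character, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put('\u2013', "-");    // en dash
        REPLACEMENTS.put('\u2014', "-");    // em dash
        REPLACEMENTS.put('\u2018', "'");    // left single quotation mark
        REPLACEMENTS.put('\u2019', "'");    // right single quotation mark
        REPLACEMENTS.put('\u201C', "\"");   // left double quotation mark
        REPLACEMENTS.put('\u201D', "\"");   // right double quotation mark
        REPLACEMENTS.put('\u2022', "*");    // bullet
        REPLACEMENTS.put('\u2026', "...");  // horizontal ellipsis
    }

    // Returns a copy of the text with every listed character replaced;
    // all other characters are copied unchanged.
    static String clean(String source) {
        StringBuilder sb = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            String replacement = REPLACEMENTS.get(c);
            if (replacement != null) {
                sb.append(replacement);
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "\u201CHello\u201D \u2013 it\u2019s a test\u2026";
        // Prints: "Hello" - it's a test...
        System.out.println(clean(sample));
    }
}
```

Keeping the substitutions in a table makes it easy to add other characters later without touching the loop.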

