On typos and misspellings

This week I’m in a release mood, so I’m releasing several projects I’m involved with. If you missed the first two, check out dietsplash 0.3 and genslide 0.3 (though the announcement was in Portuguese).

After working on several projects I’ve noticed that most of them contain typos and misspellings. Even if these do not directly affect the quality of the source (unless the misspellings are in documentation), a comment is left in the code for a reason: we want whoever reads that code to stop and read it. Correct spelling is particularly helpful when contributors come from several parts of the world and may not have English as their mother tongue (as I don’t). This way we can be more confident that the correct message is conveyed through code, comments and documentation.

Thinking a bit about this, I wrote some bash and awk scripts to fix misspellings based on the list of common misspellings available on Wikipedia. I’ve successfully sent patches to projects like the Linux kernel, ConnMan, oFono and EFL. After some of them were accepted and I decided to run the scripts again, I noticed how slow they were (if you are curious about what they did, search the oFono mailing list, where I explain the scripts). So I started a new, very short project: codespell. Measured against the Linux kernel tree, it runs circa 20x faster than the previous scripts. Its current version is 1.0-rc1 and I’d like to have some more testers before I release the final 1.0.

Codespell is designed to fix misspellings in source code, but it can be applied to any type of text file. When possible, codespell will automatically fix the misspelling. Otherwise it will suggest possible changes. For example, running against the Linux kernel tree, it gives me lines like the ones below:

drivers/target/target_core_transport.c:2528: competion ==> competition, completion

drivers/edac/cpc925_edac.c:186: MEAR ==> wear, mere, mare

WARNING: Decoding file drivers/hid/hid-pl.c

WARNING: using encoding=utf-8 failed.

drivers/net/niu.c:3276: clas ==> class | disabled because of name clash in c++

FIXED: ../kernel/drivers/scsi/aacraid/aacraid.h

FIXED: ../kernel/drivers/scsi/lpfc/lpfc_sli.c

(This is all in beautiful colored lines! Test it to see the true output)

The first two lines illustrate changes that cannot be made automatically because the misspelling is a common one for more than one word. In that case codespell gives you the file and line where it occurs, along with the candidate words.

The WARNINGs are related to the encoding of the file. Codespell defaults to parsing files as UTF-8, which handles ASCII as well. If it fails to decode any line, it tries the next available encoding, i.e. ISO-8859-1. Using these two encodings I have successfully run codespell on all the projects I care about.
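To make the fallback concrete, here is a minimal sketch of the idea in Python. It is not codespell’s actual code: the function name `decode_with_fallback` and its signature are made up for illustration, and it works on a whole buffer rather than line by line.

```python
def decode_with_fallback(raw, encodings=("utf-8", "iso-8859-1")):
    """Decode bytes with the first encoding that accepts them.

    Hypothetical helper for illustration only; it is not part of codespell.
    Returns the decoded text together with the encoding that worked.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # This encoding cannot represent the bytes; try the next one.
            continue
    raise ValueError("none of the given encodings could decode the data")


# Pure ASCII is valid UTF-8, so the first encoding is enough.
text, used = decode_with_fallback(b"completion\n")

# A lone 0xE9 byte is invalid UTF-8 but is a valid ISO-8859-1 character.
text_latin, used_latin = decode_with_fallback(b"caf\xe9\n")
```

Since ISO-8859-1 maps every byte to a character, it never fails to decode, which is why it makes sense as the last encoding in the list.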

Codespell allows some changes to be disabled. This is shown by the “clas ==> class” fix, which is not always safe to apply because of the name clash with C++ code.
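The three outcomes seen in the output (automatic fix, suggestion only, disabled fix) can be sketched roughly as follows. This is an illustration under assumptions, not codespell’s implementation: the `MISSPELLINGS` table layout and the `classify` and `fix_line` names are hypothetical, and the “compatability” entry is included only as an example of a single-candidate fix.

```python
import re

# Hypothetical table: misspelling -> (candidate corrections, reason if disabled).
MISSPELLINGS = {
    "competion": (["competition", "completion"], None),
    "clas": (["class"], "disabled because of name clash in c++"),
    "compatability": (["compatibility"], None),
}

WORD = re.compile(r"[A-Za-z]+")


def classify(word):
    """Return (action, candidates, reason), or None if the word is not listed."""
    entry = MISSPELLINGS.get(word.lower())
    if entry is None:
        return None
    candidates, reason = entry
    if reason is not None:
        return ("disabled", candidates, reason)
    if len(candidates) > 1:
        # Ambiguous: a human has to pick the right word.
        return ("suggest", candidates, None)
    return ("fix", candidates, None)


def fix_line(line):
    """Apply only the unambiguous, enabled fixes to a line of text."""
    def replace(match):
        result = classify(match.group(0))
        if result is not None and result[0] == "fix":
            # Note: this simple sketch does not preserve the original case.
            return result[1][0]
        return match.group(0)
    return WORD.sub(replace, line)


fixed = fix_line("wait for competion of compatability clas")
```

In this sketch only “compatability” is rewritten; “competion” and “clas” are left untouched and would merely be reported.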

The lines prefixed with “FIXED” show the files that were automatically fixed. On Linus’ current master branch, this resulted in:

2545 files changed, 5007 insertions(+), 5007 deletions(-)

These were the automatic fixes, which may contain some false positives. The funniest one is the one found in Documentation/DocBook/kernel-hacking.tmpl:

/*

* Sun people can’t spell worth damn. “compatability” indeed.

* At least we *know* we can’t spell, and use a spell-checker.

*/

As can be seen from the numbers above, this is not really true ;-).

So, there it is: **codespell 1.0-rc1**. Get it. Test it. Report problems. Tell me about projects that were successfully patched.
