Sloppy

This tool scans all source code in a folder and generates a report on how “sloppy” the code is. Sloppiness is a measure of repetitive code style: under-abstraction (copy/pasting) and over-abstraction (pointless complexity).

It does its analysis at the token level: it strips away all whitespace and comments, and splits the code into its smallest indivisible syntactic units, such as "if" "(" "count" "=" "100" ")", etc.

It then finds matches across all code. A match is either a verbatim copy of the same sequence of tokens, or a copy with minor differences of up to 2 tokens at a time. E.g. two blocks of code where in one “-x” has been changed into “y” are still considered a match.
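To make the idea of a match with skips concrete, here is a minimal sketch in Python. It is not Sloppy's actual algorithm, just one plausible way to walk two token sequences in parallel and jump over up to 2 differing tokens on either side. The function name extend_match, the max_skip parameter, and the way gaps are reported are all assumptions made for illustration.

```python
from itertools import product

def extend_match(a, i, b, j, max_skip=2):
    """Walk two token lists in parallel, starting at a[i] and b[j].

    Returns (matched, gaps): the number of identical tokens paired up, and
    a list of (skipped_in_a, skipped_in_b) for every place where the walk
    had to jump over differing tokens. Hypothetical helper, not Sloppy's code.
    """
    # Candidate jumps, smallest total first, so the cheapest resync wins.
    jumps = sorted(
        (d for d in product(range(max_skip + 1), repeat=2) if d != (0, 0)),
        key=sum,
    )
    matched, gaps = 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            matched += 1
            i += 1
            j += 1
            continue
        for da, db in jumps:
            if i + da < len(a) and j + db < len(b) and a[i + da] == b[j + db]:
                gaps.append((da, db))
                i += da
                j += db
                break
        else:
            break  # no way to resync within max_skip tokens: the match ends
    return matched, gaps

# "-x" in one block versus "y" in the other still lines up:
left = ["r", "=", "a", "+", "-", "x", ";"]
right = ["r", "=", "a", "+", "y", ";"]
print(extend_match(left, 0, right, 0))  # (5, [(2, 1)])
```

In this example five tokens pair up, with one gap where two tokens on the left ("-" and "x") stand opposite one token on the right ("y").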

It computes a sloppiness factor for each match, which is a magic formula that takes into account:

  • The length of the match. This is by far the biggest contributor, as the score goes up exponentially with it.
  • The number of tokens skipped while producing the match. More skips mean more differences, which makes the match harder to refactor than a verbatim one.
  • The distribution of those skips.
  • The distance between the two matches:
    • matches in separate files are considered worse than matches in the same file.
    • matches very close together are considered less harmful.
    • matches on adjacent lines are considered the least harmful, i.e. they make the redundancy very visually obvious.
  • Whether the tokens in the match are part of a function body or a top-level declaration; the former contributes more strongly to sloppiness. Sloppy assumes a function body is the first { } pair not preceded by any of the keywords class, struct, interface, namespace or enum.
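The function body heuristic in the last bullet can be sketched roughly as follows. This is only one reading of the rule, not the tool's real code: it assumes "preceded by" means that one of the keywords appears between the previous ; { or } and the opening brace. The names DECL_KEYWORDS and mark_function_bodies are hypothetical.

```python
DECL_KEYWORDS = {"class", "struct", "interface", "namespace", "enum"}

def mark_function_bodies(tokens):
    """Return one bool per token: True if it sits inside a function body.

    Assumed interpretation: a '{' opens a function body unless one of
    DECL_KEYWORDS appears in the declaration head leading up to it.
    Everything nested inside a function body also counts as body.
    """
    flags = []
    stack = []  # one entry per open brace; True means "function body"
    head = []   # tokens seen since the last ';', '{' or '}'
    for tok in tokens:
        if tok == "{":
            inside = any(stack)
            flags.append(inside)  # the brace itself belongs to the outer scope
            stack.append(inside or not DECL_KEYWORDS.intersection(head))
            head = []
        elif tok == "}":
            if stack:
                stack.pop()
            flags.append(any(stack))
            head = []
        else:
            flags.append(any(stack))
            if tok == ";":
                head = []
            else:
                head.append(tok)
    return flags

tokens = "class C { int f ( ) { return 1 ; } } ;".split()
flags = mark_function_bodies(tokens)
print([t for t, f in zip(tokens, flags) if f])  # ['return', '1', ';']
```

A simple rule like this will also treat things such as brace initializers as function bodies, which is the kind of imprecision a generic, language-agnostic heuristic has to accept.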

Sloppy then outputs the 10 highest-scoring matches, indicating the files and lines where they occur, along with a sample of the offending code. These are the lowest-hanging fruit you can focus on to improve the code. Simply keep re-running the tool after changes until you can’t find anything left to improve.

At the end, a “sloppiness per token” ratio is output, which you can use as a single number representing current code quality. The number does not mean much in and of itself, though I did tweak it such that 1 means well factored code, and higher is worse, but your mileage may vary. It is mostly useful to see how one code base improves over time, or to compare code bases of different authors implementing the exact same feature set, such as with students.
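As a quick check of how these numbers relate, the arithmetic can be reproduced with the totals from the sample output shown under Running the Tool. The variable names are illustrative, and the assumption that each match's percentage is simply its sloppiness divided by the total is inferred from the sample figures rather than stated by the tool.

```python
# Figures copied from the sample output.
total_tokens = 41475
total_sloppiness = 34232
worst_match = 1391  # sloppiness of the top-ranked match

ratio = total_sloppiness / total_tokens
share = 100 * worst_match / total_sloppiness

print(f"sloppiness / token ratio: {ratio:.2f}")  # 0.83
print(f"worst match: {share:.2f}% of total")     # 4.06% of total
```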

Think of it as a game: the 10 matches are the available quests, once you have completed them, hand them in (re-run), and be rewarded with a better score. New quests may become available then. Beat your high score, and those of others! :)

Supported Languages

So far the tool has been geared towards languages with C-style syntax: it supports C/C++ (extensions c/h/cpp/hpp/cxx/cc/inl), and other C-like languages (cs/java/php/js/as/m/fx) should work great (only C# tested so far). I also added support for reading py/rb/pl/lua extensions, but I should add support for other comment styles before these become useful. You can force the program to scan for a custom extension with the -c option, e.g. -cbf.

It tokenizes in a generic fashion, assuming it is reading code which the compiler for that language already approves of:

  • Single // and multi line /* */ comments (not nested)
  • Strings with " or ', and with \" or \' escape characters & unicode prefix.
  • Numbers, supporting C style ints, floats, hex, octal, suffixes, scientific notation etc.
  • Identifiers made out of alphanumeric characters and _
  • Punctuation symbols from the set ()[]{},;
  • Operators from the remaining symbols. It assumes that if a run of these symbols ends in = or >, or is made up of all the same characters, it is a compound operator; otherwise they are individual operators. This covers the operators in pretty much all C-like languages and beyond.
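The rules above can be sketched as a small generic tokenizer. This is an illustrative approximation in Python, not the tool's own source: the regular expressions (in particular the loose number pattern and the L prefix standing in for the unicode prefix) and the names TOKEN_RE, split_operators and tokenize are assumptions.

```python
import re

TOKEN_RE = re.compile(r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)                       # // and /* */ (not nested)
  | (?P<string>L?"(?:\\.|[^"\\])*"|L?'(?:\\.|[^'\\])*')   # strings with escapes
  | (?P<number>\.?\d(?:[eEpP][+-]|[\w.])*)                # ints, floats, hex, suffixes
  | (?P<ident>[A-Za-z_]\w*)
  | (?P<punct>[()\[\]{},;])
  | (?P<space>\s+)
  | (?P<ops>(?:(?!//|/\*)[^\w\s()\[\]{},;"'])+)           # run of operator symbols
""", re.VERBOSE | re.DOTALL)

def split_operators(run):
    """Keep a run of symbols whole if it ends in = or > or repeats a single
    character; otherwise treat every symbol as its own operator."""
    if run[-1] in "=>" or len(set(run)) == 1:
        return [run]
    return list(run)

def tokenize(source):
    """Return the list of tokens in source, dropping whitespace and comments."""
    tokens = []
    for m in TOKEN_RE.finditer(source):
        kind = m.lastgroup
        if kind in ("comment", "space"):
            continue
        if kind == "ops":
            tokens.extend(split_operators(m.group()))
        else:
            tokens.append(m.group())
    return tokens

print(tokenize("if (count == 100) x=-y; // done"))
# ['if', '(', 'count', '==', '100', ')', 'x', '=', '-', 'y', ';']
```

Note how "==" stays a single token because it ends in =, while the run "=-" is split into "=" and "-" because it neither ends in = or > nor repeats one character.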

Running the Tool

It comes as a single file: sloppy.exe, which you can either place in the root of your project, or you can place it somewhere else and put a shortcut to it in your project folder (with no path).

DOWNLOAD (last release: 7 nov 2010)

Contains a win32 exe, + source code that compiles on *nix (thanks Veselin Georgiev!).

By default it scans the current directory. Alternatively, you can pass any number of folders as command-line arguments, and it will scan them all. Use the -e command-line option to exclude any files or folders that match a substring, e.g. -eOLD (case sensitive).

By default it scans recursively; pass the option -r to turn this off. The -o option lets you change the number of worst items displayed from the default of 10, e.g. -o20.

When you run the program, you’ll be presented with some simple console output of matches found. Here is a sample output of the program (cut down to 5 matches):

worst offenders:

44 tokens & 3 skips (1391 sloppiness, 4.06% of total) starting at:
=> C:\W\treesheets\src\grid.h:270
=> C:\W\treesheets\src\cell.h:383
FindLink ( Selection & s , Cell * link , Cell * best , bool & lastthis , bool & stylematch )
{ foreachcell ( c ) best = c -> FindLink ( s , link , best , lastthis , stylematch ) ;

26 tokens & 0 skips (862 sloppiness, 2.52% of total) starting at:
=> C:\W\treesheets\src\document.h:493
=> C:\W\treesheets\src\document.h:469
p = drawroot -> parent ; p ; p = p -> parent ) if ( p -> text . t . Len ( ) )

30 tokens & 1 skips (780 sloppiness, 2.28% of total) starting at:
=> C:\W\treesheets\src\mycanvas.h:80
=> C:\W\treesheets\src\mycanvas.h:34
wxMouseEvent & me ) { wxClientDC dc ( this ) ; UpdateHover ( me . GetX ( ) , me . GetY ( ) , dc ) ; Status (

36 tokens & 1 skips (718 sloppiness, 2.10% of total) starting at:
=> C:\W\treesheets\src\grid.h:652
=> C:\W\treesheets\src\grid.h:648
pc = p -> C ( x + s . x , y + s . y ) ; if ( pc -> HasContent ( ) ) { if ( y ) p  -> InsertCells (

34 tokens & 2 skips (508 sloppiness, 1.49% of total) starting at:
=> C:\W\treesheets\src\grid.h:362
=> C:\W\treesheets\src\grid.h:357
tl = C ( s . x , s . y ) ; return wxRect ( tl -> GetX ( doc ) , tl -> GetY ( doc ) , 0 , tl ->

summary:

total tokens: 41475
total sloppiness: 34232
sloppiness / token ratio: 0.83
press enter to continue...