How can I count the number of lines of actual comments?


I have a bunch of MATLAB script/function files that I and the rest of my team need to work on. We have little to no idea what most of the files do, and little to no idea which ones belong together and which ones are separate. We do know we have a total of 36,000 lines. I'd like to know how many of those lines are comments.

Easy, right? Just count how many of them start with the comment start character %.

Well, no. I don't want to count blocks of commented-out code as "comments", since they don't actually tell me anything. And I'd prefer not to count the "empty" lines used to turn one comment line into a "headline", like so:

% %%%%%%%%
% headline
% %%%%%%%%

So how can I get a sensible estimate of how many lines of actual informative comments I have? Is there an easy way to distinguish natural language (possibly containing code snippets) from pure code?


Yes, I know code should be self-explanatory as far as is practical, but the code we have inherited clearly is not. Yes, I know we should probably refactor this mess. The purpose of figuring out how many comments we have is to highlight the technical debt we have here, so that we can allocate resources to this refactoring.

There are 3 best solutions below

5
On

We can use the semi-documented mtree utility for this.

Let's take for example the .m file that contains the definition of the mtree class itself.

dbtype mtree yields (this is just the beginning):

1     classdef mtree
2     %MTREE  Create and manipulate M parse trees
3     %   This is an experimental program whose behavior and interface is likely
4     %   to change in the future.
5     
6     % Copyright 2006-2016 The MathWorks, Inc.
7     
8         properties (SetAccess='protected', GetAccess='protected', Hidden)
9             T    % parse tree array

Now, if we invoke the mtree utility on itself and show the result as text,

tree = mtree('mtree.m','-file');
tree.dumptree()

here's what we get (again, just the beginning):

  1  *:  CLASSDEF:   1/01 
  3     *Cexpr:  ID:   1/10  (mtree)
  4     *Body:  PROPERTIES:   8/05 
  5        *Attr:  ATTRIBUTES:   8/16 
  6           *Arg:  ATTR:   8/26 
  7              *Left:  ID:   8/17  (SetAccess)
  8              *Right:  CHARVECTOR:   8/27  ('protected')
  9           >Next:  ATTR:   8/49 
 10              *Left:  ID:   8/40  (GetAccess)
 11              *Right:  CHARVECTOR:   8/50  ('protected')
 12           >Next:  ATTR:   8/63 
 13              *Left:  ID:   8/63  (Hidden)
 14        *Body:  EQUALS:   9/09 

As you can see from the above, comment and empty lines (2-7) do not appear on the left side of the "fractions" in the output. So if we find a way to get the "numerators", we'll get the numbers of the lines that contain actual code.

We're in luck, since there exists a method that gives us these numerators - lineno! So if we call it and apply unique to the output, we'll get exactly one copy of each line:

uLines = unique(tree.lineno);
nCodeLines = numel(uLines);

This yields a value of 269 for nCodeLines in R2018b. If you're willing to assume that the last line in a file is always a line of code (and not a comment or a blank), you can just subtract nCodeLines from the last element of uLines to get the number of non-code lines, that is, comments and blanks together (121 in this case). Otherwise, use some other technique to count the total number of lines.

All that's left is to write this as a function and feed the folder of .m files to it :)
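A minimal sketch of such a function, assuming the mtree calls shown above behave the same way on every file in the folder. The function name countCodeLines and the fields of the returned struct are made up for illustration. The total line count is taken from the file text itself, so the sketch does not rely on the last line being code.

```matlab
function stats = countCodeLines(folder)
% Sketch: per-file counts of code lines (lines that carry an mtree node)
% and non-code lines (comments and blanks) for every .m file in a folder.
% countCodeLines and its output fields are hypothetical names.
files = dir(fullfile(folder, '*.m'));
stats = struct('name', {}, 'total', {}, 'code', {}, 'nonCode', {});
for k = 1:numel(files)
    fname = fullfile(files(k).folder, files(k).name);

    % Lines with at least one parse-tree node, as in the answer above.
    tree = mtree(fname, '-file');
    nCode = numel(unique(tree.lineno));

    % Total number of lines, ignoring a single trailing newline.
    txt = fileread(fname);
    lines = regexp(txt, '\r\n|\r|\n', 'split');
    if ~isempty(lines) && isempty(lines{end})
        lines(end) = [];
    end
    nTotal = numel(lines);

    stats(end+1) = struct('name', files(k).name, 'total', nTotal, ...
        'code', nCode, 'nonCode', nTotal - nCode); %#ok<AGROW>
end
end
```

Calling stats = countCodeLines(pwd) and then sum([stats.nonCode]) would give the folder-wide number of comment and blank lines. Separating comments from blanks still needs a further pass over the text.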

0
On

It is easy to drop comments that only serve as separators by excluding every comment line that doesn't contain any letters (a-z or A-Z). So %a is an "informative comment", while %----- isn't.
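The letter test can be sketched in a few lines. The function name hasLetters is hypothetical, and the check covers only ASCII letters, so comments written entirely in another script would be misclassified.

```matlab
function tf = hasLetters(commentLines)
% Sketch: true for each comment line that contains at least one ASCII letter.
% commentLines may be a string array or a cell array of character vectors.
% hasLetters is a hypothetical name.
c = cellstr(commentLines);
firstLetter = regexp(c, '[a-zA-Z]', 'once');  % one cell per input line
tf = ~cellfun(@isempty, firstLetter);
end
```

For example, hasLetters(["% %%%%%%%%", "% headline", "%-----", "%a"]) marks only the second and fourth lines as informative.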

Now, to filter out code, I believe the best way is to treat %text text as a comment and everything else as code: a comment is any line with a space between two pieces of text. A "piece of text" could be anything that contains a letter, or it could be restricted to tokens made up only of letters and punctuation (under the first definition a=5 is a single piece of text; under the second it is a single piece of non-text). You should exclude reserved code words as well.

This obviously won't work all the time, since a single-word comment can also be informative: you might have a comment saying %randomize. However, consider this: randomize could be a comment telling you that the code below does randomization (most likely), or it could be a commented-out call to a function that does randomization without taking any parameters or returning any output (say, by abusing reflection to actually do something). There is no way to tell these two apart by parsing alone; the parser would have to run the code line by line and check whether each line does any work or produces an error.

Note that the code below is only a sketch and isn't optimal; adjust the test for what counts as a "word" to suit your needs.

function isC = parseComment(commentText)
% commentText is the comment with the leading % already stripped.
splitText = split(strtrim(string(commentText))); % split at whitespace
splitText = splitText(strlength(splitText) > 0);
isValidText = false(numel(splitText), 1);
if numel(splitText) < 2
   isC = false; % a single word could just as well be a function call
   return
end
for i = 1:numel(splitText)
   word = char(splitText(i));
   % a "word" is valid non-code text if it contains a letter...
   if ~isempty(regexp(word, '[a-zA-Z]', 'once'))
      if ~iskeyword(word) % ...and is not a reserved word (if, for, while and so on)
         isValidText(i) = true;
      end
   end
end
% all parts are checked; look for 2 adjacent "valid text" parts
isC = any(isValidText(1:end-1) & isValidText(2:end));
end
0
On

Answering my own question here, since I ended up going in a different direction than either of the answers.


I needed an estimate, not necessarily an exact number. I would have been fine with an automated system even if it did misclassify some lines. But I couldn't find a simple enough way to distinguish code from text, so I went with a more manual route.

I just grepped all comment lines and then scrambled the line order of the output, so that I could look at the last 50 or so lines on my screen and manually count the ratio of useful comments to commented-out code. Multiplying that ratio by the total number of comment lines gives me a rough estimate of the number of useful lines of comments.

The conclusion is that we have about 36000 lines of almost entirely undocumented code to play with. Yay.

To scramble the line order, I used a shuffle.bat file that I found in the question "How to randomly rearrange lines in a text file using a batch file".

So I ended up with type *.m | grep % | shuffle.bat

And that was good enough for me.
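For anyone who would rather stay inside MATLAB, the same sampling idea can be sketched as below. This is not what I ran. The function name sampleCommentLines is made up, and unlike grep %, which matches any line containing a %, it only keeps lines whose first non-blank character is %.

```matlab
function [sample, nTotal] = sampleCommentLines(folder, n)
% Sketch: draw up to n random comment lines from all .m files in a folder.
% nTotal is the total number of comment lines found.
% sampleCommentLines is a hypothetical name.
files = dir(fullfile(folder, '*.m'));
allLines = strings(0, 1);
for k = 1:numel(files)
    txt = fileread(fullfile(files(k).folder, files(k).name));
    lines = string(regexp(txt, '\r\n|\r|\n', 'split'))';
    isComment = startsWith(strtrim(lines), "%");
    allLines = [allLines; lines(isComment)]; %#ok<AGROW>
end
nTotal = numel(allLines);
idx = randperm(nTotal, min(n, nTotal));  % sample without replacement
sample = allLines(idx);
end
```

After reading through the sample and counting the useful lines by hand (call that count nUseful), the estimate is nUseful / numel(sample) * nTotal.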


The suggestion by Dev-il to use mtree would have been super useful if mtree could be made to output the number of lines of parseable, runnable code. Then I could have grepped out the comment lines, stripped the % at the start, and used mtree to count what is runnable code and what is most likely text. Unfortunately, mtree will parse anything, and doesn't really distinguish between stuff that ends up making sense as code and stuff that doesn't.