Count Lines of Code, Comments and Whitespace in Source Files and Archives

Counts blank lines, comment lines, and physical lines of source code in source files, directory trees, and archives. An R wrapper for the Perl cloc utility (https://github.com/AlDanial/cloc) by @AlDanial.

How it works

cloc's method of operation resembles SLOCCount's: First, create a list of files to consider. Next, attempt to determine whether or not found files contain recognized computer language source code. Finally, for files identified as source files, invoke language-specific routines to count the number of source lines.

A more detailed description:

If the input file is an archive (such as a .tar.gz or .zip file), create a temporary directory and expand the archive there using a system call to an appropriate underlying utility (tar, bzip2, unzip, etc.), then add this temporary directory as one of the inputs. (This works more reliably on Unix than on Windows.)
Use Perl's File::Find to recursively descend the input directories and make a list of candidate file names. Ignore binary and zero-sized files.
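The directory walk described above can be sketched in Python. This is an illustration only, not cloc's Perl code: `os.walk` stands in for File::Find, and the names `candidate_files` and `looks_binary` are hypothetical. The binary check (a NUL byte in the first block of the file) is an assumed heuristic, not necessarily the test cloc applies.

```python
import os


def looks_binary(path, sample_size=8192):
    """Assumed heuristic: treat a file as binary if its first block contains a NUL byte."""
    with open(path, "rb") as handle:
        return b"\0" in handle.read(sample_size)


def candidate_files(roots):
    """Recursively collect non-empty, non-binary regular files under each root directory."""
    found = []
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if not os.path.isfile(path):
                    continue  # skip broken symlinks and other non-regular entries
                if os.path.getsize(path) == 0:
                    continue  # zero-sized files are ignored
                if looks_binary(path):
                    continue  # binary files are ignored
                found.append(path)
    return sorted(found)
```

Calling `candidate_files(["src", "lib"])` would return a sorted list of paths to pass to the later steps.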
Make sure the files in the candidate list have unique contents: first compare file sizes, then, for files of the same size, compare MD5 hashes of the file contents computed with Perl's Digest::MD5. For each set of identical files, keep only the first copy, as determined by a lexical sort of the file names, and remove the rest. The removed files are not included in the report.
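A minimal Python sketch of this two-stage duplicate check follows. It is not cloc's implementation: `hashlib.md5` stands in for Digest::MD5, and the function names `md5_of` and `unique_files` are hypothetical. Files are hashed only when another file shares their size, which is the point of comparing sizes first.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unique_files(paths):
    """Drop files whose contents duplicate a lexically earlier file."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    kept = []
    for same_size in by_size.values():
        if len(same_size) == 1:
            kept.extend(same_size)  # a unique size cannot be a duplicate; no hashing needed
            continue
        seen_hashes = set()
        for path in sorted(same_size):  # lexical order decides which copy survives
            file_hash = md5_of(path)
            if file_hash not in seen_hashes:
                seen_hashes.add(file_hash)
                kept.append(path)
    return sorted(kept)
```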
Scan the candidate file list for file extensions which cloc associates with programming languages. Files which match are classified as containing source code for that language. Each file without an extension is opened and its first line read to see if it is a Unix script (anything that begins with #!). If it is a script, the file is classified by its scripting language (if the language is recognized). If the file has neither a recognized extension nor a recognized scripting language, it is ignored.
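The classification rule can be sketched as follows. The two lookup tables are tiny illustrative stand-ins rather than cloc's actual tables, and `classify` and `language_from_shebang` are hypothetical names. Handling of `#!/usr/bin/env` lines is an added assumption that the text above does not describe.

```python
import os

# Illustrative subsets only; cloc's own tables cover far more languages.
EXTENSIONS = {".c": "C", ".py": "Python", ".pl": "Perl", ".sh": "Bourne Shell"}
INTERPRETERS = {
    "sh": "Bourne Shell",
    "bash": "Bourne Again Shell",
    "perl": "Perl",
    "python": "Python",
    "python3": "Python",
}


def language_from_shebang(first_line):
    """Map a '#!' line to a language name, or None if it is not recognized."""
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].split()
    if not parts:
        return None
    program = os.path.basename(parts[0])
    if program == "env" and len(parts) > 1:
        program = os.path.basename(parts[1])  # assumed handling of '#!/usr/bin/env python3'
    return INTERPRETERS.get(program)


def classify(path):
    """Return the language for a file, or None if the file should be ignored."""
    extension = os.path.splitext(path)[1].lower()
    if extension:
        return EXTENSIONS.get(extension)  # unknown extensions are ignored
    # No extension: fall back to inspecting the first line.
    with open(path, "r", errors="replace") as handle:
        return language_from_shebang(handle.readline())
```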
All remaining files in the candidate list should now be source files for known programming languages. For each of these files:
1. Read the entire file into memory.
2. Count the number of lines (= L original).
3. Remove blank lines, then count again (= L non-blank).
4. Loop over the comment filters defined for this language. (For example, C++ has two filters: (1) remove lines that start with optional whitespace followed by // and (2) remove text between /* and */.) Apply each filter to the code to remove comments. Count the leftover lines (= L code).
5. Save the counts for this language:
  - blank lines = L original - L non-blank
  - comment lines = L non-blank - L code
  - code lines = L code
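The counting steps above can be sketched in Python for the C++ example. The two regular-expression filters are simplified assumptions, not cloc's real filters: they do not account for comment markers inside string literals, for instance. The names `count_lines`, `strip_line_comments`, and `strip_block_comments` are hypothetical.

```python
import re


def strip_line_comments(code):
    """Filter 1: blank out lines that start with optional whitespace followed by //."""
    return re.sub(r"(?m)^[ \t]*//.*$", "", code)


def strip_block_comments(code):
    """Filter 2: remove text between /* and */, including across line breaks."""
    return re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)


def count_lines(text, comment_filters):
    """Return blank, comment, and code line counts for one file's contents."""
    lines = text.splitlines()
    l_original = len(lines)

    non_blank = [line for line in lines if line.strip()]
    l_non_blank = len(non_blank)

    code = "\n".join(non_blank)
    for comment_filter in comment_filters:
        code = comment_filter(code)
    # Lines emptied by the filters no longer count as code.
    l_code = len([line for line in code.splitlines() if line.strip()])

    return {
        "blank": l_original - l_non_blank,
        "comment": l_non_blank - l_code,
        "code": l_code,
    }


sample = (
    "#include <cstdio>\n"
    "\n"
    "// entry point\n"
    "int main() {\n"
    "    /* say\n"
    "       hello */\n"
    '    std::puts("hi");  // trailing\n'
    "    return 0;\n"
    "}\n"
)
print(count_lines(sample, [strip_line_comments, strip_block_comments]))
# {'blank': 1, 'comment': 3, 'code': 5}
```

The line with the trailing // comment is counted as code, because filter (1) only matches lines that begin with the comment marker.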