File::CountLines: Efficiently count the number of line breaks in a file.

SYNOPSIS

    use File::CountLines qw(count_lines);
    my $no_of_lines = count_lines('/etc/passwd');

    # other uses
    my $carriage_returns = count_lines(
            'path/to/file.txt',
            style   => 'cr',
        );
    # possible styles are 'native' (the default), 'cr', 'lf'

DESCRIPTION

perlfaq5 answers the question of how to count the number of lines in a file. This module is a convenient wrapper around that method, with additional options.

More specifically, it counts the number of line breaks rather than lines. On Unix systems nearly all text files end with a newline (by convention), so usually the number of lines and the number of line breaks are equal.
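The difference between lines and line breaks can be seen with a minimal sketch in the spirit of the perlfaq5 approach. It is not this module's code; the helper names `count_newlines` and `write_temp` are made up for the illustration, and only core modules are used.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Count "\n" bytes in a file, reading it in binary mode in 4096-byte chunks.
sub count_newlines {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $count = 0;
    my $buf;
    while (read($fh, $buf, 4096)) {
        # tr with an empty replacement list only counts, it does not modify
        $count += ($buf =~ tr/\n//);
    }
    close $fh;
    return $count;
}

# Demonstration helper (hypothetical): write $content to a temporary file.
sub write_temp {
    my ($content) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode $fh;
    print {$fh} $content;
    close $fh or die "Cannot close $path: $!";
    return $path;
}

# Both files contain two lines of text, but only the first ends with a newline.
my $with_final_newline    = count_newlines(write_temp("one\ntwo\n"));
my $without_final_newline = count_newlines(write_temp("one\ntwo"));
print "$with_final_newline $without_final_newline\n";    # prints "2 1"
```

A file whose last line lacks the trailing newline therefore reports one fewer than the number of lines a reader would see.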

Since different operating systems have different ideas of what a newline is, you can specify a "style" option, which can be one of the following values:

native
    This takes Perl's "\n" as the line separator, which should be the right thing in most cases. See perlport for details. This is the default.

cr
    Take a carriage return as line separator (MacOS style)

lf
    Take a line feed as line separator (Unix style)

crlf
    Take a carriage return followed by a line feed as separator (Microsoft Windows style)

Alternatively you can specify an arbitrary separator like this:

    my $lists = count_lines($file, separator => '\end{itemize}');

It is taken verbatim and searched for in the file.

The file is read in equally sized blocks. The size of the blocks can be supplied with the "blocksize" option. The default is 4096, and can be changed by setting $File::CountLines::BlockSize.

Do not use a block size smaller than the length of the separator; that might produce wrong results. (In general there's no reason to choose a smaller block size at all. Depending on your file size, a larger block size might speed things up a bit.)
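Block-wise reading needs care when a multi-byte separator straddles two blocks. The following is a minimal sketch of one way to handle that, by carrying a short tail of each buffer over into the next one. It is an illustration only, not this module's implementation; `count_separator` and `write_temp` are hypothetical names.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Count non-overlapping occurrences of $sep in the file at $path,
# reading $blocksize bytes at a time in binary mode.
sub count_separator {
    my ($path, $sep, $blocksize) = @_;
    $blocksize ||= 4096;
    my $sep_len = length($sep);
    die "Separator must not be empty" unless $sep_len;

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my ($count, $carry, $block) = (0, '', '');
    while (read($fh, $block, $blocksize)) {
        my $buf = $carry . $block;
        my $pos = 0;
        while ((my $found = index($buf, $sep, $pos)) >= 0) {
            $count++;
            $pos = $found + $sep_len;
        }
        # Carry over at most $sep_len - 1 trailing bytes, and never bytes
        # that are part of a match that has already been counted.
        my $keep_from = length($buf) - ($sep_len - 1);
        $keep_from = $pos if $keep_from < $pos;
        $carry = substr($buf, $keep_from);
    }
    close $fh;
    return $count;
}

# Demonstration helper (hypothetical): write $content to a temporary file.
sub write_temp {
    my ($content) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode $fh;
    print {$fh} $content;
    close $fh or die "Cannot close $path: $!";
    return $path;
}

# With a block size of 2, some "\r\n" pairs are split across blocks.
my $crlf_file = write_temp("a\r\nb\r\nc\r\n");
print count_separator($crlf_file, "\r\n", 2), "\n";    # prints 3
```

Without the carried-over tail, every separator split across a block boundary would be missed, which is the kind of wrong result a naive block-by-block search can produce.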

Character Encodings

If you supply a separator yourself, it should not be a decoded string.
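A minimal sketch of why the separator should be bytes rather than characters: the file contents arrive as raw bytes, so a decoded separator does not match them. The example uses U+2029 (PARAGRAPH SEPARATOR) and UTF-8 purely as an illustration, and `count_occurrences` is a hypothetical helper, not part of this module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Count non-overlapping occurrences of $needle in $haystack.
sub count_occurrences {
    my ($haystack, $needle) = @_;
    my ($count, $pos) = (0, 0);
    while ((my $found = index($haystack, $needle, $pos)) >= 0) {
        $count++;
        $pos = $found + length($needle);
    }
    return $count;
}

my $decoded_sep = "\x{2029}";                      # one character
my $encoded_sep = encode('UTF-8', $decoded_sep);   # three bytes: E2 80 A9

# Stand-in for what a binary-mode read of a UTF-8 file would return.
my $file_bytes = encode('UTF-8', "first\x{2029}second\x{2029}");

printf "decoded separator: %d\n", count_occurrences($file_bytes, $decoded_sep);  # 0
printf "encoded separator: %d\n", count_occurrences($file_bytes, $encoded_sep);  # 2
```

Under that assumption, a separator would be encoded to the file's encoding first and the resulting byte string passed as the "separator" option.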

The file is read in binary mode, which implies that this module works fine for text files in ASCII-compatible encodings, including ASCII itself, UTF-8 and all the ISO-8859-* encodings (aka Latin-1, Latin-2, ...).

Note that multi-byte encodings like UTF-32, UTF-16le, UTF-16be and UCS-2 encode a line feed character in such a way that the 0x0A byte is part of the encoded character, but if you search blindly for that byte you will get false positives. For example, LATIN CAPITAL LETTER C WITH DOT ABOVE, U+010A, has the byte sequence "0x0A 0x01" when encoded as UTF-16le, so it would be counted as a newline. Even searching for "0x0A 0x00" might give false positives.
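The false positives can be reproduced with the core Encode module. This is only an illustrative sketch; the pair U+0A05 followed by U+0100 is an example chosen here to show an unaligned "0x0A 0x00" match and does not come from the module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A real line feed and U+010A, both encoded as UTF-16LE.
my $line_feed   = encode('UTF-16LE', "\n");
my $c_dot_above = encode('UTF-16LE', "\x{010A}");
printf "U+000A: %s\n", unpack('H*', $line_feed);      # 0a00
printf "U+010A: %s\n", unpack('H*', $c_dot_above);    # 0a01

# Text without any line feed still contains 0x0A bytes.
my $no_newlines = encode('UTF-16LE', "\x{010A}\x{010A}");
my $false_hits  = ($no_newlines =~ tr/\x0A//);
print "0x0A bytes in text without newlines: $false_hits\n";   # 2

# Searching for the two-byte sequence 0x0A 0x00 can match across a
# character boundary: U+0A05 U+0100 encodes as 05 0A 00 01.
my $pair   = encode('UTF-16LE', "\x{0A05}\x{0100}");
my $offset = index($pair, "\x0A\x00");
print "0x0A 0x00 found at byte offset $offset\n";             # 1
```

The match at an odd byte offset shows that a reliable search would also have to respect the alignment of the code units, not just the byte values.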

So the summary is that for now you can't use this module in a meaningful way to count lines of text files in encodings that are not ASCII-compatible. If there's demand for it, I can implement that though.

Extending

You can add your own EOL styles by adding them to the %File::CountLines::StyleMap hash, with the name of the style as hash key and the separator as the value.

AUTHOR

Moritz Lenz <http://perlgeek.de>, <mailto:[email protected]>

COPYRIGHT AND LICENSE

Example code included in this package may be used as if it were in the Public Domain.

DEVELOPMENT

You can obtain the latest development version from <http://github.com/moritz/File-CountLines>:

    git clone git://github.com/moritz/File-CountLines.git
