wdcnt -- word counter for English/Japanese text files.

SYNOPSIS

wdcnt [-p|-z] [-e] files ...

wdcnt [-p|-z] [-e] < file

wdcnt -v

DESCRIPTION

wdcnt counts and reports English or Japanese words in files or on standard input. It ignores punctuation, digits, quotation marks, and HTML tags. The output is sorted by frequency of occurrence and can be plotted directly with gnuplot(1) as follows:

gnuplot> set log xy
gnuplot> plot "< wdcnt file"
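
The counting described above can be illustrated with a minimal sketch for English text. This is not wdcnt's actual implementation: the function name count_words, the crude tag pattern, the ASCII-only word pattern, and the case folding are all assumptions made for illustration, and Japanese word separation is not covered.

```python
import re
from collections import Counter

# Assumption: anything between '<' and '>' is treated as an HTML tag.
TAG_RE = re.compile(r"<[^>]*>")
# Assumption: a word is a run of ASCII letters, so punctuation,
# digits, and quotation marks are dropped.
WORD_RE = re.compile(r"[A-Za-z]+")


def count_words(text):
    """Return (word, count) pairs, most frequent first."""
    stripped = TAG_RE.sub(" ", text)
    # Case folding is an assumption; the source does not say
    # whether wdcnt distinguishes "The" from "the".
    words = WORD_RE.findall(stripped.lower())
    counts = Counter(words)
    # Sort by descending count, then alphabetically to break ties.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    sample = "<p>The cat saw the dog; the dog saw 2 cats.</p>"
    for word, n in count_words(sample):
        print(n, word)
```

Splitting on letter runs is deliberately naive. For example, "it's" becomes two words, which is one way word separation can be inaccurate.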

OPTIONS

-p: Reports probability instead of number of occurrences. Frequencies are normalized so that they sum to 1.0.
-z: Reports relative frequency instead of number of occurrences. Frequencies are scaled so that the most frequent word has the value 1.0.
-e: Does not use KAKASI. This option is NOT useful for Japanese documents.
-v, -h: Prints usage and version, then exits.
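
The difference between -p and -z can be sketched as follows. This assumes that -p divides each count by the total number of words and that -z divides each count by the largest count. The function names to_probability and to_relative are hypothetical and are not part of wdcnt.

```python
def to_probability(counts):
    """Scale counts so that they sum to 1.0 (assumed meaning of -p)."""
    total = sum(counts)
    if total == 0:
        return []
    return [c / total for c in counts]


def to_relative(counts):
    """Scale counts so that the largest is 1.0 (assumed meaning of -z)."""
    if not counts:
        return []
    top = max(counts)
    return [c / top for c in counts]


if __name__ == "__main__":
    counts = [4, 2, 1, 1]  # occurrence counts, most frequent first
    print(to_probability(counts))  # [0.5, 0.25, 0.125, 0.125]
    print(to_relative(counts))     # [1.0, 0.5, 0.25, 0.25]
```

Both scalings preserve the ordering of the words, so on a log-log plot they only shift the curve vertically.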

HISTORY

For English documents, a traditional one-liner is known:

% cat files ... | tr -s '\040' '\012' | sort | uniq -c | sort -n -r

BUGS

Word separation is not accurate.

AUTHOR

Gotoken <URL:mailto:gotoken@notwork.org>