Java Notes: Example - WordFrequency
This program reads files of words and lists their frequencies. A file of words to ignore can also be supplied. This example illustrates the use of generic data structures (Sets, Maps, and ArrayList), regular expressions (as used by Scanner), and Comparators.
Exercises. A command-line interface is provided, but an interesting exercise is to write a GUI interface.
A text command-line interface
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 |
// File : data-collections/wordfreq2/WordFrequencyCmd.java // Purpose: Main program to test WordCounter // Prints word frequency in source file. Ignores words in ignore file. // Uses Sets, Maps, ArrayList, regular expressions, BufferedReader. // Author : Fred Swartz - April 2007 - Placed in public domain. import java.io.*; import java.util.*; /////////////////////////////////////////////////////////////// WordFrequencyCmd class WordFrequencyCmd { //===================================================================== main public static void main(String[] unused) { Scanner in = new Scanner(System.in); try { //... Read two file names from the input. System.out.println("Name of file containing text to analyze:"); File inputFile = new File(in.nextLine()); System.out.println("Name of file containing words to ignore:"); File ignoreFile = new File(in.nextLine()); //... Supply two files to WordCounter. WordCounter counter = new WordCounter(); counter.ignore(ignoreFile); counter.countWords(inputFile); //... Get the results. String[] wrds = counter.getWords(WordCounter.SortOrder.BY_FREQUENCY); int[] frequency = counter.getFrequencies(WordCounter.SortOrder.BY_FREQUENCY); //... Display the results. int n = counter.getEntryCount(); for (int i=0; i<n; i++) { System.out.println(frequency[i] + " " + wrds[i]); } System.out.println("\nNumber of input words: " + counter.getWordCount()); System.out.println("Number of unique words: " + n); } catch (IOException iox) { System.out.println(iox); } } } |
The "Model" -- Word frequency without any user interface
The model (the logic of the program without any user interface) is implemented primarily in the WordCounter class, which uses utility classes: ComparatorFrequency and ComparatorAlphabetic to sort, and Int to record the frequency counts.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 |
// File : data-collections/wordfreq2/WordCounter.java // Purpose: Provides methods to read ingore file, input file, get results. // Computes the frequency for each word. // Author : Fred Swartz - April 2007 - Placed in public domain. import java.io.*; import java.util.*; import java.util.regex.*; /** Computes word frequency in source file; ignores words in ignore file. * Uses generic Sets, Maps, ArrayLists, regular expressions, Scanner. * @author Fred Swartz * @version 2007-05-06 */ public class WordCounter { //================================================================ constants private static final Comparator<Map.Entry<String, Int>> SORT_BY_FREQUENCY = new ComparatorFrequency(); private static final Comparator<Map.Entry<String, Int>> SORT_ALPHABETICALLY = new ComparatorAlphabetic(); public enum SortOrder {ALPHABETICALLY, BY_FREQUENCY} //=================================================================== fields Set<String> _ignoreWords; // Words to ignore. Map<String, Int> _wordFrequency; // Words -> frequency int _totalWords; // Total source words. //============================================================== constructor /** Constructor */ public WordCounter() { _ignoreWords = new HashSet<String>(); _wordFrequency = new HashMap<String, Int>(); _totalWords = 0; } //=================================================================== ignore /** * Reads file of words to ignore. Ignore words are added to a Set. * The IOException is passed to caller because we certinaly don't * know what the user interface issue is. * * @param ignoreFile File of words to ignore. */ public void ignore(File ignoreFile) throws IOException { Scanner ignoreScanner = new Scanner(ignoreFile); ignoreScanner.useDelimiter("[^A-Za-z]+"); while (ignoreScanner.hasNext()) { _ignoreWords.add(ignoreScanner.next()); } ignoreScanner.close(); // Close underlying file. } //=================================================================== ignore /** * Takes String of words to ignore. Ignore words are added to a Set. * * @param ignore String of words to ignore. */ public void ignore(String ignoreStr) { Scanner ignoreScanner = new Scanner(ignoreStr); ignoreScanner.useDelimiter("[^A-Za-z]+"); while (ignoreScanner.hasNext()) { _ignoreWords.add(ignoreScanner.next()); } } //=============================================================== countWords /** Record the frequency of words in the source file. * May be called more than once. * IOException is passed to caller, who might know what to do with it. *@param File of words to process. */ public void countWords(File sourceFile) throws IOException { Scanner wordScanner = new Scanner(sourceFile); wordScanner.useDelimiter("[^A-Za-z]+"); while (wordScanner.hasNext()) { String word = wordScanner.next(); _totalWords++; //... Add word if not in map, otherewise increment count. if (!_ignoreWords.contains(word)) { Int count = _wordFrequency.get(word); if (count == null) { // Create new entry with count of 1. _wordFrequency.put(word, new Int(1)); } else { // Increment existing count by 1. count.value++; } } } wordScanner.close(); // Close underlying file. } //=============================================================== countWords /** Record the frequency of words in a String. * May be called more than once. *@param String of words to process. */ public void countWords(String source) { Scanner wordScanner = new Scanner(source); wordScanner.useDelimiter("[^A-Za-z]+"); while (wordScanner.hasNext()) { String word = wordScanner.next(); _totalWords++; //... Add word if not in map, otherewise increment count. if (!_ignoreWords.contains(word)) { Int count = _wordFrequency.get(word); if (count == null) { // Create new entry with count of 1. _wordFrequency.put(word, new Int(1)); } else { // Increment existing count by 1. count.value++; } } } } //============================================================= getWordCount /** Returns number of words in all source file(s). *@return Total number of words proccessed in all source files. */ public int getWordCount() { return _totalWords; } //============================================================ getEntryCount /** Returns the number of unique, non-ignored words, in the source file(s). * This number should be used to for the size of the arrays that are * passed to getWordFrequency. *@return Number of unique non-ignored source words. */ public int getEntryCount() { return _wordFrequency.size(); } //========================================================= getWordFrequency /** Stores words and their corresponding frequencies in parallel array lists * parameters. The frequencies are sorted from low to high. * @param words Unique words that were found in the source file(s). * @param counts Frequency of words at corresponding index in words array. */ public void getWordFrequency(ArrayList<String> out_words, ArrayList<Integer> out_counts) { //... Put in ArrayList so sort entries by frequency ArrayList<Map.Entry<String, Int>> entries = new ArrayList<Map.Entry<String, Int>>(_wordFrequency.entrySet()); Collections.sort(entries, new ComparatorFrequency()); //... Add word and frequency to parallel output ArrayLists. for (Map.Entry<String, Int> ent : entries) { out_words.add(ent.getKey()); out_counts.add(ent.getValue().value); } } //================================================================= getWords /** Return array of unique words, in the order specified. * @return An array of the words in the currently selected order. */ public String[] getWords(SortOrder sortBy) { String[] result = new String[_wordFrequency.size()]; ArrayList<Map.Entry<String, Int>> entries = new ArrayList<Map.Entry<String, Int>>(_wordFrequency.entrySet()); if (sortBy == SortOrder.ALPHABETICALLY) { Collections.sort(entries, SORT_ALPHABETICALLY); } else { Collections.sort(entries, SORT_BY_FREQUENCY); } //... Add words to the String array. int i = 0; for (Map.Entry<String, Int> ent : entries) { result[i++] = ent.getKey(); } return result; } //=========================================================== getFrequencies /** Return array of frequencies, in the order specified. * @return An array of the frequencies in the specified order. */ public int[] getFrequencies(SortOrder sortBy) { int[] result = new int[_wordFrequency.size()]; ArrayList<Map.Entry<String, Int>> entries = new ArrayList<Map.Entry<String, Int>>(_wordFrequency.entrySet()); if (sortBy == SortOrder.ALPHABETICALLY) { Collections.sort(entries, SORT_ALPHABETICALLY); } else { Collections.sort(entries, SORT_BY_FREQUENCY); } //... Add words to the String array. int i = 0; for (Map.Entry<String, Int> ent : entries) { result[i++] = ent.getValue().value; } return result; } } |
Comparators
The key-value (Map.Entry
) pairs in a HashMap
are not sorted, but
users would certainly like to see them sorted, either alphabetically
by word, or by frequency.
Because sorts are usually based on comparison of two values at a time,
all that is needed is a way to compare two values.
That's what a Comparator
does -- it defines a compare()
method
that tells how two values compare.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
// File : data-collections/wordfreq2/ComparatorAlphabetic.java // Purpose: A comparator to sort Map.Entries alphabetically. // Author : Fred Swartz - March 2005 - Placed in public domain. import java.util.*; /////////////////////////////////////////////////////// class ComparatorAlphabetic /** Order words alphabetically. */ class ComparatorAlphabetic implements Comparator<Map.Entry<String, Int>> { public int compare(Map.Entry<String, Int> entry1 , Map.Entry<String, Int> entry2) { return entry1.getKey().compareTo(entry2.getKey()); } } |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
// File : data-collections/wordfreq2/ComparatorFrequency.java // Purpose: A comparator to sort Map.Entries by frequency. // Author : Fred Swartz - April 2007 - Placed in public domain. import java.util.*; ////////////////////////////////////////////////////// class ComparatorFrequency /** Order words from least to most frequent, put ties in alphabetical order. */ class ComparatorFrequency implements Comparator<Map.Entry<String, Int>> { public int compare(Map.Entry<String, Int> obj1 , Map.Entry<String, Int> obj2) { int result; int count1 = obj1.getValue().value; int count2 = obj2.getValue().value; if (count1 < count2) { result = -1; } else if (count1 > count2) { result = 1; } else { //... If counts are equal, compare keys alphabetically. result = obj1.getKey().compareTo(obj2.getKey()); } return result; } } |
Mutable Integers
Because only objects (not primitives) can be stored in Collections data structures,
the integer count must be an object, not just an int
. The natural choice would be the
Integer
wrapper type, but it is immutable, which means that once an
Integer
object is created, its value can never be changed. But here we need
to update the value, so here's a class that stores an int that can be changed.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
// File : data-collections/wordfreq2/Int.java // Purpose: Simple value class to hold a mutable int. // Author : Fred Swartz - March 2005 - Placed in public domain. //////////////////////////////////////////////////////////////// value class Int /** Utility class to keep int as Object but allow changes (unlike Integer). * Java collections hold only Objects, not primitives, but need to update value. * The intention is that the public field should be used directly. * For a simple value class this is appropriate. */ class Int { //=================================================================== fields public int value; // Just a simple PUBLIC int. //============================================================== constructor /** Constructor @param value Initial value. */ public Int(int value) { this.value = value; } } |