First steps with Haskell – The word counter

Introduction:
I was lately blessed with an array of coursework assignments for a module called “Paradigms of Programming”. The language I picked to get into functional programming was Haskell, and I had a few fun afternoons with it programming a word counter which counts both the total number of words and the frequency of each word in a given text file.
I’m hoping others might find the resulting code helpful or interesting :)

Preparations:

  • Get Haskell
  • Get the MissingH library (it provides the “replace” function used below; see the import skeleton right after this list)
  • For an overview of Haskell have a look at Wikipedia
  • If you are trying Haskell for the first time you might want to spend a few minutes with this interactive tutorial
  • This is my source code and count.txt file for the example below: Haskell word counter – source/count.txt
  • If you want to know more about the commands I used click on the commands in the code below and they’ll send you to the Haskell code reference for the command
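
All the code snippets below are meant to live inside one “main” do-block. This is the skeleton I’m assuming for the rest of the post (the module name and layout are my own choice, but the imports are the ones the snippets actually need: “toLower” from Data.Char, “sort” and “group” from Data.List, and “replace” from MissingH’s Data.List.Utils):

module Main where

import Data.Char (toLower)        -- lower-casing characters
import Data.List (sort, group)    -- sorting and grouping the word list
import Data.List.Utils (replace)  -- string replacement, from the MissingH package

main :: IO ()
main = do
  -- the code from Stages 1 to 4 below goes here, in order
  return ()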

 

The Code – Stage 1: Word count
Step 1: Let’s be nice and talk to the human:
putStrLn "Welcome to the exciting world of Haskell word counting!"
putStrLn "Please enter the full name of the file which contains the words you want counted:"

Step 2: Then we get the file name from our user and store it in “name”
name <- getLine
putStrLn $ "The file name you have entered is: " ++ name

Step 3: let’s read the file specified in “name” into “contents”
contents <- readFile name

Step 4: let’s get rid of punctuation (e.g. “.” or “!”) which is currently attached to the end of words (thishadafullstop. -> thishadafullstop)
let contents2 = replace "." "" contents
let contents3 = replace "!" "" contents2
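
If you want to see what “replace” does on its own, a quick GHCi check looks roughly like this (assuming you have MissingH installed and have run “import Data.List.Utils” first):

ghci> replace "." "" "thishadafullstop. and another one."
"thishadafullstop and another one"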

Step 5: let’s make the words all lower case so that “Later” and “later” count as the same word
let lower = map toLower contents3

Step 6: let’s feed the file’s content to “words” which will chop it up and return a nice clean list of the contained words
let chop = words (lower)
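
As a quick GHCi sanity check for steps 5 and 6 (with Data.Char imported for “toLower”):

ghci> words (map toLower "This is Haskell This is fun")
["this","is","haskell","this","is","fun"]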

Step 7: let’s take the result of step 6 and feed it to “length” which will return the number of words in the list, then add a “\n” at the end to make it pretty…
let count = show (length chop) ++ "\n"
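
In GHCi that combination looks like this:

ghci> show (length ["this","is","haskell"]) ++ "\n"
"3\n"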

Step 8: let’s throw a pretty result at the human crouched in front of the screen
putStrLn $ "This wonderful Haskell program has found " ++ count ++ "words in " ++ name
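
Putting Stage 1 together, everything so far fits into one small “main” (a sketch that assumes the imports from the skeleton at the top):

main :: IO ()
main = do
  putStrLn "Welcome to the exciting world of Haskell word counting!"
  putStrLn "Please enter the full name of the file which contains the words you want counted:"
  name <- getLine
  putStrLn $ "The file name you have entered is: " ++ name
  contents <- readFile name
  let contents2 = replace "." "" contents      -- strip full stops
  let contents3 = replace "!" "" contents2     -- strip exclamation marks
  let lower = map toLower contents3            -- normalise to lower case
  let chop = words lower                       -- split into a list of words
  let count = show (length chop) ++ "\n"
  putStrLn $ "This wonderful Haskell program has found " ++ count ++ "words in " ++ name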

 

The Code – Stage 2: Word frequency

Step 1: let’s “sort” that lot and get all the identical words together (Example: ["a","a","a","a","a","about","also","an","and","and","and","and","and","api"])
let chop2 = sort chop

Step 2: now we “group” all instances of each word into a separate list (Example: [["a","a","a","a","a"],["about"],["also"],["an"],["and","and","and","and","and"],["api"]])
let chop3 = group chop2

Step 3: for each list x we use “length” to see how many instances we’ve got in the list and “head” to get one example of its content (Example: ["a","a","a","a","a"] -> (5,"a"), ["about"] -> (1,"about"))
let chop4 = map (\x -> (length x, head x)) chop3

Step 4: “reverse $ sort” will sort our list of tuples so the most frequent words come first, in descending order of frequency
let chop5 = reverse (sort chop4)
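
To see the whole Stage 2 pipeline in one go, here is a small GHCi session (with “sort” and “group” imported from Data.List):

ghci> let ws = words "and a and the a and"
ghci> group (sort ws)
[["a","a"],["and","and","and"],["the"]]
ghci> reverse (sort (map (\x -> (length x, head x)) (group (sort ws))))
[(3,"and"),(2,"a"),(1,"the")]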

 

Stage 3: Let’s pick out the Top 20 words in our document

Step 1: Exclusions. There’s not much point in building a top 20 list without this step, unless we want to prove that the list looks very similar across most longer texts… (http://en.wikipedia.org/wiki/Most_common_words_in_English)
let exclusion_1 = "the"
let exclusion_2 = "be"
let exclusion_3 = "to"
let exclusion_4 = "of"
let exclusion_5 = "and"
let exclusion_6 = "a"
let exclusion_7 = "in"
let exclusion_8 = "that"
let exclusion_9 = "have"
let exclusion_10 = "I"
let exclusion_11 = "it"
let exclusion_12 = "for"
let exclusion_13 = "not"
let exclusion_14 = "on"
let exclusion_15 = "with"
let exclusion_16 = "he"
let exclusion_17 = "as"
let exclusion_18 = "you"
let exclusion_19 = "do"
let exclusion_20 = "at"

Step 2: with “filter” we can take out all the tuples where the second element matches one of our exclusions
let chop6 = filter (\(x,y)-> y /= exclusion_1 && y /= exclusion_2 && y /= exclusion_3 && y /= exclusion_4 && y /= exclusion_5 && y /= exclusion_6 && y /= exclusion_7 && y /= exclusion_8 && y /= exclusion_9 && y /= exclusion_10 && y /= exclusion_11 && y /= exclusion_12 && y /= exclusion_13 && y /= exclusion_14 && y /= exclusion_15 && y /= exclusion_16 && y /= exclusion_17 && y /= exclusion_18 && y /= exclusion_19 && y /= exclusion_20) chop5

Step 3: let’s take the top20 words so we don’t throw an endless amount of stuff at our user
let chop7 = take 20 chop6

Step 4: let’s throw another pretty result at the human crouched in front of the screen
putStrLn $ "This wonderful Haskell program has also had a look at the " ++ count ++ "words in " ++ name ++ " and after excluding the 20 most frequently used words as defined by wikipedia these are the most frequently used words in " ++ name ++ ":"
let top20 = show chop7
putStrLn top20

[Screenshot: Haskell word counter v2.1]

 

Stage 4 will usually be commented out; I’ve merely left it in to prove that the endless one-liner generates the same output as the chopped-up version above.
Stage 4: and just because we can, here it is all together in one messy line with tons of ()s and $s:

let alltogether = show $ take 20 $ filter (\(x,y)-> y /= exclusion_1 && y /= exclusion_2 && y /= exclusion_3 && y /= exclusion_4 && y /= exclusion_5 && y /= exclusion_6 && y /= exclusion_7 && y /= exclusion_8 && y /= exclusion_9 && y /= exclusion_10 && y /= exclusion_11 && y /= exclusion_12 && y /= exclusion_13 && y /= exclusion_14 && y /= exclusion_15 && y /= exclusion_16 && y /= exclusion_17 && y /= exclusion_18 && y /= exclusion_19 && y /= exclusion_20) (reverse $ sort $ map (\x -> (length x, head x)) (group $ sort $ words $ map toLower (replace "." "" (replace "!" "" contents))))
putStrLn alltogether

 

Conclusion:

Yes, I admit to growing up with Pascal and C :) Having said that, I’ve spent a lot of time with SQL over the last couple of years, and parts of Haskell like “sort” and “group” felt very intuitive. Just like with SQL or OCaml, it took a little while to get into the flow of how things are done, but the amount of testing required is much lower than with any imperative language I know, which is a big plus on my lazy programmer list ;)


Comments:

  1. Instead of manually testing all the exclusions, you can make a list

    let exclusions = ["the", "be", "to", "of", "and", "a", "in", "that", "have", "I", "it", "for", "not", "on", "with", "he", "as", "you", "do", "at"]

    Then, to check if y is not in this list, use y `notElem` exclusions.
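
    For example, Stage 3’s filter step would then shrink to a single line (a quick sketch):

    let chop6 = filter (\(_,y) -> y `notElem` exclusions) chop5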


    • Thanks! That makes my endless exclusion section shorter, I’ll start playing with it :)

      Peter


