.NET - LINQ method for adding items to a dictionary
I'm trying to learn a bit more about LINQ by implementing Peter Norvig's spelling corrector in C#.
The first part involves taking a large file of words (about 1 million) and putting them into a dictionary where the key is the word and the value is the number of occurrences.
I'd normally do this like so:
    foreach (var word in allWords)
    {
        if (wordCount.ContainsKey(word))
            wordCount[word]++;
        else
            wordCount.Add(word, 1);
    }
where allWords is an IEnumerable<string>.
In LINQ I'm currently doing this:
    var wordCountLINQ = (from word in allWordsLINQ
                         group word by word into groups
                         select groups).ToDictionary(g => g.Key, g => g.Count());
I compare the 2 dictionaries by looking at all the <key, value> pairs, and they're identical, so they're producing the same results.
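A comparison along these lines could be sketched as follows. The helper name HaveSameCounts is hypothetical and is not part of the original benchmark; it only illustrates one way to check that two word-count dictionaries hold the same pairs.

```csharp
using System.Collections.Generic;

public static class DictionaryComparison
{
    // Returns true when both dictionaries hold exactly the same <key, value> pairs.
    // HaveSameCounts is an illustrative name, not taken from the original post.
    public static bool HaveSameCounts(
        IDictionary<string, int> first,
        IDictionary<string, int> second)
    {
        // Different sizes can never match, so bail out early.
        if (first.Count != second.Count)
            return false;

        foreach (var pair in first)
        {
            int otherCount;
            // Every key of the first dictionary must exist in the second
            // with the same count.
            if (!second.TryGetValue(pair.Key, out otherCount) || otherCount != pair.Value)
                return false;
        }

        return true;
    }
}
```

Because the sizes are checked first, looking up every key of one dictionary in the other is enough; there is no need to repeat the check in the opposite direction.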
The foreach loop takes 3.82 secs and the LINQ query takes 4.49 secs.
I'm timing it using the Stopwatch class and I'm running in Release mode. I don't think the performance is bad, I'm just wondering if there is a reason for the difference.
Am I doing the LINQ query in an inefficient way, or am I missing something?
Update: here's the full benchmark code sample:
    // Needs: System, System.Collections.Generic, System.Diagnostics, System.IO,
    // System.Linq, System.Text.RegularExpressions
    public static void TestCode()
    {
        // File can be downloaded from http://norvig.com/big.txt and consists of about a million words.
        const string fileName = @"path_to_file";
        var allWords = from Match m in Regex.Matches(File.ReadAllText(fileName).ToLower(), "[a-z]+", RegexOptions.Compiled)
                       select m.Value;

        var wordCount = new Dictionary<string, int>();
        var timer = new Stopwatch();
        timer.Start();
        foreach (var word in allWords)
        {
            if (wordCount.ContainsKey(word))
                wordCount[word]++;
            else
                wordCount.Add(word, 1);
        }
        timer.Stop();

        Console.WriteLine("foreach loop took {0:0.00} ms ({1:0.00} secs)\n",
            timer.ElapsedMilliseconds, timer.ElapsedMilliseconds / 1000.0);

        // Make LINQ use a different Enumerable (with the same values);
        // if you don't, it becomes way faster, which I assume is a caching thing??
        var allWordsLINQ = from Match m in Regex.Matches(File.ReadAllText(fileName).ToLower(), "[a-z]+", RegexOptions.Compiled)
                           select m.Value;

        timer.Reset();
        timer.Start();
        var wordCountLINQ = (from word in allWordsLINQ
                             group word by word into groups
                             select groups).ToDictionary(g => g.Key, g => g.Count());
        timer.Stop();

        Console.WriteLine("LINQ took {0:0.00} ms ({1:0.00} secs)\n",
            timer.ElapsedMilliseconds, timer.ElapsedMilliseconds / 1000.0);
    }
One of the reasons the LINQ version is slower is that instead of one dictionary, two dictionaries are created:
1. (Internally) by the group by operator; the group by also stores each individual word. You can verify this by looking at a ToArray() rather than a Count(). This is a lot of overhead you don't really need in your case.
2. The ToDictionary method is basically a foreach over the actual LINQ query, where the results from the query are added to a new dictionary. Depending on the number of unique words, this can also take some time.
Another reason the LINQ query is a little slower is that LINQ relies on lambda expressions (the delegate in Dathan's answer), and calling a delegate adds a tiny amount of overhead compared to inline code.
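The two steps and the delegate calls described above can be made visible by writing them out by hand. The following is only a rough sketch of the shape of the work, assuming a plain Dictionary and List for the grouping step; it is not the actual implementation of LINQ to Objects, and the names TwoStepWordCount, keySelector and countSelector are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class TwoStepWordCount
{
    public static Dictionary<string, int> Count(IEnumerable<string> words)
    {
        // Step 1 (roughly what the grouping does): a first dictionary
        // that keeps every single occurrence of every word.
        var groups = new Dictionary<string, List<string>>();
        foreach (var word in words)
        {
            List<string> occurrences;
            if (!groups.TryGetValue(word, out occurrences))
            {
                occurrences = new List<string>();
                groups.Add(word, occurrences);
            }
            occurrences.Add(word);
        }

        // Step 2 (roughly what ToDictionary does): iterate over the groups
        // and fill a second dictionary, calling a delegate for each key
        // and each value instead of running inline code.
        Func<KeyValuePair<string, List<string>>, string> keySelector = g => g.Key;
        Func<KeyValuePair<string, List<string>>, int> countSelector = g => g.Value.Count;

        var result = new Dictionary<string, int>();
        foreach (var group in groups)
            result.Add(keySelector(group), countSelector(group));

        return result;
    }
}
```

Compared with the plain foreach loop from the question, this version allocates a list per unique word, stores every occurrence, and walks the unique words a second time, which is the kind of extra work that could account for the gap in the timings.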
Edit: Note that for some LINQ scenarios (such as LINQ to SQL, but not in-memory LINQ such as here), rewriting the query produces a more optimized plan:
    from word in allWordsLINQ
    group word by word into groups
    select new { Word = groups.Key, Count = groups.Count() }
Note, however, that this doesn't give you a dictionary, but rather a sequence of words and their counts. You can transform this sequence into a dictionary with:
    (from word in allWordsLINQ
     group word by word into groups
     select new { Word = groups.Key, Count = groups.Count() })
        .ToDictionary(g => g.Word, g => g.Count);
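To try the query above on a small input, it can be wrapped in a method. This is a minimal sketch; the class name WordCountExample and the method name CountWords are hypothetical and only serve the example.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class WordCountExample
{
    // Groups the words, projects each group to an anonymous { Word, Count }
    // pair, and then turns that sequence into a dictionary.
    public static Dictionary<string, int> CountWords(IEnumerable<string> allWordsLINQ)
    {
        return (from word in allWordsLINQ
                group word by word into groups
                select new { Word = groups.Key, Count = groups.Count() })
               .ToDictionary(g => g.Word, g => g.Count);
    }
}
```

Calling WordCountExample.CountWords(new[] { "the", "cat", "the" }) gives a dictionary with two entries, "the" mapped to 2 and "cat" mapped to 1.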