Extract data from large structured file using Java/Python


I have a large text file (~100 MB) that needs to be parsed to extract information. I would like to find an efficient way of doing it. The file is structured in blocks:

    mon, 01 jan 2010 01:01:01
      token1 = valuexyz
      token2 = valueabc
      token3 = valuepqr
      ...
      tokenx = value123

    mon, 01 jan 2010 01:02:01
      token1 = valuexyz
      token2 = valueabc
      token3 = valuepqr
      ...
      tokeny = value456

Is there a library that would help in parsing this file? (In Java, Python, or any command-line tool.)

Edit: I know the question is vague, but the key element is not how to read the file, parse it with a regex, etc. I am looking more for library or tool suggestions in terms of performance. For example, ANTLR could have been a possibility, but that tool loads the whole file into memory, which is not good.

Thanks!

For efficient parsing of files, especially big files, you can use awk. For example:

    $ awk -v RS= '{print "====>" $0}' file
    ====>mon, 01 jan 2010 01:01:01
      token1 = valuexyz
      token2 = valueabc
      token3 = valuepqr
      ...
      tokenx = value123
    ====>mon, 01 jan 2010 01:02:01
      token1 = valuexyz
      token2 = valueabc
      token3 = valuepqr
      ...
      tokeny = value456
    ====>mon, 01 jan 2010 01:03:01
      token1 = valuexyz
      token2 = valueabc
      token3 = valuepqr

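As you can see from the arrows, each record is one block, running from one "====>" arrow to the next. This comes from setting the record separator RS to the empty string, which makes awk split records on blank lines. You can then set the field separator FS as well, e.g. to a newline: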

    $ awk -v RS= -v FS="\n" '{print "====>" $1}' file
    ====>mon, 01 jan 2010 01:01:01
    ====>mon, 01 jan 2010 01:02:01
    ====>mon, 01 jan 2010 01:03:01

So in the above example, the first field of every record is the date/time stamp. To get "token1", for example, you could do this:

    $ awk -v RS= -v FS="\n" '{for(i=1;i<=NF;i++) if ($i ~ /token1/){ print $i} }' file
      token1 = valuexyz
      token1 = valuexyz
      token1 = valuexyz
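
Since the question also mentions Python, here is a rough sketch of the same blank-line-separated record idea as a streaming generator. It reads one line at a time, so only the current block is held in memory. The names iter_records and parse_block are made up for illustration, and the sketch assumes every line after the timestamp has the form "name = value"; lines without an "=" are skipped.

```python
import io
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, Dict[str, str]]


def parse_block(block: List[str]) -> Record:
    """Split one block into (timestamp, {token name: value}).

    Assumption: the first line is the timestamp and the remaining lines
    look like 'name = value'. Lines without '=' are ignored.
    """
    timestamp = block[0]
    tokens: Dict[str, str] = {}
    for line in block[1:]:
        name, sep, value = line.partition("=")
        if sep:
            tokens[name.strip()] = value.strip()
    return timestamp, tokens


def iter_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one parsed record per blank-line-separated block.

    'lines' can be an open file object, so the input is consumed lazily
    and never loaded as a whole.
    """
    block: List[str] = []
    for line in lines:
        stripped = line.strip()
        if stripped:
            block.append(stripped)
        elif block:
            # A blank line closes the current block.
            yield parse_block(block)
            block = []
    if block:
        # The last block may not be followed by a blank line.
        yield parse_block(block)


if __name__ == "__main__":
    # Small in-memory stand-in for the real file.
    sample = io.StringIO(
        "mon, 01 jan 2010 01:01:01\n"
        "  token1 = valuexyz\n"
        "  token2 = valueabc\n"
        "\n"
        "mon, 01 jan 2010 01:02:01\n"
        "  token1 = valuexyz\n"
    )
    for timestamp, tokens in iter_records(sample):
        print(timestamp, tokens.get("token1"))
```

For a real file you would pass the file object itself, e.g. `with open("file") as fh:` followed by `for timestamp, tokens in iter_records(fh):`. Whether this is faster than awk on a 100 MB file is something you would have to measure.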
