=head1 NAME aie - Automatic Information Extraction =head1 DESCRIPTION Attempts to extract regular information from non-binary files. AIE accepts any non-binary file as input. It tries to find a repeating sequence in the file and then generalizes a regular expression to extract the information that varies within the repeating structure. =head1 SYNOPSIS $ aie "./Downloadable NLG systems - ACL Wiki.html" Extracting major patterns Length: 40136 . ........................................ Extracting most useful terms Chose token: $VAR1 = ' class="'; Selected instance 133 of 185 $VAR1 = [ '(.*) class\\=\\"(.*)ree\\" (.*)re(.*)\\=\\"(.*)\\"\\>(.*)\\<\\/(.*)re(.*) \\<\\/p\\>\\\\<(.*)re(.*)\\=\\"(.*)fo(.*)\\"', '(.*) class\\=\\"(.*)e\\" (.*)\\=\\"(.*)\\"\\>(.*)\\<\\/(.*)\\>\\<\\/(.*) \\\\<(.*)re(.*)\\=\\"(.*)fo(.*)\\"', '(.*) class\\=\\"(.*)ree\\" (.*)re(.*)\\=\\"(.*)\\"\\>(.*)\\<\\/(.*) \\<\\/p\\>\\\\<(.*)re(.*)\\=\\"(.*)fo(.*)\\"', '(.*) class\\=\\"(.*)ree\\" (.*)re(.*)\\=\\"(.*)\\"\\>(.*)\\<\\/(.*) \\<\\/p\\>\\(.*)fo(.*)cl(.*)as(.*)la(.*)as(.*)re(.*)as(.*)re(.*)re(.*) c(.*)re(.*) \\<\\/p\\> \\<(.*)\\>', '(.*) class\\=\\"(.*)ree\\" (.*)re(.*)\\=\\"(.*)\\"\\>(.*)\\<\\/(.*) \\<\\/p\\>\\(.*)as(.*)re(.*) c(.*)re(.*)rela(.*)as(.*)fo(.*)as(.*) c(.*)la(.*)re(.*)re(.*)\\" (.*)la(.*)as(.*)fo(.*)la(.*)re(.*)cl(.*)re(.*)\\=\\"(.*)fo(.*)\\"', '(.*) class\\=\\"(.*)e\\" (.*)\\=\\"(.*)\\"\\>(.*)\\<\\/(.*)\\>\\<\\/(.*) \\\\<(.*)re(.*)\\=\\"(.*)fo(.*)\\"', ' class\\=\\"(.*)ree\\" (.*)re(.*)\\=\\"(.*)\\"\\>(.*)\\<\\/(.*) \\<\\/p\\>\\(.*)fo(.*) \\<\\/p\\> \\<(.*)\\>', '(.*) class\\=\\"(.*)e\\" (.*)\\=\\"(.*)\\"\\>(.*)\\<\\/(.*)\\>\\<\\/(.*) \\\\<(.*)re(.*)\\=\\"(.*)fo(.*)\\"' ]; $VAR1 = ' class="(.*)e" (.*)="(.*)">(.*)<'; Extracted 23 records $VAR1 = [ [ 'mw-headlin', 'id', 'ASTROGEN', 'ASTROGEN', 'h2>' ], [ 'mw-headlin', 'id', 'Chimera', 'Chimera', 'h2>' ], [ 'mw-headlin', 'id', 'CRISP', 'CRISP', 'h2>' ], ... =head1 AUTHOR Andrew John Dougherty =head1 LICENSE GPLv3 =head1 INSTALLATION Using C: $ cpanm Org::FRDCSA::AIE Manual install: $ perl Makefile.PL $ make $ make install =cut