Write a Perl script that lists fixed-length strings that occur more often in one input file than in the other input file(s).
The script must take the following command line inputs: $N $inputfile1 $inputfile2 ...
$N must be an integer between 2 and 100, and at least two input files must be given. The input files are text files.
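The argument rules above could be checked up front with a small helper. This is only a sketch; the function name `validate_args` and the exact error messages are illustrative, not part of the specification.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an empty string when the arguments are
# acceptable, or a human-readable error message otherwise.
sub validate_args {
    my ($n, @files) = @_;
    return "N must be an integer between 2 and 100"
        unless defined $n && $n =~ /^\d+$/ && $n >= 2 && $n <= 100;
    return "at least two input files are required" if @files < 2;
    return "at most 100 input files are supported" if @files > 100;
    return "";
}
```

In the actual script this would be called on `@ARGV` (`my ($n, @files) = @ARGV;`) before any file is opened.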
When started, the script will read each given input file and extract all distinct strings of $N characters in length. Matching is case sensitive.
For example, if inputfile1 contains "foobar", inputfile2 contains "foo", and $N = 3, the extracted 3-character strings are: foo, oob, oba, bar. These are called SearchWords. SearchWords must not contain linefeed characters.
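The extraction step is a sliding window over each line. A minimal sketch, assuming the file content is already in memory as a string (a real script would read line by line to cope with multi-megabyte files):

```perl
use strict;
use warnings;

# Collect every distinct $n-character substring of $text.
# Splitting on "\n" first guarantees no SearchWord spans a line
# break, so none can contain a linefeed character.
sub extract_searchwords {
    my ($n, $text) = @_;
    my %seen;
    for my $line (split /\n/, $text) {
        # Lines shorter than $n yield an empty range and no words.
        for my $i (0 .. length($line) - $n) {
            $seen{ substr($line, $i, $n) } = 1;
        }
    }
    return sort keys %seen;
}
```

With $N = 3, `extract_searchwords(3, "foobar")` yields the four SearchWords from the example above.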
After all the SearchWords are extracted, the script must count the occurrences of each SearchWord across all input files and calculate an occurrence rate for each.
For example, in the case above, the SearchWord "foo" has an occurrence rate of 50% because it is found once in each of the two input files. The SearchWords "oob", "oba" and "bar" have an occurrence rate of 100% because they are found in only one input file.
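One reading of "occurrence rate" consistent with this example is the share of a SearchWord's total occurrences that fall in the single file where it occurs most often. A sketch under that assumption, taking the per-file counts as arguments:

```perl
use strict;
use warnings;
use List::Util qw(max sum);

# Occurrence rate in percent: the count from the file where the
# SearchWord is most frequent, divided by its total count everywhere.
# This matches the example: counts (1, 1) give 50%, (1, 0) give 100%.
sub occurrence_rate {
    my @counts = @_;
    my $total = sum(@counts) or return 0;   # guard against division by zero
    return 100 * max(@counts) / $total;
}
```

If the intended definition differs (e.g. rate relative to a specific "primary" file rather than the maximum), only the numerator changes.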
If a SearchWord's occurrence rate is over 90%, write that SearchWord and its rate to an output file, one SearchWord=occurrence rate pair per line.
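The output step could be sketched as follows; the output filename and the `%.1f%%` formatting of the rate are assumptions, since the specification does not fix either.

```perl
use strict;
use warnings;

# Write every SearchWord whose rate exceeds 90% as "SearchWord=rate"
# lines. %rates maps SearchWord => occurrence rate in percent.
sub write_report {
    my ($outfile, %rates) = @_;
    open my $out, '>', $outfile or die "Cannot open $outfile: $!";
    for my $word (sort keys %rates) {
        printf {$out} "%s=%.1f%%\n", $word, $rates{$word}
            if $rates{$word} > 90;
    }
    close $out or die "Cannot close $outfile: $!";
}
```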
The script must be able to handle up to 100 input files and must be able to handle big input files (within the limits of available RAM). It must report its progress to the screen, including the number of SearchWords found per input file, how long the analysis has been running, and an estimate of the remaining run time. The expected input files are many megabytes in size.
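A simple way to estimate the remaining run time is to extrapolate from bytes processed so far versus total bytes across all input files. A sketch; all names here are illustrative, and a real script would refresh this periodically while reading:

```perl
use strict;
use warnings;

# Estimated seconds remaining, extrapolated linearly from elapsed
# wall-clock time and the fraction of input bytes already processed.
# Returns undef before any data has been read.
sub eta_seconds {
    my ($start_time, $bytes_done, $bytes_total) = @_;
    return undef unless $bytes_done > 0;
    my $elapsed = time() - $start_time;
    return $elapsed * ($bytes_total - $bytes_done) / $bytes_done;
}
```

The total byte count is cheap to obtain up front with `-s $file` on each input file, so this adds no meaningful overhead to the main pass.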
Sample input and output files are included. Do not suggest solutions in other programming languages.