Author 
Message 
Paul Nutteing science forum beginner
Joined: 20 Jun 2005
Posts: 21

Posted: Fri Jul 14, 2006 6:55 am Post subject:
Name of algorithm for pairwise comparison ?



Could someone suggest the algorithm name to research,
not necessarily the most time efficient
but the most straightforword to impliment.
Instead of brute force checking each line
against each other for the maximum number
of pairwise matches in each of the 18 columns
eg for 11 lines only.
fgadbdeeehcghaaecd
dgdedfggcggbfhafab
dgceefdgehaahaegbc
dgdededhdccifhddce
cebcdeadceeaffdeeg
ddccdefgcfcgefdfef
egcfcffhcebifcbeeg
cgcceececdebdfeeef
dddeddegdfceeecdee
efddfggichgghhaddd
dgdeceeggicaehaeef
only 2 pairwise matches between rows 1 and 2 at column 2 and 15
Number of rows in application 10,000 to 100,000, all same length 

Back to top 


The Last Danish Pastry science forum Guru Wannabe
Joined: 29 Apr 2005
Posts: 176

Posted: Fri Jul 14, 2006 10:42 am Post subject:
Re: Name of algorithm for pairwise comparison ?



"Paul Nutteing (valid email address in post script )"
<nutteing@quickfindit.com> wrote in message
news:4hotefFk4kqU1@individual.net...
Quote:  Could someone suggest the algorithm name to research,
not necessarily the most time efficient
but the most straightforword to impliment.
Instead of brute force checking each line
against each other for the maximum number
of pairwise matches in each of the 18 columns
eg for 11 lines only.
fgadbdeeehcghaaecd
dgdedfggcggbfhafab
dgceefdgehaahaegbc
dgdededhdccifhddce
cebcdeadceeaffdeeg
ddccdefgcfcgefdfef
egcfcffhcebifcbeeg
cgcceececdebdfeeef
dddeddegdfceeecdee
efddfggichgghhaddd
dgdeceeggicaehaeef
only 2 pairwise matches between rows 1 and 2 at column 2 and 15
Number of rows in application 10,000 to 100,000, all same length

Sorry, I cannot suggest an algorithm name.
But I have a couple of questions:
 Are you only interested in the case of 18 columns?
 Are you only interested in the case of an alphabet of 8 letters?

Clive Tooth
www.clivetooth.dk
Stock photos:
http://submit.shutterstock.com/?ref=61771 

Back to top 


Paul Nutteing science forum beginner
Joined: 20 Jun 2005
Posts: 21

Posted: Fri Jul 14, 2006 11:20 am Post subject:
Re: Name of algorithm for pairwise comparison ?



The Last Danish Pastry <clivet@gmail.com> wrote in message
news:4hpaopFluirU1@individual.net...
Quote:  "Paul Nutteing (valid email address in post script )"
nutteing@quickfindit.com> wrote in message
news:4hotefFk4kqU1@individual.net...
Could someone suggest the algorithm name to research,
not necessarily the most time efficient
but the most straightforword to impliment.
Instead of brute force checking each line
against each other for the maximum number
of pairwise matches in each of the 18 columns
eg for 11 lines only.
fgadbdeeehcghaaecd
dgdedfggcggbfhafab
dgceefdgehaahaegbc
dgdededhdccifhddce
cebcdeadceeaffdeeg
ddccdefgcfcgefdfef
egcfcffhcebifcbeeg
cgcceececdebdfeeef
dddeddegdfceeecdee
efddfggichgghhaddd
dgdeceeggicaehaeef
only 2 pairwise matches between rows 1 and 2 at column 2 and 15
Number of rows in application 10,000 to 100,000, all same length
Sorry, I cannot suggest an algorithm name.
But I have a couple of questions:
 Are you only interested in the case of 18 columns?
 Are you only interested in the case of an alphabet of 8 letters?

Clive Tooth
www.clivetooth.dk
Stock photos:
http://submit.shutterstock.com/?ref=61771

18, 20 or 26 columns of about 15 letters
or as I'm actually interested in pairwise pair matches ,
2 at a time, then 9.10 or 13 columns of about 225 items 

Back to top 


stush@rocketmail.com science forum addict
Joined: 10 Nov 2005
Posts: 73

Posted: Fri Jul 14, 2006 11:22 am Post subject:
Re: Name of algorithm for pairwise comparison ?



Paul Nutteing (valid email address in post script ) wrote:
Quote:  Could someone suggest the algorithm name to research,
not necessarily the most time efficient
but the most straightforword to impliment.
Instead of brute force checking each line
against each other for the maximum number
of pairwise matches in each of the 18 columns
eg for 11 lines only.
fgadbdeeehcghaaecd
dgdedfggcggbfhafab
dgceefdgehaahaegbc
dgdededhdccifhddce
cebcdeadceeaffdeeg
ddccdefgcfcgefdfef
egcfcffhcebifcbeeg
cgcceececdebdfeeef
dddeddegdfceeecdee
efddfggichgghhaddd
dgdeceeggicaehaeef
only 2 pairwise matches between rows 1 and 2 at column 2 and 15
Number of rows in application 10,000 to 100,000, all same length

So for any two lines, you want to know the most pairwise matches (in
the sense that two lines are more similar if they have more pairwise
matches I suppose).
Then you would have up to about 4999950000 lines of output (each row
compared against each higher row).
Some clarification would be helpful.
Anyhow...
You could sort each row (remembering the original row number) by each
column which is probably the most straight forward. The more columns
you have, the less atttractive.
You could also hash or store (x,y) where x is row number and y is
character at column y. Thus after one pass of the input, you would have
all rows with a "g" in column 2 together, etc. 

Back to top 


The Last Danish Pastry science forum Guru Wannabe
Joined: 29 Apr 2005
Posts: 176

Posted: Fri Jul 14, 2006 11:30 am Post subject:
Re: Name of algorithm for pairwise comparison ?



"Paul Nutteing (valid email address in post script )"
<nutteing@quickfindit.com> wrote in message
news:4hpcv5FlsahU1@individual.net...
Quote:  The Last Danish Pastry <clivet@gmail.com> wrote in message
news:4hpaopFluirU1@individual.net...
"Paul Nutteing (valid email address in post script )"
nutteing@quickfindit.com> wrote in message
news:4hotefFk4kqU1@individual.net...
Could someone suggest the algorithm name to research,
not necessarily the most time efficient
but the most straightforword to impliment.
Instead of brute force checking each line
against each other for the maximum number
of pairwise matches in each of the 18 columns
eg for 11 lines only.
fgadbdeeehcghaaecd
dgdedfggcggbfhafab
dgceefdgehaahaegbc
dgdededhdccifhddce
cebcdeadceeaffdeeg
ddccdefgcfcgefdfef
egcfcffhcebifcbeeg
cgcceececdebdfeeef
dddeddegdfceeecdee
efddfggichgghhaddd
dgdeceeggicaehaeef
only 2 pairwise matches between rows 1 and 2 at column 2 and 15
Number of rows in application 10,000 to 100,000, all same length
Sorry, I cannot suggest an algorithm name.
But I have a couple of questions:
 Are you only interested in the case of 18 columns?
 Are you only interested in the case of an alphabet of 8 letters?
18, 20 or 26 columns of about 15 letters
or as I'm actually interested in pairwise pair matches ,
2 at a time, then 9.10 or 13 columns of about 225 items

Thank you.
Assuming that the algorithm was running on, say, a 3GHz PC, about what
running time would be acceptable for the 10^5 rows, 13 columns of 225
items case?

Clive Tooth
www.clivetooth.dk
Stock photos:
http://submit.shutterstock.com/?ref=61771 

Back to top 


Paul Nutteing science forum beginner
Joined: 20 Jun 2005
Posts: 21

Posted: Fri Jul 14, 2006 2:27 pm Post subject:
Re: Name of algorithm for pairwise comparison ?



<stush@rocketmail.com> wrote in message
news:1152876145.392661.185250@m73g2000cwd.googlegroups.com...
Quote:  Paul Nutteing (valid email address in post script ) wrote:
Could someone suggest the algorithm name to research,
not necessarily the most time efficient
but the most straightforword to impliment.
Instead of brute force checking each line
against each other for the maximum number
of pairwise matches in each of the 18 columns
eg for 11 lines only.
fgadbdeeehcghaaecd
dgdedfggcggbfhafab
dgceefdgehaahaegbc
dgdededhdccifhddce
cebcdeadceeaffdeeg
ddccdefgcfcgefdfef
egcfcffhcebifcbeeg
cgcceececdebdfeeef
dddeddegdfceeecdee
efddfggichgghhaddd
dgdeceeggicaehaeef
only 2 pairwise matches between rows 1 and 2 at column 2 and 15
Number of rows in application 10,000 to 100,000, all same length
So for any two lines, you want to know the most pairwise matches (in
the sense that two lines are more similar if they have more pairwise
matches I suppose).
Then you would have up to about 4999950000 lines of output (each row
compared against each higher row).
Some clarification would be helpful.
Anyhow...
You could sort each row (remembering the original row number) by each
column which is probably the most straight forward. The more columns
you have, the less atttractive.
You could also hash or store (x,y) where x is row number and y is
character at column y. Thus after one pass of the input, you would have
all rows with a "g" in column 2 together, etc.

Going into
file/properties/details/message/source
may give
an equal spaced font for viewing this.
The comparison check I wish to
run , small example here
Comparing the following 11 lines
fgadbdeeehcghaaecd
dgdedfggcggbfhafab
dgceefdgehaahaegbc
dgdededhdccifhddce
cebcdeadceeaffdeeg
ddccdefgcfcgefdfef
egcfcffhcebifcbeeg
cgcceececdebdfeeef
dddeddegdfceeecdee
efddfggichgghhaddd
dgdeceeggicaehaeef
There are no pairs of rows with
4 or more letterpair (doublet) matches
1 triple match on "dg","de" and "fh" in the two
dgdededhdccifhddce
dgdedfggcggbfhafab
6 double pair matches
dgceefdgehaahaegbc on "eh" & "ha"
fgadbdeeehcghaaecd
dgdeceeggicaehaeef
dgdedfggcggbfhafab
dgdeceeggicaehaeef
dgdededhdccifhddce
egcfcffhcebifcbeeg
cebcdeadceeaffdeeg
cgcceececdebdfeeef
ddccdefgcfcgefdfef
dgdeceeggicaehaeef
dddeddegdfceeecdee
In the anticipated application
10,000 to 100,000 rows of 18,20 or 23
characters from a 15 character palette
taking 2 at a time.
I have a routine that checks row 1
against 2 to end,
row 2 against row 3 to end
etc but on an old pc it is taking all night to run 

Back to top 


James Waldby science forum Guru Wannabe
Joined: 01 May 2005
Posts: 114

Posted: Fri Jul 14, 2006 2:54 pm Post subject:
Re: Name of algorithm for pairwise comparison ?



Paul Nutteing ... wrote:
Quote:  ...clivet@gmail... wrote...
Paul Nutteing ... wrote ...
....
Instead of brute force checking each line
against each other for the maximum number
of pairwise matches in each of the 18 columns
eg for 11 lines only.
fgadbdeeehcghaaecd
dgdedfggcggbfhafab
dgceefdgehaahaegbc
....
only 2 pairwise matches between rows 1 and 2 at column 2 and 15
Number of rows in application 10,000 to 100,000, all same length
....
 Are you only interested in the case of 18 columns?
 Are you only interested in the case of an alphabet of 8 letters?
18, 20 or 26 columns of about 15 letters
or as I'm actually interested in pairwise pair matches ,
2 at a time, then 9.10 or 13 columns of about 225 items

Re pairwise matches, apparently you are indicating that
somehow or other a row of 2m letters has m pairs.
Suppose there are n rows, each having m symbols (your pairs)
over an alphabet of p symbols (your 15*15 letter pairs).
Following is an algorithm that uses O(m*n) time and O(m*n*p) space.
[Note, f(n,m) is O(m*n) if there exist constants C, K such that
n, m > K implies C*m*n > f(n,m).]
1. Create m*p list heads. Let h[i,j] represent the
list head for symbol j appearing in position i.
2. For each of m positions in each of n rows, if the entry
in position q of row r is symbol s, add r to list h[q,s].
3. Report contents of lists having more than 1 entry.
jiw 

Back to top 


stush@rocketmail.com science forum addict
Joined: 10 Nov 2005
Posts: 73

Posted: Fri Jul 14, 2006 5:46 pm Post subject:
Re: Name of algorithm for pairwise comparison ?



Paul Nutteing (valid email address in post script ) wrote:
Quote:  stush@rocketmail.com> wrote in message
news:1152876145.392661.185250@m73g2000cwd.googlegroups.com...
Paul Nutteing (valid email address in post script ) wrote:
Could someone suggest the algorithm name to research,
not necessarily the most time efficient
but the most straightforword to impliment.
Instead of brute force checking each line
against each other for the maximum number
of pairwise matches in each of the 18 columns
eg for 11 lines only.
fgadbdeeehcghaaecd
dgdedfggcggbfhafab
dgceefdgehaahaegbc
dgdededhdccifhddce
cebcdeadceeaffdeeg
ddccdefgcfcgefdfef
egcfcffhcebifcbeeg
cgcceececdebdfeeef
dddeddegdfceeecdee
efddfggichgghhaddd
dgdeceeggicaehaeef
only 2 pairwise matches between rows 1 and 2 at column 2 and 15
Number of rows in application 10,000 to 100,000, all same length
So for any two lines, you want to know the most pairwise matches (in
the sense that two lines are more similar if they have more pairwise
matches I suppose).
Then you would have up to about 4999950000 lines of output (each row
compared against each higher row).
Some clarification would be helpful.
Anyhow...
You could sort each row (remembering the original row number) by each
column which is probably the most straight forward. The more columns
you have, the less atttractive.
You could also hash or store (x,y) where x is row number and y is
character at column y. Thus after one pass of the input, you would have
all rows with a "g" in column 2 together, etc.
Going into
file/properties/details/message/source
may give
an equal spaced font for viewing this.
The comparison check I wish to
run , small example here
Comparing the following 11 lines
fgadbdeeehcghaaecd
dgdedfggcggbfhafab
dgceefdgehaahaegbc
dgdededhdccifhddce
cebcdeadceeaffdeeg
ddccdefgcfcgefdfef
egcfcffhcebifcbeeg
cgcceececdebdfeeef
dddeddegdfceeecdee
efddfggichgghhaddd
dgdeceeggicaehaeef
There are no pairs of rows with
4 or more letterpair (doublet) matches
1 triple match on "dg","de" and "fh" in the two
dgdededhdccifhddce
dgdedfggcggbfhafab
6 double pair matches
dgceefdgehaahaegbc on "eh" & "ha"
fgadbdeeehcghaaecd
dgdeceeggicaehaeef
dgdedfggcggbfhafab
dgdeceeggicaehaeef
dgdededhdccifhddce
egcfcffhcebifcbeeg
cebcdeadceeaffdeeg
cgcceececdebdfeeef
ddccdefgcfcgefdfef
dgdeceeggicaehaeef
dddeddegdfceeecdee
In the anticipated application
10,000 to 100,000 rows of 18,20 or 23
characters from a 15 character palette
taking 2 at a time.
I have a routine that checks row 1
against 2 to end,
row 2 against row 3 to end
etc but on an old pc it is taking all night to run

I assume you mean row 1 against row 2, row 3, ... row n; then row 2
against row 3, row4, ... etc?
Also...
dgcm
dgcp
Does that match only "dg", or also "dg" and "gc" (notice the shared g
in the 2nd case.)
Anyhow....
Just create an array of something like this
p[15][15][23]
p['d']['g'][5] means a "dg" appears in column 5. Now each p should
point to say a linked list of row numbers. Now for each pair in a row,
add it's row number to the appropiate linked list pointed to by an
element p. If it's not the first element of the list, then you have one
or more matches. For each row number already in the list you will have
to update a seperate data structure I describe next.
Say you have a hash table or bbtree or something. The key into this ds
is row number pairs (x,y). Here your store counts of matches. If this
isn't clear consider this example: in row 8, column 1, you have "ec".
So reference the list p['e']['c'][1]. If the list is empty, add 8.
continue. If there is a list there, say 4>7>null, then look up
(4, in the 2nd table. If it doesn't exist, insert it and set count to
1. If it exists, incr. count. Then do the same for (7,. Then add 8
to the list.
After one pass through your data, your 2nd table will have exactly as
its keys all row pairs with at least one match, and also the count of
those matches. If all you care about is the row pair(s) with the most
matches, you can just track that as you as you update the 2nd
structure, you can have ties of course, so use something like an std
set to do it. 

Back to top 


Paul Nutteing science forum beginner
Joined: 20 Jun 2005
Posts: 21

Posted: Fri Jul 14, 2006 6:03 pm Post subject:
Re: Name of algorithm for pairwise comparison ?



Quote:  I assume you mean row 1 against row 2, row 3, ... row n; then row 2
against row 3, row4, ... etc?

Correct
Quote:  Also...
dgcm
dgcp
Does that match only "dg", or also "dg" and "gc" (notice the shared g
in the 2nd case.)

Yes srict barriers between pairs so dg/dg match only in dgcm/dgcp
Quote:  Anyhow....
Just create an array of something like this
p[15][15][23]
p['d']['g'][5] means a "dg" appears in column 5. Now each p should
point to say a linked list of row numbers. Now for each pair in a row,
add it's row number to the appropiate linked list pointed to by an
element p. If it's not the first element of the list, then you have one
or more matches. For each row number already in the list you will have
to update a seperate data structure I describe next.
Say you have a hash table or bbtree or something. The key into this ds
is row number pairs (x,y). Here your store counts of matches. If this
isn't clear consider this example: in row 8, column 1, you have "ec".
So reference the list p['e']['c'][1]. If the list is empty, add 8.
continue. If there is a list there, say 4>7>null, then look up
(4, in the 2nd table. If it doesn't exist, insert it and set count to
1. If it exists, incr. count. Then do the same for (7,. Then add 8
to the list.
After one pass through your data, your 2nd table will have exactly as
its keys all row pairs with at least one match, and also the count of
those matches. If all you care about is the row pair(s) with the most
matches, you can just track that as you as you update the 2nd
structure, you can have ties of course, so use something like an std
set to do it.

I will explore this at the weekend, thanks
For 23 in previous, read 26
Moving recently to the 13/26 structure
I have split the routine into 2
The "each compared to each other" on 9 of the
13 columns
outputting all pairs with doublet
matches >=4.
Then a simple 2 row comparator for the remaining
4 columns. This speeds things up a bit
but I can't go further down that route as
I'm interested in 8 or more pairwise matches
between 2 rows. Outputting a file of >=3
pairwise matches in 8 of the 13 cols would
be a very large file and
slow things down too much probably compared to
expanding the 2 row comparator from 4 to 5 columns. 

Back to top 


stush@rocketmail.com science forum addict
Joined: 10 Nov 2005
Posts: 73

Posted: Fri Jul 14, 2006 6:35 pm Post subject:
Re: Name of algorithm for pairwise comparison ?



Quote:  Just create an array of something like this
p[15][15][23]
p['d']['g'][5] means a "dg" appears in column 5. Now each p should
point to say a linked list of row numbers. Now for each pair in a row,
add it's row number to the appropiate linked list pointed to by an
element p. If it's not the first element of the list, then you have one
or more matches. For each row number already in the list you will have
to update a seperate data structure I describe next.
Say you have a hash table or bbtree or something. The key into this ds
is row number pairs (x,y). Here your store counts of matches. If this
isn't clear consider this example: in row 8, column 1, you have "ec".
So reference the list p['e']['c'][1]. If the list is empty, add 8.
continue. If there is a list there, say 4>7>null, then look up
(4, in the 2nd table. If it doesn't exist, insert it and set count to
1. If it exists, incr. count. Then do the same for (7,. Then add 8
to the list.
After one pass through your data, your 2nd table will have exactly as
its keys all row pairs with at least one match, and also the count of
those matches. If all you care about is the row pair(s) with the most
matches, you can just track that as you as you update the 2nd
structure, you can have ties of course, so use something like an std
set to do it.
I will explore this at the weekend, thanks
For 23 in previous, read 26
Moving recently to the 13/26 structure
I have split the routine into 2
The "each compared to each other" on 9 of the
13 columns
outputting all pairs with doublet
matches >=4.
Then a simple 2 row comparator for the remaining
4 columns. This speeds things up a bit
but I can't go further down that route as
I'm interested in 8 or more pairwise matches
between 2 rows. Outputting a file of >=3
pairwise matches in 8 of the 13 cols would
be a very large file and
slow things down too much probably compared to
expanding the 2 row comparator from 4 to 5 columns.

Ah, since this is more a filter than a statistics collector, you can
drop the 2nd data structure.
While reading say row 100, just retain match count with rows 199. You
can do this very quickly if you reuse memory, for example, create a
vector of nodes and a pointer to the next free node (initially the
first). Now use this to build a btree, when you need to add a node to
the tree, just point to the next free node in the vector (growing the
vector as neccesary). When you are done with the btree (you reach the
end of the record), just move the next free back the first and set the
root pointer (of the btree) to null. Just remember that when you grab
a free node, you will have to explicitly reset it's left/right child
pointers back to null. (pointers could just be indices into the vector
of course with 1 as null). Doing it this ways save you time since you
don't have have to mess around with memory allocation/deallocation
every record.
You are of course outputing your rows when count == 8 so no need to
walk the tree. 

Back to top 


stush@rocketmail.com science forum addict
Joined: 10 Nov 2005
Posts: 73

Posted: Fri Jul 14, 2006 7:19 pm Post subject:
Re: Name of algorithm for pairwise comparison ?



stush@rocketmail.com wrote:
Quote:  Ah, since this is more a filter than a statistics collector, you can
drop the 2nd data structure.
While reading say row 100, just retain match count with rows 199. You
can do this very quickly if you reuse memory, for example, create a
vector of nodes and a pointer to the next free node (initially the
first). Now use this to build a btree, when you need to add a node to
the tree, just point to the next free node in the vector (growing the
vector as neccesary). When you are done with the btree (you reach the
end of the record), just move the next free back the first and set the
root pointer (of the btree) to null. Just remember that when you grab
a free node, you will have to explicitly reset it's left/right child
pointers back to null. (pointers could just be indices into the vector
of course with 1 as null). Doing it this ways save you time since you
don't have have to mess around with memory allocation/deallocation
every record.
You are of course outputing your rows when count == 8 so no need to
walk the tree.

SILLY SILLY ME
All you need is a vector of pairs (x,y).
v[i] = (x,y) means row x (the current row) matches row i, y times. You
need to store the current row because you will be reusing vector slots
each iteration, i.e. if row 3 matches row 10, 4 times, after reading
row 10, v[3] = (10,4). Now you read row 11, you need to know that v[3]
is still holding the count for rows 3,10. This way you don't need to
zero your counts.
Of course add one vector element each iteration. 

Back to top 


Google


Back to top 



The time now is Sun Apr 21, 2019 12:23 am  All times are GMT

