Search   Memberlist   Usergroups
 Page 1 of 1 [11 Posts]
Author Message
Paul Nutteing
science forum beginner

Joined: 20 Jun 2005
Posts: 21

Posted: Fri Jul 14, 2006 6:55 am    Post subject: Name of algorithm for pairwise comparison ?

Could someone suggest the algorithm name to research,
not necessarily the most time efficient
but the most straightforword to impliment.

Instead of brute force checking each line
against each other for the maximum number
of pairwise matches in each of the 18 columns
eg for 11 lines only.

dgdedfggcggbfhafab
dgceefdgehaahaegbc
dgdededhdccifhddce
ddccdefgcfcgefdfef
egcfcffhcebifcbeeg
cgcceececdebdfeeef
dddeddegdfceeecdee
dgdeceeggicaehaeef

only 2 pairwise matches between rows 1 and 2 at column 2 and 15
Number of rows in application 10,000 to 100,000, all same length
The Last Danish Pastry
science forum Guru Wannabe

Joined: 29 Apr 2005
Posts: 176

Posted: Fri Jul 14, 2006 10:42 am    Post subject: Re: Name of algorithm for pairwise comparison ?

"Paul Nutteing (valid email address in post script )"
<nutteing@quickfindit.com> wrote in message
news:4hotefFk4kqU1@individual.net...

 Quote: Could someone suggest the algorithm name to research, not necessarily the most time efficient but the most straightforword to impliment. Instead of brute force checking each line against each other for the maximum number of pairwise matches in each of the 18 columns eg for 11 lines only. fgadbdeeehcghaaecd dgdedfggcggbfhafab dgceefdgehaahaegbc dgdededhdccifhddce cebcdeadceeaffdeeg ddccdefgcfcgefdfef egcfcffhcebifcbeeg cgcceececdebdfeeef dddeddegdfceeecdee efddfggichgghhaddd dgdeceeggicaehaeef only 2 pairwise matches between rows 1 and 2 at column 2 and 15 Number of rows in application 10,000 to 100,000, all same length

Sorry, I cannot suggest an algorithm name.

But I have a couple of questions:

- Are you only interested in the case of 18 columns?

- Are you only interested in the case of an alphabet of 8 letters?

--
Clive Tooth
www.clivetooth.dk
Stock photos:
http://submit.shutterstock.com/?ref=61771
Paul Nutteing
science forum beginner

Joined: 20 Jun 2005
Posts: 21

Posted: Fri Jul 14, 2006 11:20 am    Post subject: Re: Name of algorithm for pairwise comparison ?

The Last Danish Pastry <clivet@gmail.com> wrote in message
news:4hpaopFluirU1@individual.net...
 Quote: "Paul Nutteing (valid email address in post script )" nutteing@quickfindit.com> wrote in message news:4hotefFk4kqU1@individual.net... Could someone suggest the algorithm name to research, not necessarily the most time efficient but the most straightforword to impliment. Instead of brute force checking each line against each other for the maximum number of pairwise matches in each of the 18 columns eg for 11 lines only. fgadbdeeehcghaaecd dgdedfggcggbfhafab dgceefdgehaahaegbc dgdededhdccifhddce cebcdeadceeaffdeeg ddccdefgcfcgefdfef egcfcffhcebifcbeeg cgcceececdebdfeeef dddeddegdfceeecdee efddfggichgghhaddd dgdeceeggicaehaeef only 2 pairwise matches between rows 1 and 2 at column 2 and 15 Number of rows in application 10,000 to 100,000, all same length Sorry, I cannot suggest an algorithm name. But I have a couple of questions: - Are you only interested in the case of 18 columns? - Are you only interested in the case of an alphabet of 8 letters? -- Clive Tooth www.clivetooth.dk Stock photos: http://submit.shutterstock.com/?ref=61771

18, 20 or 26 columns of about 15 letters

or as I'm actually interested in pairwise pair matches ,
2 at a time, then 9.10 or 13 columns of about 225 items
stush@rocketmail.com

Joined: 10 Nov 2005
Posts: 73

Posted: Fri Jul 14, 2006 11:22 am    Post subject: Re: Name of algorithm for pairwise comparison ?

Paul Nutteing (valid email address in post script ) wrote:
 Quote: Could someone suggest the algorithm name to research, not necessarily the most time efficient but the most straightforword to impliment. Instead of brute force checking each line against each other for the maximum number of pairwise matches in each of the 18 columns eg for 11 lines only. fgadbdeeehcghaaecd dgdedfggcggbfhafab dgceefdgehaahaegbc dgdededhdccifhddce cebcdeadceeaffdeeg ddccdefgcfcgefdfef egcfcffhcebifcbeeg cgcceececdebdfeeef dddeddegdfceeecdee efddfggichgghhaddd dgdeceeggicaehaeef only 2 pairwise matches between rows 1 and 2 at column 2 and 15 Number of rows in application 10,000 to 100,000, all same length

So for any two lines, you want to know the most pairwise matches (in
the sense that two lines are more similar if they have more pairwise
matches I suppose).

Then you would have up to about 4999950000 lines of output (each row
compared against each higher row).

Anyhow...

You could sort each row (remembering the original row number) by each
column which is probably the most straight forward. The more columns
you have, the less atttractive.

You could also hash or store (x,y) where x is row number and y is
character at column y. Thus after one pass of the input, you would have
all rows with a "g" in column 2 together, etc.
The Last Danish Pastry
science forum Guru Wannabe

Joined: 29 Apr 2005
Posts: 176

Posted: Fri Jul 14, 2006 11:30 am    Post subject: Re: Name of algorithm for pairwise comparison ?

"Paul Nutteing (valid email address in post script )"
<nutteing@quickfindit.com> wrote in message
news:4hpcv5FlsahU1@individual.net...

 Quote: The Last Danish Pastry wrote in message news:4hpaopFluirU1@individual.net... "Paul Nutteing (valid email address in post script )" nutteing@quickfindit.com> wrote in message news:4hotefFk4kqU1@individual.net... Could someone suggest the algorithm name to research, not necessarily the most time efficient but the most straightforword to impliment. Instead of brute force checking each line against each other for the maximum number of pairwise matches in each of the 18 columns eg for 11 lines only. fgadbdeeehcghaaecd dgdedfggcggbfhafab dgceefdgehaahaegbc dgdededhdccifhddce cebcdeadceeaffdeeg ddccdefgcfcgefdfef egcfcffhcebifcbeeg cgcceececdebdfeeef dddeddegdfceeecdee efddfggichgghhaddd dgdeceeggicaehaeef only 2 pairwise matches between rows 1 and 2 at column 2 and 15 Number of rows in application 10,000 to 100,000, all same length Sorry, I cannot suggest an algorithm name. But I have a couple of questions: - Are you only interested in the case of 18 columns? - Are you only interested in the case of an alphabet of 8 letters? 18, 20 or 26 columns of about 15 letters or as I'm actually interested in pairwise pair matches , 2 at a time, then 9.10 or 13 columns of about 225 items

Thank you.

Assuming that the algorithm was running on, say, a 3GHz PC, about what
running time would be acceptable for the 10^5 rows, 13 columns of 225
items case?

--
Clive Tooth
www.clivetooth.dk
Stock photos:
http://submit.shutterstock.com/?ref=61771
Paul Nutteing
science forum beginner

Joined: 20 Jun 2005
Posts: 21

Posted: Fri Jul 14, 2006 2:27 pm    Post subject: Re: Name of algorithm for pairwise comparison ?

<stush@rocketmail.com> wrote in message
 Quote: Paul Nutteing (valid email address in post script ) wrote: Could someone suggest the algorithm name to research, not necessarily the most time efficient but the most straightforword to impliment. Instead of brute force checking each line against each other for the maximum number of pairwise matches in each of the 18 columns eg for 11 lines only. fgadbdeeehcghaaecd dgdedfggcggbfhafab dgceefdgehaahaegbc dgdededhdccifhddce cebcdeadceeaffdeeg ddccdefgcfcgefdfef egcfcffhcebifcbeeg cgcceececdebdfeeef dddeddegdfceeecdee efddfggichgghhaddd dgdeceeggicaehaeef only 2 pairwise matches between rows 1 and 2 at column 2 and 15 Number of rows in application 10,000 to 100,000, all same length So for any two lines, you want to know the most pairwise matches (in the sense that two lines are more similar if they have more pairwise matches I suppose). Then you would have up to about 4999950000 lines of output (each row compared against each higher row). Some clarification would be helpful. Anyhow... You could sort each row (remembering the original row number) by each column which is probably the most straight forward. The more columns you have, the less atttractive. You could also hash or store (x,y) where x is row number and y is character at column y. Thus after one pass of the input, you would have all rows with a "g" in column 2 together, etc.

Going into
file/properties/details/message/source
may give
an equal spaced font for viewing this.

The comparison check I wish to
run , small example here

Comparing the following 11 lines

dgdedfggcggbfhafab
dgceefdgehaahaegbc
dgdededhdccifhddce
ddccdefgcfcgefdfef
egcfcffhcebifcbeeg
cgcceececdebdfeeef
dddeddegdfceeecdee
dgdeceeggicaehaeef

There are no pairs of rows with
4 or more letter-pair (doublet) matches

1 triple match on "dg","de" and "fh" in the two
dgdededhdccifhddce
dgdedfggcggbfhafab

6 double pair matches

dgceefdgehaahaegbc on "eh" & "ha"

dgdeceeggicaehaeef
dgdedfggcggbfhafab

dgdeceeggicaehaeef
dgdededhdccifhddce

egcfcffhcebifcbeeg

cgcceececdebdfeeef
ddccdefgcfcgefdfef

dgdeceeggicaehaeef
dddeddegdfceeecdee

In the anticipated application
10,000 to 100,000 rows of 18,20 or 23
characters from a 15 character palette
taking 2 at a time.

I have a routine that checks row 1
against 2 to end,
row 2 against row 3 to end
etc but on an old pc it is taking all night to run
James Waldby
science forum Guru Wannabe

Joined: 01 May 2005
Posts: 114

Posted: Fri Jul 14, 2006 2:54 pm    Post subject: Re: Name of algorithm for pairwise comparison ?

Paul Nutteing ... wrote:
 Quote: ...clivet@gmail... wrote... Paul Nutteing ... wrote ... .... Instead of brute force checking each line against each other for the maximum number of pairwise matches in each of the 18 columns eg for 11 lines only. fgadbdeeehcghaaecd dgdedfggcggbfhafab dgceefdgehaahaegbc .... only 2 pairwise matches between rows 1 and 2 at column 2 and 15 Number of rows in application 10,000 to 100,000, all same length .... - Are you only interested in the case of 18 columns? - Are you only interested in the case of an alphabet of 8 letters? 18, 20 or 26 columns of about 15 letters or as I'm actually interested in pairwise pair matches , 2 at a time, then 9.10 or 13 columns of about 225 items

Re pairwise matches, apparently you are indicating that
somehow or other a row of 2m letters has m pairs.

Suppose there are n rows, each having m symbols (your pairs)
over an alphabet of p symbols (your 15*15 letter pairs).
Following is an algorithm that uses O(m*n) time and O(m*n*p) space.
[Note, f(n,m) is O(m*n) if there exist constants C, K such that
n, m > K implies C*m*n > f(n,m).]

1. Create m*p list heads. Let h[i,j] represent the
list head for symbol j appearing in position i.

2. For each of m positions in each of n rows, if the entry
in position q of row r is symbol s, add r to list h[q,s].

3. Report contents of lists having more than 1 entry.

-jiw
stush@rocketmail.com

Joined: 10 Nov 2005
Posts: 73

Posted: Fri Jul 14, 2006 5:46 pm    Post subject: Re: Name of algorithm for pairwise comparison ?

Paul Nutteing (valid email address in post script ) wrote:

I assume you mean row 1 against row 2, row 3, ... row n; then row 2
against row 3, row4, ... etc?

Also...

dgcm
dgcp

Does that match only "dg", or also "dg" and "gc" (notice the shared g
in the 2nd case.)

Anyhow....

Just create an array of something like this

p[15][15][23]

p['d']['g'][5] means a "dg" appears in column 5. Now each p should
point to say a linked list of row numbers. Now for each pair in a row,
add it's row number to the appropiate linked list pointed to by an
element p. If it's not the first element of the list, then you have one
or more matches. For each row number already in the list you will have
to update a seperate data structure I describe next.

Say you have a hash table or bbtree or something. The key into this ds
is row number pairs (x,y). Here your store counts of matches. If this
isn't clear consider this example: in row 8, column 1, you have "ec".
So reference the list p['e']['c'][1]. If the list is empty, add 8.
continue. If there is a list there, say 4-->7-->null, then look up
(4, in the 2nd table. If it doesn't exist, insert it and set count to
1. If it exists, incr. count. Then do the same for (7,. Then add 8
to the list.

After one pass through your data, your 2nd table will have exactly as
its keys all row pairs with at least one match, and also the count of
those matches. If all you care about is the row pair(s) with the most
matches, you can just track that as you as you update the 2nd
structure, you can have ties of course, so use something like an std
set to do it.
Paul Nutteing
science forum beginner

Joined: 20 Jun 2005
Posts: 21

Posted: Fri Jul 14, 2006 6:03 pm    Post subject: Re: Name of algorithm for pairwise comparison ?

 Quote: I assume you mean row 1 against row 2, row 3, ... row n; then row 2 against row 3, row4, ... etc?

Correct

 Quote: Also... dgcm dgcp Does that match only "dg", or also "dg" and "gc" (notice the shared g in the 2nd case.)

Yes srict barriers between pairs so dg/dg match only in dgcm/dgcp

 Quote: Anyhow.... Just create an array of something like this p[15][15][23] p['d']['g'][5] means a "dg" appears in column 5. Now each p should point to say a linked list of row numbers. Now for each pair in a row, add it's row number to the appropiate linked list pointed to by an element p. If it's not the first element of the list, then you have one or more matches. For each row number already in the list you will have to update a seperate data structure I describe next. Say you have a hash table or bbtree or something. The key into this ds is row number pairs (x,y). Here your store counts of matches. If this isn't clear consider this example: in row 8, column 1, you have "ec". So reference the list p['e']['c'][1]. If the list is empty, add 8. continue. If there is a list there, say 4-->7-->null, then look up (4, in the 2nd table. If it doesn't exist, insert it and set count to 1. If it exists, incr. count. Then do the same for (7,. Then add 8 to the list. After one pass through your data, your 2nd table will have exactly as its keys all row pairs with at least one match, and also the count of those matches. If all you care about is the row pair(s) with the most matches, you can just track that as you as you update the 2nd structure, you can have ties of course, so use something like an std set to do it.

I will explore this at the weekend, thanks

For 23 in previous, read 26

Moving recently to the 13/26 structure
I have split the routine into 2
The "each compared to each other" on 9 of the
13 columns
outputting all pairs with doublet
matches >=4.
Then a simple 2 row comparator for the remaining
4 columns. This speeds things up a bit
but I can't go further down that route as
I'm interested in 8 or more pairwise matches
between 2 rows. Outputting a file of >=3
pairwise matches in 8 of the 13 cols would
be a very large file and
slow things down too much probably compared to
expanding the 2 row comparator from 4 to 5 columns.
stush@rocketmail.com

Joined: 10 Nov 2005
Posts: 73

Posted: Fri Jul 14, 2006 6:35 pm    Post subject: Re: Name of algorithm for pairwise comparison ?

Ah, since this is more a filter than a statistics collector, you can
drop the 2nd data structure.

While reading say row 100, just retain match count with rows 1-99. You
can do this very quickly if you reuse memory, for example, create a
vector of nodes and a pointer to the next free node (initially the
first). Now use this to build a b-tree, when you need to add a node to
the tree, just point to the next free node in the vector (growing the
vector as neccesary). When you are done with the b-tree (you reach the
end of the record), just move the next free back the first and set the
root pointer (of the b-tree) to null. Just remember that when you grab
a free node, you will have to explicitly reset it's left/right child
pointers back to null. (pointers could just be indices into the vector
of course with -1 as null). Doing it this ways save you time since you
don't have have to mess around with memory allocation/deallocation
every record.

You are of course outputing your rows when count == 8 so no need to
walk the tree.
stush@rocketmail.com

Joined: 10 Nov 2005
Posts: 73

Posted: Fri Jul 14, 2006 7:19 pm    Post subject: Re: Name of algorithm for pairwise comparison ?

stush@rocketmail.com wrote:
 Quote: Ah, since this is more a filter than a statistics collector, you can drop the 2nd data structure. While reading say row 100, just retain match count with rows 1-99. You can do this very quickly if you reuse memory, for example, create a vector of nodes and a pointer to the next free node (initially the first). Now use this to build a b-tree, when you need to add a node to the tree, just point to the next free node in the vector (growing the vector as neccesary). When you are done with the b-tree (you reach the end of the record), just move the next free back the first and set the root pointer (of the b-tree) to null. Just remember that when you grab a free node, you will have to explicitly reset it's left/right child pointers back to null. (pointers could just be indices into the vector of course with -1 as null). Doing it this ways save you time since you don't have have to mess around with memory allocation/deallocation every record. You are of course outputing your rows when count == 8 so no need to walk the tree.

SILLY SILLY ME

All you need is a vector of pairs (x,y).

v[i] = (x,y) means row x (the current row) matches row i, y times. You
need to store the current row because you will be re-using vector slots
each iteration, i.e. if row 3 matches row 10, 4 times, after reading
row 10, v[3] = (10,4). Now you read row 11, you need to know that v[3]
is still holding the count for rows 3,10. This way you don't need to

Of course add one vector element each iteration.

 Display posts from previous: All Posts1 Day7 Days2 Weeks1 Month3 Months6 Months1 Year Oldest FirstNewest First
 Page 1 of 1 [11 Posts]
 The time now is Sun Apr 21, 2019 12:23 am | All times are GMT
 Jump to: Select a forum-------------------Forum index|___Science and Technology    |___Math    |   |___Research    |   |___num-analysis    |   |___Symbolic    |   |___Combinatorics    |   |___Probability    |   |   |___Prediction    |   |       |   |___Undergraduate    |   |___Recreational    |       |___Physics    |   |___Research    |   |___New Theories    |   |___Acoustics    |   |___Electromagnetics    |   |___Strings    |   |___Particle    |   |___Fusion    |   |___Relativity    |       |___Chem    |   |___Analytical    |   |___Electrochem    |   |   |___Battery    |   |       |   |___Coatings    |       |___Engineering        |___Control        |___Mechanics        |___Chemical

 Topic Author Forum Replies Last Post Similar Topics how to deduce a number validation algorithm? Tim_Mac Math 1 Fri Jul 21, 2006 10:43 am Need an algorithm Dave111 Math 4 Fri Jul 14, 2006 2:19 am A POSSIBLE FACTORING ALGORITHM Achilleas D Palamiotis Math 7 Thu Jul 13, 2006 6:14 am Algorithm to fit rectangles with different areas inside a... Dani Camps Research 0 Tue Jul 11, 2006 9:14 am ? algorithm to fill out the gap of Sherman-Morrison formula Cheng Cosine num-analysis 7 Sun Jul 09, 2006 5:13 pm