Topic: Fellow Coders of RMRK

#1 • Jun 22, 2010

I need your help on my project.

I may require more help later, idk.

Basically, here's what I need done (not that I'm asking you guys to do it for me but rather give me an idea, conceptually, of what I need to do):

I have a .txt file with a lot of characters in it, around 100,000. What I need to do is pick a sequence length (for the purposes of explaining, I think the number 20 will be good) and find every sequence of that length that repeats within the .txt file.

#2 • Jun 22, 2010

Regular Expressions are key here, I would think.
Build a regular expression from those 20 characters, and then keep scanning through the text file.

#3 • Jun 22, 2010

But the 20 characters aren't defined. I'll be straightforward and say that I'm looking for repeats of DNA sequences. There are four possible bases (A, T, G, C). Since there are four choices for each of the 20 positions, the number of possible combinations is 4^20. >_>
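To put numbers on that: the space of possible 20-mers is huge, but a file can only contain as many 20-character windows as it has starting positions. A quick check, assuming the file is exactly 100,000 characters long (the thread only says "around"):

```python
KMER_LENGTH = 20       # window size used as the example in this thread
FILE_LENGTH = 100000   # assumed exact size of the test file

# Four choices (A, T, G, C) at each of the 20 positions.
possible_kmers = 4 ** KMER_LENGTH

# One window per starting position that still leaves room for a full k-mer.
windows_in_file = FILE_LENGTH - KMER_LENGTH + 1

print(possible_kmers)    # 1099511627776
print(windows_in_file)   # 99981
```

So a dictionary keyed by the sequences that actually occur never needs more entries than there are windows in the file.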

Oh, but I had an idea; I'm searching for repeats within the human genome. My mentor told me I should be using a hash array/dictionary/hash table to store the number of repeats. What do you guys think about me doing something like this (this is pseudocode):

Code: [Select]

import thang.txt
a = 0
b = 19
dictionary = {}
for b <= length_of_file:
     c = get_characters(a,b)
     if c in dictionary{} = false
          dictionary['c'] = 1
     elsif
          d = dictionary['c']
          d += 1
          dictionary['c'] = d
     end
     a += 1
     b += 1
end

I'm working with Python, I think, so the code's a bit Python-ey. Basically, I take the first twenty characters and check whether that sequence already exists in the dictionary/hash array/hash table. If it doesn't, it gets added with a value of 1, meaning there's at least one instance of it in the sequence; if it does, its count goes up by 1. Then both positions (a and b) are incremented by 1, so the next pass checks characters 2-21 and does the same thing.
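For reference, the idea described above could look roughly like this in actual Python 3. This is only a sketch: the function name count_kmers and the short sample string are made up for illustration, and the whole sequence is assumed to already be in memory as one string.

```python
def count_kmers(sequence, k):
    """Count every length-k substring, sliding the window one character at a time."""
    counts = {}
    # Stop at the last position that still leaves room for a full k-mer.
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        # get() returns 0 for a k-mer that has not been seen yet.
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts


counts = count_kmers("ATGATGATC", 3)
# Keep only the k-mers that occur more than once.
repeats = {kmer: n for kmer, n in counts.items() if n > 1}
print(repeats)
```

Slicing with start:start + k replaces the separate a and b counters from the pseudocode.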

Does anyone knowledgeable of Python's limitations know if such a feat is possible in Python?

#4 • Jun 22, 2010

I use Python. This is easily possible. And since this is pseudo code, I won't remark on small errors I see.
I don't know if that solution would work, but seeing how easy it is to try things in Python, I am sure you will get it right.

#5 • Jun 22, 2010

Oh, you use Python?

Do you mind getting on the IRC and helping me code this a bit? I know basically what I want to do/try, but I don't know the exact syntax.

#6 • Jun 22, 2010

Ok, I am on for a while

#7 • Jun 22, 2010

Okay, I've written my code and run into a problem. I think I've diagnosed it, but I don't know how to fix it.

Code: [Select]

while (kmer_end <= SEQUENCE_LENGTH):                               # While the end of the k-mer does not exceed the file length, do loop:
    read_kmer = sequence_to_be_read.read(KMER_LENGTH)              # Get the k-mer to be read.
    if (read_kmer in repeat_sequences == False):                   # If the k-mer is not in the dictionary,
        repeat_sequences[read_kmer] = 1                            # Add it to the dictionary and set the occurrence to 1.
    elif (read_kmer in repeat_sequences == True):                  # If the k-mer IS in the dictionary,
        repeat_sequences[read_kmer] += 1                           # And set the actual count to that number.

And yeah, I've heavily commented the whole script, lol. In fact, basically every line has a comment on it. Anyway, the problem is that the k-mers aren't being written to the dictionary. I've confirmed that the program can print out all the k-mers it reads: the AATGC, GTAAC, whatever. My script also correctly identifies whether or not a k-mer is already in the dictionary. I initialized the dictionary with one k-mer that I knew occurred 3 times, and the script reported "True" for all three instances. But everything else failed. I don't know how to write it so that the keys get added to the dictionary correctly.

I've tried converting the read_kmer's to string, but it doesn't work. Either that or they're already strings.

EDIT: It's the latter.

#8 • Jun 22, 2010

Try this. (This is slightly more Pythonic than your original code.)

Code: [Select]

# Dictionary mapping each piece of genetic information to its count
repeat_sequences = {}
# Opens a file with the genetic information
sequence_to_be_read = open("whateverfile.txt", "r")
# Reads the first line in the text file and takes off the newline character
read_kmer = sequence_to_be_read.readline().strip("\n")
# While the line just read is not empty (blank line or end of file), do loop
while read_kmer:
    # If the genetic information isn't in your dictionary
    if read_kmer not in repeat_sequences:
        # Add it to the dictionary
        repeat_sequences[read_kmer] = 1
    # Else, if genetic information is in the dictionary
    else:
        # Add 1 to number for that genetic information
        repeat_sequences[read_kmer] += 1
    # Read next line in file and take out the newline character
    read_kmer = sequence_to_be_read.readline().strip("\n")

#9 • Jun 22, 2010

Yeah, I got help earlier from elsewhere because you were gone, but they told me to get rid of the "== True" in the "in repeat_sequences == True" part, and that fixed my problem. So, so far, the script is going great! There are some things I need to work to change, though.
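For anyone hitting the same thing: Python chains comparison operators, and "in" counts as one of them, so "x in d == False" is evaluated as "(x in d) and (d == False)". A dictionary never equals True or False, so both branches of the original if/elif were always skipped. A small demonstration, using made-up sample values:

```python
repeat_sequences = {"AATGC": 3}

# Key is absent: (False) and (dict == False) -> False, so the "if" body is skipped.
absent_check = "GTAAC" in repeat_sequences == False

# Key is present: (True) and (dict == True) -> False, so the "elif" body is skipped too.
present_check = "AATGC" in repeat_sequences == True

# What was actually meant: plain membership tests.
absent_intended = "GTAAC" not in repeat_sequences
present_intended = "AATGC" in repeat_sequences

print(absent_check, present_check)         # False False
print(absent_intended, present_intended)   # True True
```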

First, in the file that I'm going to fully run the code through (well, this file is the test file, and it's only 100,000 letters), there are 50 letters per line, and there's a space after each set of 50. If the script doesn't read it properly, then \n's are going to show up everywhere. Wait, I just looked at your code, redyugi, and there's that .strip("\n") part. Will applying that change make it work for the entire .txt file?

Also, I need to create an option whether or not to import an already existing dictionary considering there are 23 full files I need to run the program through. Why 23? 23 chromosomes of a human. P: But this should be pretty easy.

Also, for your "while read_kmer:," wouldn't that make it so that if the length of the k-mer is 5 and you're 4 characters away from the end, it'll still try and run the code?

#10 • Jun 23, 2010

Quote from: mastermoo420 on June 22, 2010, 10:20:25 PM

Yeah, I got help earlier from elsewhere because you were gone, but they told me to get rid of the "== True" in the "in repeat_sequences == True" part, and that fixed my problem. So, so far, the script is going great! There are some things I need to work to change, though.

Good

Quote

First, in the file that I'm going to fully run the code through (well, this file is the test file, and it's only 100,000 letters), there are 50 letters per line, and there's a space after each set of 50. If the script doesn't read it properly, then \n's are going to show up everywhere. Wait, I just looked at your code, redyugi, and there's that .strip("\n") part. Will applying that change make it work for the entire .txt file?

It strips the "\n" from the end of the current line. So for a blank line, it turns "\n" into "". That would end the loop I wrote above, because an empty string evaluates to False in a Boolean context.
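A quick way to see that behaviour without a real file is to stand in an io.StringIO object for the opened file. The sample contents below are made up, with a blank line placed in the middle on purpose:

```python
import io

# Stand-in for the opened text file; note the blank line before "CCGTA".
fake_file = io.StringIO("AATGC\nGTAAC\n\nCCGTA\n")

lines_seen = []
line = fake_file.readline().strip("\n")
while line:  # "" is falsy, so a blank line ends the loop just like end-of-file does
    lines_seen.append(line)
    line = fake_file.readline().strip("\n")

print(lines_seen)  # ['AATGC', 'GTAAC']
```

The last line is never reached here, so a file with blank lines in the middle would need a different stopping condition.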

Quote

Also, I need to create an option whether or not to import an already existing dictionary considering there are 23 full files I need to run the program through. Why 23? 23 chromosomes of a human. P: But this should be pretty easy.

Yes, this would be rather easy. Ask if you need help though.
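One possible way to do that, sketched with the standard json module. The helper names load_counts and save_counts and the file name kmer_counts.json are made up for illustration, and a temporary directory is used only so the example can run anywhere:

```python
import json
import os
import tempfile


def load_counts(path):
    """Return the saved dictionary, or an empty one if nothing has been saved yet."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {}


def save_counts(counts, path):
    """Write the dictionary to disk so the next run can pick it up."""
    with open(path, "w") as handle:
        json.dump(counts, handle)


path = os.path.join(tempfile.mkdtemp(), "kmer_counts.json")

counts = load_counts(path)                    # {} on the first run
counts["AATGC"] = counts.get("AATGC", 0) + 3  # pretend this came from one file
save_counts(counts, path)

reloaded = load_counts(path)                  # what the next run would start from
print(reloaded)  # {'AATGC': 3}
```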

Quote

Also, for your "while read_kmer:," wouldn't that make it so that if the length of the k-mer is 5 and you're 4 characters away from the end, it'll still try and run the code?

That makes it loop until it reads an empty line, which happens at a blank line or at the end of the file. It does not check the length. You would have to do that yourself, but that is a simple "if" statement, so...

#11 • Jun 23, 2010

Yeah, my original coding for that kinda takes care of that. I guess I'll have to add in stripping (oh-la-la) the text. Also, would len(sequence_to_be_read) return a value that displays the number of characters in the file?

#12 • Jun 23, 2010

I am not sure. Try it in IDLE (or whatever IDE you are using). That is what it is there for. I test out small ideas like that all the time.
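For the record, calling len() on a file object raises a TypeError, because file objects have no length; it is the string returned by read() that can be measured. A small sketch, using io.StringIO with made-up contents as a stand-in for the opened file:

```python
import io

# Stand-in for open("whatever.txt", "r"); the contents are made up.
sequence_to_be_read = io.StringIO("AAAAA\nTTTTT\n")

try:
    len(sequence_to_be_read)
    file_has_len = True
except TypeError:
    # File-like objects do not support len().
    file_has_len = False

text = sequence_to_be_read.read()
total_chars = len(text)                    # counts the newline characters too
base_chars = len(text.replace("\n", ""))   # letters only

print(file_has_len, total_chars, base_chars)  # False 12 10
```

Note that read() pulls the whole file into memory, which is fine for a 100,000-letter test file but may not be for much larger ones.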

#13 • Jun 23, 2010

I use unix at my internship so, lol. Yeh. But yeah, I do test out small things, too. I get the general concept down and then I implement.

#14 • Jun 23, 2010

Oh. lol. Well I don't have Python on my desktop so I can't really try that now either.

#15 • Jun 23, 2010

Okay, redyugi, I see how the .strip("\n") works now, but is there a way to change it? Because it reads each line individually, but, say I had this:

Code: [Select]

AAAAA
TTTTT
GGGG
CCCCC

If the given length of the k-mer is 4, then it'd just read:
AAAA
AAAA
TTTT
TTTT

because it doesn't connect the end of a line to the beginning of the next.
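One way to get k-mers that span line breaks is to strip each line and join everything into a single string before sliding the window. A minimal sketch using the four lines from the example above (this assumes the whole sequence fits in memory):

```python
lines = ["AAAAA\n", "TTTTT\n", "GGGG\n", "CCCCC\n"]

# Drop the newline (and any stray spaces) from each line, then glue the lines together.
sequence = "".join(line.strip() for line in lines)

k = 4
kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(kmers[:5])  # ['AAAA', 'AAAA', 'AAAT', 'AATT', 'ATTT']
```

The third k-mer, AAAT, is one that crosses from the first line into the second.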

#16 • Jun 23, 2010

Well I suppose you could do this. I don't think this is very efficient but...

Code: [Select]

# Opens the file
sequence_to_be_read = open("whatever.txt", "r")
# Entire file in one long string, taking out the "\n" and replacing it with " "
length_of_file = sequence_to_be_read.read().replace("\n", " ")
# Insert the rest of the code here
# When you need to find the length, just use this 
# len(length_of_file.replace(" ", ""))
# Insert this on last line of the loop
# Splits the string into an array "AAAA TTTT" --> ["AAAA", "TTTT"]
temp = length_of_file.split()
# Delete the line you just read
del temp[0]
# Change it back to string ["TTTT", "GGGG"] --> "TTTT GGGG"
length_of_file = " ".join(temp)

#17 • Jun 23, 2010

Actually, I got everything working!

#python at irc.freenode.net has a lot of smart people P: For now, I'm done with my work until my mentor answers my questions. Most of the actual files that I'll have to work with contain some weird stuff with the name of the file at the beginning (like ">chr1.fa"), but I'm happy that the program prunes out all the repetitions lower than a certain number, so those are just thrown out!

EDIT: I think it is finished! Time to run it through a whole, actual chromosome sequence.

EDIT: Memory error.

I should have seen this coming.
http://paste.pocoo.org/show/228995/

That's what I had, but somebody told me to use streaming (generators specifically) for large files. Help? P:
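The linked script isn't reproduced here, so this is not that code, just a rough sketch of what the generator suggestion could look like. The names stream_kmers and count_repeats are made up, header lines are assumed to start with ">" (as in ">chr1.fa"), and io.StringIO stands in for a real file:

```python
import io


def stream_kmers(handle, k):
    """Yield overlapping k-mers one at a time, reading the file line by line."""
    window = ""
    for line in handle:
        if line.startswith(">"):   # assumed header line; start a fresh window
            window = ""
            continue
        for base in line.strip():  # strip() drops the newline and trailing spaces
            window = (window + base)[-k:]  # keep only the last k characters
            if len(window) == k:
                yield window


def count_repeats(handle, k, minimum=2):
    """Count streamed k-mers and keep those seen at least `minimum` times."""
    counts = {}
    for kmer in stream_kmers(handle, k):
        counts[kmer] = counts.get(kmer, 0) + 1
    return {kmer: n for kmer, n in counts.items() if n >= minimum}


sample = io.StringIO(">chr_demo\nATGAT\nGATC\n")
repeats = count_repeats(sample, 3)
print(repeats)
```

Streaming only avoids holding the whole file in memory at once; the dictionary itself still grows with the number of distinct k-mers, and the character-by-character loop favours clarity over speed.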

EDIT: This is what I've got now (link), but the script finishes in a split second and nothing gets written to the dictionary.

#18 • Jun 24, 2010

Bump. ._.
http://paste.pocoo.org/show/229358/

#19 • Jun 24, 2010

Bamp with update. Like I said, I had a memory error earlier. So, I decided to fix it by partitioning the original file into 10 parts.
http://paste.pocoo.org/show/229537/

Yay! I hope it works. I'm trying to run it through a chromosome right now.

#20 • Jun 24, 2010

copy a part of the code

ctrl+f

paste

Search

?

done

#21 • Jun 24, 2010

Oh, btw, I found out that this works. Problem is that it works at a horrible speed.

#22 • Jul 7, 2010

Bump: ._.
http://paste.pocoo.org/show/234631/