Fellow Coders of RMRK:


*
A Random Custom Title
Rep:
Level 96
wah
I need your help on my project. :( I may require more help later, idk.

Basically, here's what I need done (not that I'm asking you guys to do it for me but rather give me an idea, conceptually, of what I need to do):

I have a .txt file with a certain number of characters in it. Quite a lot, around 100,000. What I need to do is choose a certain length of characters (for the purposes of explaining, I think the number 20 will be good) and find repeats within the .txt file.

****
Rep:
Level 83
Regular expressions are key here, I would think.
Build a regular expression from those 20 characters, and then keep searching through the text file with it.

*
A Random Custom Title
Rep:
Level 96
wah
But the 20 characters aren't defined. I'll be straightforward and say that I'm looking for repeats of DNA sequences. There are four possible bases (A, T, G, C). Since there are four choices for each of the 20 positions, the number of possible combinations is 4^20. >_>
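
For a sense of scale, the arithmetic can be checked directly. This is only a sketch in Python 3; the 100,000-character file size comes from the first post, and the variable names are made up for illustration.

```python
K = 20                 # k-mer length used as the example in this thread
ALPHABET_SIZE = 4      # A, T, G, C
FILE_LENGTH = 100000   # approximate size of the test file (from the first post)

# Every possible 20-letter string over four bases.
possible_kmers = ALPHABET_SIZE ** K

# Number of overlapping 20-character windows that fit in the file.
windows_in_file = FILE_LENGTH - K + 1

print(possible_kmers)
print(windows_in_file)
```

The file can contain at most as many distinct k-mers as it has windows, which is a tiny fraction of 4^20. That is why counting only the k-mers that actually occur is feasible, while enumerating every possible one is not.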

Oh, but I had an idea; I'm searching for repeats within the human genome. My mentor told me I should be using a hash array/dictionary/hash table to store the number of repeats. What do you guys think about me doing something like this (this is pseudocode):

Code: [Select]
import thang.txt
a = 0
b = 19
dictionary = {}
for b <= length_of_file:
     c = get_characters(a,b)
     if c not in dictionary
          dictionary[c] = 1
     else
          d = dictionary[c]
          d += 1
          dictionary[c] = d
     end
     a += 1
     b += 1
end
I'm working with Python, I think, so the coding's a bit Python-ey. So basically, I take twenty characters from the start and see if that sequence exists in the dictionary/hash array/hash table. If it doesn't, the script creates an entry and sets the value to 1, meaning there's at least one instance of it in the sequence. If it does, the count goes up by 1. Then both positions (a and b) are increased by 1, so the next pass checks characters 2-21 the same way.
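
The sliding-window count described above could look roughly like this in Python 3. It is a minimal sketch, not the code that was eventually used in this thread: the function name `count_kmers` is hypothetical, and it assumes the whole sequence already sits in memory as one string with no line breaks.

```python
def count_kmers(sequence, k):
    """Count every overlapping window of length k in sequence."""
    counts = {}
    # The range stops so that the last window ends exactly at the end
    # of the string; if the string is shorter than k, the loop never runs.
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        # dict.get returns 0 for a k-mer that has not been seen yet.
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts


counts = count_kmers("ATATAT", 2)
print(counts)
```

Slicing with `sequence[start:start + k]` plays the role of `get_characters(a, b)` in the pseudocode, with a single start index instead of separate `a` and `b` counters.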

Does anyone knowledgeable of Python's limitations know if such a feat is possible in Python?

****
Rep:
Level 83
I use Python. This is easily possible. And since this is pseudo code, I won't remark on small errors I see.
I don't know if that solution would work, but seeing how easy it is to try things in Python, I am sure you will get it right.

*
A Random Custom Title
Rep:
Level 96
wah
Oh, you use Python? :o Do you mind getting on the IRC and helping me code this a bit? I know basically what I want to do/try, but I don't know the exact syntax. :(

****
Rep:
Level 83
Ok, I am on for a while

*
A Random Custom Title
Rep:
Level 96
wah
Okay, I've created my code and run into a problem. I think I've diagnosed it, but I don't know how to fix it.

Code: [Select]
while (kmer_end <= SEQUENCE_LENGTH):                               # While the end of the k-mer does not exceed the file length, do loop:
    read_kmer = sequence_to_be_read.read(KMER_LENGTH)              # Get the k-mer to be read.
    if (read_kmer in repeat_sequences == False):                   # If the k-mer is not in the dictionary,
        repeat_sequences[read_kmer] = 1                            # Add it to the dictionary and set the occurrence count to 1.
    elif (read_kmer in repeat_sequences == True):                  # If the k-mer IS in the dictionary,
        repeat_sequences[read_kmer] += 1                           # Increase its count by 1.

And yeah, I've heavily commented the whole script, lol. In fact, basically every line has a comment on it. Anyways, the problem is that the k-mers aren't being written to the dictionary. I've confirmed that the program can print out all the k-mers it reads (AATGC, GTAAC, whatever). My script also correctly identifies whether or not a k-mer is actually in the dictionary: I initialized the dictionary with one k-mer that I knew occurred 3 times, and the script reported "True" for all three instances. But everything else failed. I don't know how to write it so that the dictionary stores the keys correctly.

I've tried converting the read_kmer's to string, but it doesn't work. Either that or they're already strings.

EDIT: It's the latter.
« Last Edit: June 22, 2010, 08:25:55 PM by mastermoo420 »

****
Rep:
Level 83
Try this. (This is slightly more Pythonic than your original code.)
Code: [Select]
# Opens a file with the genetic information
sequence_to_be_read = open("whateverfile.txt", "r")
# Dictionary that maps each k-mer to the number of times it has been seen
repeat_sequences = {}
# Reads the first line in the text file and takes off the newline character
read_kmer = sequence_to_be_read.readline().strip("\n")
# While not at the end of the file, do loop
while read_kmer:
    # If the genetic information isn't in your dictionary
    if read_kmer not in repeat_sequences:
        # Add it to the dictionary
        repeat_sequences[read_kmer] = 1
    # Else, if genetic information is in the dictionary
    else:
        # Add 1 to number for that genetic information
        repeat_sequences[read_kmer] += 1
    # Read next line in file and take out the newline character
    read_kmer = sequence_to_be_read.readline().strip("\n")
# Close the file once the loop is done
sequence_to_be_read.close()

*
A Random Custom Title
Rep:
Level 96
wah
Yeah, I got help earlier from elsewhere because you were gone, but they told me to get rid of the "== True" in the "in repeat_sequences == True" part, and that fixed my problem. So, so far, the script is going great! There are some things I need to work to change, though.
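
Why removing the "== True" helps: Python chains comparison operators, and `in` counts as one of them. The snippet below is a small Python 3 illustration with made-up dictionary contents, not the original script.

```python
repeat_sequences = {"AATGC": 3}
missing_kmer = "GTAAC"   # not in the dictionary yet
present_kmer = "AATGC"   # already in the dictionary

# Chained comparison: this is evaluated as
# (missing_kmer in repeat_sequences) and (repeat_sequences == False)
chained_false = missing_kmer in repeat_sequences == False

# Likewise: (present_kmer in repeat_sequences) and (repeat_sequences == True)
chained_true = present_kmer in repeat_sequences == True

# Parentheses give the intended meaning, but "not in" is the usual spelling.
parenthesized = (missing_kmer in repeat_sequences) == False
idiomatic = missing_kmer not in repeat_sequences

print(chained_false, chained_true, parenthesized, idiomatic)
```

Because a dictionary is never equal to `True` or `False`, both chained tests come out false, so neither the `if` nor the `elif` branch of the original loop ever runs and nothing gets written.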

First, in the file that I'm going to fully run the code through (well, this file is the test file, and it's only 100,000 letters), there are 50 letters per line, and there's a space after each set of 50. If the script doesn't read it properly, then it's going to appear with \n's everywhere. Wait, I just looked at your code, redyugi, and there's that .strip("\n") part. Will that change make it work for the entire .txt file? :o

Also, I need to create an option whether or not to import an already existing dictionary considering there are 23 full files I need to run the program through. Why 23? 23 chromosomes of a human. P: But this should be pretty easy.

Also, for your "while read_kmer:," wouldn't that make it so that if the length of the k-mer is 5 and you're 4 characters away from the end, it'll still try and run the code?
« Last Edit: June 22, 2010, 10:29:43 PM by mastermoo420 »

****
Rep:
Level 83
Quote
Yeah, I got help earlier from elsewhere because you were gone, but they told me to get rid of the "== True" in the "in repeat_sequences == True" part, and that fixed my problem. So, so far, the script is going great! There are some things I need to work to change, though.
Good

Quote
First, in the file that I'm going to fully run the code through (well, this file is the test file, and it's only 100,000 letters), there are 50 letters per line, and there's a space after each set of 50. If the script doesn't read it properly, then it's going to appear with \n's everywhere. Wait, I just looked at your code, redyugi, and there's that .strip("\n") part. Will that change make it work for the entire .txt file? :o
What that does is that it strips your current line of "\n". So for the blank line, it turns "\n" to "". This would end the loop I set above because "" returns a Boolean value of False.

Quote
Also, I need to create an option whether or not to import an already existing dictionary considering there are 23 full files I need to run the program through. Why 23? 23 chromosomes of a human. P: But this should be pretty easy.
Yes, this would be rather easy. Ask if you need help though.
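
One way to carry the dictionary from one chromosome file to the next is to save it to disk between runs. The sketch below uses the standard `json` module in Python 3; the helper names `load_counts` and `save_counts` and the file name `kmer_counts.json` are hypothetical, and a temporary directory is used only so the example cleans up after itself.

```python
import json
import os
import tempfile


def load_counts(path):
    """Return the saved k-mer counts, or an empty dict if nothing was saved."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {}


def save_counts(counts, path):
    """Write the k-mer counts to disk as JSON."""
    with open(path, "w") as handle:
        json.dump(counts, handle)


# Hypothetical location for the saved dictionary.
path = os.path.join(tempfile.mkdtemp(), "kmer_counts.json")

counts = load_counts(path)                     # first run: nothing saved yet
counts["AATGC"] = counts.get("AATGC", 0) + 3   # pretend a file was processed
save_counts(counts, path)

print(load_counts(path))
```

JSON works here because the keys are plain strings and the values are integers. For dictionaries as large as a whole chromosome would produce, the saved file may get very big, so treat this as a starting point rather than a tuned solution.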

Quote
Also, for your "while read_kmer:," wouldn't that make it so that if the length of the k-mer is 5 and you're 4 characters away from the end, it'll still try and run the code?
That makes it loop if the line it is on isn't the last line or empty. It does not check the length. You would have to do that yourself, however that is a simple "if" statement so...
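
A sketch of that length check, in Python 3. `io.StringIO` stands in for an open file, `KMER_LENGTH` is borrowed from the earlier script, and the sample lines are made up. It iterates over the file directly instead of using `while read_kmer:`, which is a swapped-in technique: a `for` loop stops only at the real end of the file, so a blank line in the middle does not end it early.

```python
import io

KMER_LENGTH = 5

# Stand-in for an open file: a blank line in the middle and a short last line.
handle = io.StringIO("AATGC\n\nGTAAC\nAAT\n")

kept = []
for line in handle:
    kmer = line.strip()
    # Skip blank lines and leftovers that are shorter than a full k-mer.
    if len(kmer) != KMER_LENGTH:
        continue
    kept.append(kmer)

print(kept)
print(bool(""))   # an empty string is falsy, which is what ends "while read_kmer:"
```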

*
A Random Custom Title
Rep:
Level 96
wah
Yeah, my original coding for that kinda takes care of that. I guess I'll have to add in stripping (oh-la-la) the text. Also, would len(sequence_to_be_read) return a value that displays the number of characters in the file?

****
Rep:
Level 83
I am not sure. Try it in IDLE (or whatever IDE you are using). That is what it is there for. I test out small ideas like that all the time.
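
For reference, calling `len()` on an open file object raises a `TypeError`, because file objects have no length. The Python 3 sketch below shows two alternatives. The file name `sequence.txt` and its contents are made up, and the file is written to a temporary directory only to keep the example self-contained.

```python
import os
import tempfile

# Write a small stand-in file (hypothetical name and contents).
path = os.path.join(tempfile.mkdtemp(), "sequence.txt")
with open(path, "w") as handle:
    handle.write("AATGC\nGTAAC\n")

with open(path) as sequence_to_be_read:
    try:
        len(sequence_to_be_read)
        len_works = True
    except TypeError:
        len_works = False            # file objects do not support len()
    text = sequence_to_be_read.read()

# Size on disk in bytes; this also counts the newline characters.
size_in_bytes = os.path.getsize(path)

# Number of letters only, after removing the line breaks.
base_count = len(text.replace("\n", ""))

print(len_works, size_in_bytes, base_count)
```

`os.path.getsize` avoids reading the file, while `len(text)` needs the whole file in memory, which matters for chromosome-sized inputs.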

*
A Random Custom Title
Rep:
Level 96
wah
I use unix at my internship so, lol. Yeh. But yeah, I do test out small things, too. I get the general concept down and then I implement.

****
Rep:
Level 83
Oh. lol. Well I don't have Python on my desktop so I can't really try that now either.

*
A Random Custom Title
Rep:
Level 96
wah
Okay, redyugi, I see how the .strip("\n") works now, but is there a way to change it? Because it reads each line individually, but, say I had this:
Code: [Select]
AAAAA
TTTTT
GGGG
CCCCC
If the given length of the k-mer is 4, then it'd just read:
AAAA
AAAA
TTTT
TTTT

because it doesn't connect the end of a line to the beginning of the next.

****
Rep:
Level 83
Well I suppose you could do this. I don't think this is very efficient but...
Code: [Select]
# Opens the file
sequence_to_be_read = open("whatever.txt", "r")
# Entire file in one long string, taking out the "\n" and replacing it with " "
length_of_file = sequence_to_be_read.read().replace("\n", " ")
# Insert the rest of the code here
# When you need to find the length, just use this
# len(length_of_file.replace(" ", ""))
# Insert this on last line of the loop
# Splits the string into an array "AAAA TTTT" --> ["AAAA", "TTTT"]
temp = length_of_file.split()
# Delete the line you just read
del temp[0]
# Change it back to string ["TTTT", "GGGG"] --> "TTTT GGGG"
length_of_file = " ".join(temp)
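
Another way to connect the end of one line to the start of the next is to remove all whitespace first and then slide the window over the joined string. This is a Python 3 sketch under the assumption that the file fits in memory; the function name `kmers_across_lines` is hypothetical, and the sample text is the four-line example from the question.

```python
def kmers_across_lines(text, k):
    """Yield overlapping k-mers from text, ignoring line breaks and spaces."""
    # str.split() with no arguments splits on any whitespace, so joining
    # the pieces removes newlines and trailing spaces in one step.
    sequence = "".join(text.split())
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


wrapped = "AAAAA\nTTTTT\nGGGG\nCCCCC"
result = list(kmers_across_lines(wrapped, 4))
print(result[:4])
```

With the line breaks removed, windows such as "AAAT" and "AATT" that straddle two lines show up alongside the ones that sit inside a single line.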

*
A Random Custom Title
Rep:
Level 96
wah
Actually, I got everything working! ;8 #python at irc.freenode.net has a lot of smart people P: For now, I'm done with my work until my mentor answers my questions. Most of the actual files that I'll have to work with contain some weird stuff with the name of the file at the beginning (like ">chr1.fa"), but I'm happy that the program prunes out all the repetitions lower than a certain number, so those are just thrown out! ;8

EDIT: I think it is finished! Time to run it through a whole, actual chromosome sequence.

EDIT: Memory error. ;9 I should have seen this coming.
http://paste.pocoo.org/show/228995/

That's what I had, but somebody told me to use streaming (generators specifically) for large files. Help? P:

EDIT: This is what I got link, but it runs the script in a split second and there's nothing in the dictionary that's written.
« Last Edit: June 23, 2010, 08:44:05 PM by mastermoo420 »
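
A generator-based version might look something like the following Python 3 sketch. It is not the code from either paste link, whose contents are not shown here. The names `stream_kmers` and `count_stream` are hypothetical, and it assumes a line-wrapped file where header lines start with ">" (like the ">chr1.fa" line mentioned above). Only the last k-1 letters are carried from one line to the next, so the file itself is never held in memory all at once.

```python
import io


def stream_kmers(handle, k):
    """Yield overlapping k-mers from a line-wrapped file, one line at a time."""
    tail = ""                                # last k-1 letters carried over
    for line in handle:
        if line.startswith(">"):             # header line such as ">chr1.fa"
            continue
        chunk = tail + line.strip()
        for start in range(len(chunk) - k + 1):
            yield chunk[start:start + k]
        # Keep just enough letters to build windows that span the line break.
        tail = chunk[-(k - 1):] if k > 1 else ""


def count_stream(handle, k):
    """Count the k-mers produced by stream_kmers."""
    counts = {}
    for kmer in stream_kmers(handle, k):
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts


# io.StringIO stands in for an open chromosome file.
sample = io.StringIO(">chr1.fa\nAAAAA\nTTTTT\n")
sample_counts = count_stream(sample, 4)
print(sample_counts)
```

Two caveats. A generator function does nothing until something iterates over it, so calling `stream_kmers(...)` without a loop finishes instantly and leaves the dictionary empty; that may or may not be what happened with the script above. Also, streaming only avoids loading the file: the dictionary of counts can still grow large enough on a full chromosome to run out of memory.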

*
A Random Custom Title
Rep:
Level 96
wah
Bamp with update. Like I said, I had a memory error earlier. So, I decided to fix it by partitioning the original file into 10 parts.
http://paste.pocoo.org/show/229537/

Yay! I hope it works. I'm trying to run it through a chromosome right now.

*
Rep:
Level 98
copy a part of the code

ctrl+f

paste

Search


????

done

*
A Random Custom Title
Rep:
Level 96
wah
???

Oh, btw, I found out that this works. Problem is that it works at a horrible speed. :(
