This notebook demonstrates the interaction of ReproPhylo and of pickled ReproPhylo Project
files with Git.
Start a Project
pj = Project('git_demo_files/loci_edited.csv', pickle='git_demo_files/git_demo')
Read data
pj.read_embl_genbank([genbank])
Do alignment
pj.extract_by_locus()
mafft = AlnConf(pj)
pj.align([mafft])
Show last Git action (which was to commit the pickle with the alignment)
pj.last_git_log()
Project version
Show Git commits
pj.show_commits()
Revert the Project pickle to undo the last action, or several actions
Using a hash (commit id) taken from the output of pj.show_commits()
pj = revert_pickle(pj, '22c27d5a25710ec78')
The newer version is not lost; you can toggle back to it the same way
Do another alignment without changing its name from the default, by mistake
new_mafft = AlnConf(pj, cline_args=dict(localpair=True, maxiterate=1000))
pj.align([new_mafft])
Realize your mistake, back up the new alignment and fix its name
new_aln_ob = pj.alignments['mafftDefault']
new_used_method = pj.used_methods['mafftDefault']
new_used_method.method_name = 'SomeNewName'
Revert to get a Project with the original alignment, named 'mafftDefault'
pj = revert_pickle(pj, 'some hash')
Add the new alignment alongside the old one, in the reverted Project
pj.alignments['SomeNewName'] = new_aln_ob
pj.used_methods['SomeNewName'] = new_used_method
Now you have a Project with both 'mafftDefault' and 'SomeNewName', in both pj.alignments and pj.used_methods
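The back up, revert and re-add pattern above can be tried without ReproPhylo. The following is only a minimal sketch: a plain dictionary stands in for the Project, a list of deep copies stands in for the Git commits of the pickle, and the names project, history, commit and revert are made up for this illustration. They are not part of the ReproPhylo API.

```python
import copy

# Stand-in for a Project: two dictionaries keyed by method name.
# This is NOT the ReproPhylo Project class, only a model of its two dicts.
project = {"alignments": {}, "used_methods": {}}
history = []  # stand-in for the Git commits of the pickle


def commit(state):
    """Store a deep copy of the state and return its index (our 'hash')."""
    history.append(copy.deepcopy(state))
    return len(history) - 1


def revert(commit_id):
    """Return a fresh copy of the state stored under commit_id."""
    return copy.deepcopy(history[commit_id])


# First alignment, stored under the default name, then committed.
project["alignments"]["mafftDefault"] = "alignment made with defaults"
project["used_methods"]["mafftDefault"] = {"method_name": "mafftDefault", "args": {}}
first = commit(project)

# Second alignment, by mistake under the same name: the first is overwritten.
project["alignments"]["mafftDefault"] = "alignment made with localpair"
project["used_methods"]["mafftDefault"] = {
    "method_name": "mafftDefault",
    "args": {"localpair": True, "maxiterate": 1000},
}
commit(project)

# Back up the new results and rename the method.
new_aln = project["alignments"]["mafftDefault"]
new_method = project["used_methods"]["mafftDefault"]
new_method["method_name"] = "SomeNewName"

# Revert to the first commit, then add the backed-up results under the new key.
project = revert(first)
project["alignments"]["SomeNewName"] = new_aln
project["used_methods"]["SomeNewName"] = new_method
```

The backed-up objects survive the revert because they are ordinary Python references held outside the reverted state, which is the same reason new_aln_ob and new_used_method remain available after revert_pickle.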
The first step here is loading ReproPhylo:
from reprophylo import *
This demo uses Tetillidae GenBank records stored in Tetillidae.gb and the MT-CO1 locus described in loci_edited.csv. To find out more about them, see the Tetillidae use case.
genbank = 'git_demo_files/Tetillidae.gb'
loci_file = 'git_demo_files/loci_edited.csv'
The next step initiates a Project instance configured for a CO1 partition, as instructed by the loci_edited.csv file. In addition, the Project will be saved as a pickle in the binary file git_demo_files/git_demo.
pj = Project('git_demo_files/loci_edited.csv', pickle='git_demo_files/git_demo')
The messages include a credit to the programmers on whose code the Git code in ReproPhylo is based, the path to the .git directory containing the newly created Git repository, and the name of the new repository. ReproPhylo uses the name to confirm it matches the pickle file, to prevent mistakes if files are moved around. However, this test will not break if the pickle file is renamed.
Since the Project is set up, we can read the data from the Tetillidae.gb file. The file contains several genes but we read only CO1 CDSs. More on this step in the Tetillidae use case.
pj.read_embl_genbank([genbank])
When data is read, the file is committed to the Git repository. To confirm this, it is possible to print the Git log file with print(pj.git_log) or to print the last log entry:
pj.last_git_log()
To keep going, we need to break the sequence pile according to loci. Again, this is explained in detail in other use cases:
pj.extract_by_locus()
To illustrate version control, this demo focuses on the sequence alignment step, but the complete pipeline is also described in other use cases.
mafft = AlnConf(pj)
The command above created an AlnConf object. We can print it as a string to see the details of the analysis:
print str(mafft)
pj.used_methods
Since this AlnConf object has not been executed yet, its string representation lacks some of the information, such as the environment it was run in. Also, the used_methods dictionary is empty, because this AlnConf has not been used yet. The next command will make use of this AlnConf object by passing it to the align method:
pj.align([mafft])
Several things happened here. First, an aligned CO1 dataset was placed in the alignments dictionary:
pj.alignments
Second, the pickle of the Project was updated with the change and committed to the repository:
print pj.last_git_log()
Third, the AlnConf object was placed in the used_methods dictionary,
pj.used_methods.keys()
and the string representation of this AlnConf object now includes execution, program and reference information:
print pj.used_methods['mafftDefault']
The string representation of the AlnConf object now also includes a skeleton of a Methods section sentence which can be copied into a manuscript and edited. This complete string representation will also appear in the final HTML report that ReproPhylo will produce.
Now let's do something stupid: we will make a new AlnConf object with different run parameters, but without changing the name of the AlnConf object, thus overwriting the previous one. For this alignment step, this is not the end of the world, since it is very quick. However, this will work the same for long analyses, such as tree reconstruction, or when there is a lot of data.
new_mafft = AlnConf(pj, cline_args=dict(localpair=True, maxiterate=1000))
pj.align([new_mafft])
Now, checking the used_methods dictionary, we realize the gravity of our mistake: the new AlnConf is stored under the same key as the old one, which is now gone from both the used_methods and the alignments dictionaries:
print 'Alignments:'
print pj.alignments
print
print 'Used Methods:'
print pj.used_methods
Checking the string representation of the AlnConf object, which has the same name as the old one, will confirm it shows the new command line, rather than the old one:
print pj.used_methods['mafftDefault']
Since ReproPhylo maintains a Git repository, it is possible to recover from this blunder. We can spot an old version that contains the original alignment step and revert to it. The older versions can be listed with pj.show_commits(), as below, with the newest at the top. The versions, termed 'commits', have hash identifiers, listed at the top of each version's record. The top version is the current one, and the one to revert to is just below it, as indicated by the AlnConf descriptions in each of them:
pj.show_commits()
The hash identifier of a commit is required in order to revert to it. We will revert to the second newest version, with the hash that begins with 22c27d5a25710ec78. This is done as follows:
pj = revert_pickle(pj, '22c27d5a25710ec78')
Git has raised no messages, which is a good thing. The Git repository is recognized and will continue to be maintained. Note that only the pickle was reverted; the rest of the files, such as scripts, notebooks and sequence files, were not. Also note that if, instead of the line above, we run revert_pickle(pj, '22c27d5a25710ec78') without assigning its result to pj, the pickle file is still reverted, but it is not loaded as a Project. It is still possible to load it with pj=unpickle_pj('git_demo_files/git_demo')
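The difference between reverting the file and reloading the variable can be shown with the standard pickle module alone. This is a minimal sketch under stated assumptions: revert_and_load is a hypothetical helper that only mimics the idea of restoring an older file and returning the loaded object. It is not how revert_pickle is implemented.

```python
import os
import pickle
import tempfile


def revert_and_load(path, old_bytes):
    """Hypothetical helper: restore old file content, return the loaded object."""
    with open(path, "wb") as handle:
        handle.write(old_bytes)
    with open(path, "rb") as handle:
        return pickle.load(handle)


path = os.path.join(tempfile.mkdtemp(), "demo_pickle")

# Version 1 of the object, as it would be stored in an older commit.
old_bytes = pickle.dumps({"aln": "defaults"})

# Version 2 is what is currently on disk and in memory.
state = {"aln": "localpair"}
with open(path, "wb") as handle:
    pickle.dump(state, handle)

# Calling without assignment reverts the file but leaves `state` stale.
revert_and_load(path, old_bytes)
stale = state["aln"]

# Assigning the return value brings the variable in line with the file.
state = revert_and_load(path, old_bytes)
fresh = state["aln"]
```

After the first call the file on disk already holds the old version, but the variable still points at the newer object, so the assignment is what makes the reverted state visible in the session.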
Now we can confirm the state of our reverted sequence alignment by printing the string representation of the used AlnConf object again, and see that the command line has changed back to its original form, the MAFFT defaults. The 'Short version' at the top of this page also shows how to produce a Project with both alignments coexisting.
print pj.used_methods.keys()
print pj.used_methods['mafftDefault']
If you are not using the Docker ReproPhylo distribution, and you are new to Git, you might get the following error when you start a new Project with pj=Project('loci_file',pickle='pickle_filename'):
RuntimeError: Git: set your email with '!git config --global user.email "your_email@example.com"' or disable git (the ! is needed in IPython Notebook. In a terminal, ommit it)
This is because git expects your email to be configured. To configure it, run the following in a terminal:
git config --global user.email "your_email@example.com"
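To check from Python whether the email is already configured, something like the following sketch may help. It uses only the standard library and the documented git config --get option; the function name git_user_email is made up for this example, and the function returns None both when the key is unset and when Git is not installed.

```python
import subprocess


def git_user_email():
    """Return the globally configured Git email, or None if unset or Git is missing."""
    try:
        out = subprocess.check_output(
            ["git", "config", "--global", "--get", "user.email"]
        )
    except (OSError, subprocess.CalledProcessError):
        # OSError: the git executable was not found.
        # CalledProcessError: git exited with an error, e.g. the key is not set.
        return None
    email = out.decode("utf-8").strip()
    return email or None


email = git_user_email()
```

If email is None, run the git config command above before starting a new Project.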
Another possible error when you start a new Project with pj=Project('loci_file',pickle='pickle_filename'), as opposed to loading one with unpickle_pj or with revert_pickle, can arise because Project expects pickle to be a file name that does not yet exist. Otherwise, the following error will be raised,
IOError: Pickle git_demo_files/git_demo exists. If you want to keep using it do pj=unpickle_pj('git_demo_files/git_demo') instead.
to protect you from unintentionally deleting existing projects.
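The idea behind this safeguard, refusing to write over a file that already exists, can be sketched with the standard library. The function start_new_pickle and its error text are assumptions made for this illustration and do not reproduce ReproPhylo's own code.

```python
import os
import pickle
import tempfile


def start_new_pickle(path, obj):
    """Write obj to path, refusing to overwrite an existing file."""
    if os.path.exists(path):
        raise IOError("Pickle %s exists, load it instead of starting anew." % path)
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


path = os.path.join(tempfile.mkdtemp(), "demo_pickle")
start_new_pickle(path, {"loci": ["MT-CO1"]})  # first call succeeds

try:
    start_new_pickle(path, {"loci": []})  # second call must be refused
    refused = False
except IOError:
    refused = True

# The content written by the first call is still intact.
with open(path, "rb") as handle:
    kept = pickle.load(handle)
```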
ReproPhylo also tries to make sure that an unpickled, reverted or new Project can identify its unique Git repository. This connection can be broken if a Git repository that does not belong to the current Project already existed in the working directory, or if the pickle file was moved independently of the directory in which it is found. The Git repository is found in a hidden directory called .git. To view hidden files and folders in your file browser, press Ctrl+H. If you want to move the Project to another location, the folder containing both the .git directory and the pickle file must be moved as one unit. Should the connection between a Project and its Git repository be broken, the following error will be shown:
RuntimeError: The Git repository in the CWD does not belong to this project. Either the pickle moved, or this is a preexsisting repo. Try one of the following: Delete the local .Git dir if you don't need it, move the pickle and the notebook to a new work dir, or if possible, move them back to their original location. You may also disable Git by with stop_git().
Note that even if the link between a repository and a Project was broken, the pickle file still contains the full Project and is fully usable, by passing git=False.
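Moving the Project as one unit, as described above, can be done in a file browser or with shutil.move. The sketch below builds a throwaway directory layout to show the idea. The layout (a working directory holding .git next to a git_demo_files folder with the pickle) and the directory names work_dir and new_location are assumptions for illustration only.

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()

# Assumed layout: the working directory holds the .git directory
# and a subfolder containing the pickle file.
work_dir = os.path.join(base, "work_dir")
os.makedirs(os.path.join(work_dir, ".git"))            # stand-in for the repository
os.makedirs(os.path.join(work_dir, "git_demo_files"))
open(os.path.join(work_dir, "git_demo_files", "git_demo"), "wb").close()  # stand-in pickle

# Move the whole folder, so the repository and the pickle travel together.
new_dir = os.path.join(base, "new_location")
shutil.move(work_dir, new_dir)

repo_moved = os.path.isdir(os.path.join(new_dir, ".git"))
pickle_moved = os.path.isfile(os.path.join(new_dir, "git_demo_files", "git_demo"))
```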