Feat/sp24 updates #11

Merged
merged 5 commits into from
Feb 26, 2024
45 changes: 27 additions & 18 deletions README.md
Expand Up @@ -142,7 +142,7 @@ automatically by gradle or our IDE.
The focus of this project is to learn about data structures, while working effectively in a group.
In addition, given the small project scope, and the fixed set of requirements that are already
defined (and will not need to be elicited with the use of a Product Owner), the team can
customize the Scrum process learned in CS-HU 271 and focus exclusively on:
customize the Scrum process learned in CS-HU 208 and focus exclusively on:
- creating tasks
- linking commits to task IDs (e.g., `Implements task #123`)
- Test-Driven Development and unit testing. The [starter code](#starter-code) already contains a few [sample unit tests](src/test/java/cs321) that can be [run from the command line](#compile-and-run-the-project-from-the-command-line).
Expand Down Expand Up @@ -317,7 +317,7 @@ length `k`*. The search returns the frequency of occurrence of the query string
zero if it is not found).

We will also create a SQL database (for a specific length `k`) of subsequences and their
frequency to aid in searching. This can be created from the BTree.
frequency to aid in searching. This can be created from the BTree dump files.


## 4. Design Issues
Expand Down Expand Up @@ -366,7 +366,7 @@ We will create three programs:
- second for **searching in the specified BTree** for subsequences of given length. The search program
assumes that the user specified the proper BTree to use depending upon the query length.
- third for **searching in the SQL database** for subsequences of specified length. This database
would be created as a by-product of the first program.
would be created by loading btree dump files using the SQLite interface outside our program.

The main Java classes should be named `GeneBankCreateBTree`, `GeneBankSearchBTree`, and
`GeneBankSearchDatabase`.
Expand Down Expand Up @@ -413,8 +413,8 @@ have the same length as the DNA subsequences in the B-Tree file. The DNA strings
- `[<cache-size>]` is an integer between `100` and `10000` (inclusive) that represents the
maximum number of `BTreeNode` objects that can be stored in memory

- `<SQLite-database-path>` the path to the SQL database created after BTree creation for a
specific sequence length. The name of the database file should be `xyz.k.db` where the sequence
- `<SQLite-database-path>` the path to the SQL database created by loading a dump file using
SQLite's .import command. The name of the database file should be `xyz.k.db` where the sequence
length is `<k>`, and the GeneBank file is `xyz.gbk`. The database file should have been
created by the `GeneBankCreateBTree` program from the BTree it creates
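
As a rough illustration of the `[<cache-size>]` argument described above, the sketch below bounds the number of nodes kept in memory and tracks a hit rate. It is only a sketch under stated assumptions: the class name `NodeCacheSketch`, the use of a node's disk address as the key, and the least-recently-used eviction policy are all hypothetical and not taken from the starter code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded cache keyed by a node's disk address.
// The eviction policy (least recently used) is an assumption, not a project requirement.
class NodeCacheSketch<V> {
    private final LinkedHashMap<Long, V> entries;
    private long hits = 0;
    private long lookups = 0;

    NodeCacheSketch(final int capacity) {
        // The README limits the cache size to the range 100..10000 (inclusive).
        if (capacity < 100 || capacity > 10000) {
            throw new IllegalArgumentException("cache size must be between 100 and 10000");
        }
        // accessOrder=true keeps the least recently used entry at the front.
        this.entries = new LinkedHashMap<Long, V>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, V> eldest) {
                return size() > capacity;
            }
        };
    }

    // Returns the cached node, or null on a miss; every call counts as a lookup.
    V get(long address) {
        lookups++;
        V node = entries.get(address);
        if (node != null) {
            hits++;
        }
        return node;
    }

    // Stores a node; the eldest entry is evicted once the capacity is exceeded.
    void put(long address, V node) {
        entries.put(address, node);
    }

    int size() {
        return entries.size();
    }

    // Fraction of lookups that were served from the cache.
    double hitRate() {
        return lookups == 0 ? 0.0 : (double) hits / lookups;
    }
}
```

A real implementation would also need to write modified nodes back to disk before evicting them, which this sketch leaves out.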

Expand Down Expand Up @@ -528,9 +528,7 @@ needs to be specified as well.
file.

For example, the table below shows the improvement the instructors got on their solution. Note that
your times will be different due to different hardware and differences in the implementation. Also,
we turned off the creation of the database for these timings -- creation of the database will take a
significant amount of time.
your times will be different due to different hardware and differences in the implementation.

| gbk file | degree | sequence length | cache | cache size | cache hit rate | run time |
| -------- | ------ | --------------- | ----- | ---------- | -------------- | -------- |
Expand All @@ -551,20 +549,24 @@ of memory), we were able to bring the time to create the BTree down to only 2m19

## 7. Using a Database

Design a simple database to store the results (sequences and frequencies) from the B-Tree.
We will perform an inorder tree traversal to get the information to store in the database. This
would be done at the end of creating the GeneBank BTree. Then we will create a separate search
program named `GeneBankSearchDatabase` that uses the database instead of the BTree. This is
a common pattern in real life applications, where we may crunch lots of data using a data
Using the dumpfiles from your BTree, load the data into an SQLite database using the
SQLite .import command. See the documentation
[here](https://sqlite.org/cli.html#importing_files_as_csv_or_other_formats).
Then we will create a separate search program named `GeneBankSearchDatabase`
that uses the database instead of the BTree. This is a common pattern in real life
applications, where we may crunch lots of data using a data
structure and then store the results in a database for ease of access.

Note: Since correct dumpfiles are provided in the results folder, GeneBankSearchDatabase
can be started and completed before GeneBankCreateBTree.

```bash
$ ./gradlew createJarGeneBankSearchDatabase
$ java -jar build/libs/GeneBankSearchDatabase.jar --database=<SQLite-database-path>
--queryfile=<query-file>
```

We will use the embedded SQLite database for this project. The SQLite database is fully
We will use the embedded SQLite database for searching the database. The SQLite driver is fully
contained in a jar file that gradle will automatically pull down for us. See the database
example in the section below on how to use SQLite.
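
A minimal sketch of how `GeneBankSearchDatabase` might look up a frequency through JDBC is shown below. The class name `DatabaseSearchSketch`, the table name `dna`, and the column names `sequence` and `frequency` are assumptions for illustration; they depend on how the dump file was imported. The `jdbc:sqlite:` URL prefix is the one used by the SQLite JDBC driver, which must be on the classpath at run time.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class DatabaseSearchSketch {

    // Builds the database file name xyz.k.db from xyz.gbk and k,
    // following the naming rule stated in the README.
    static String databaseFileName(String gbkFileName, int k) {
        String base = gbkFileName.endsWith(".gbk")
                ? gbkFileName.substring(0, gbkFileName.length() - ".gbk".length())
                : gbkFileName;
        return base + "." + k + ".db";
    }

    // Opens the SQLite file through JDBC (requires the SQLite JDBC driver jar).
    static Connection open(String databasePath) throws SQLException {
        return DriverManager.getConnection("jdbc:sqlite:" + databasePath);
    }

    // Returns the stored frequency for a sequence, or 0 if it is not found.
    // Table "dna" and columns "sequence"/"frequency" are hypothetical names.
    static int lookupFrequency(Connection connection, String sequence) throws SQLException {
        String sql = "SELECT frequency FROM dna WHERE sequence = ?";
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setString(1, sequence);
            try (ResultSet rows = statement.executeQuery()) {
                return rows.next() ? rows.getInt("frequency") : 0;
            }
        }
    }
}
```

Using a `PreparedStatement` keeps each query string from the query file out of the SQL text itself, so the same statement shape can be reused for every lookup.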

Expand Down Expand Up @@ -647,11 +649,18 @@ your project.
Start off by running tests on your machine. If you do need to run them on `onyx` please only
run the smallest test (`test0.gbk`) to avoid overloading the `onyx` server.

## 10. Testing in the Cloud
## 10. Extra Credit: Testing a Large File in the Cloud

Using the AWS Accounts provided earlier in the course, you can run the
large Y-Chromosome file on a cloud instance.

Creating a BTree with the large file is very time intensive. It will take too long to run
unless your cache implementation is efficient and your BTree is well-designed.

To be rewarded with the extra credit, capture a screenshot of check-queries.sh completed
and include it with your submission.

We will setup [Amazon AWS](https://aws.amazon.com/) accounts for each student so that you can run
larger tests in the cloud. **Running our tests on AWS is required so we can all get experience
using the cloud.** :cloud: :smiley:
:cloud: :smiley:

Please see the [AWS
notes](https://docs.google.com/document/d/1v5a0XlzaNyi63TXXKP4BQsPIdJt4Zkxn2lZofVP8qqw/edit?usp=sharing)
Expand Down
22 changes: 22 additions & 0 deletions src/main/java/cs321/create/GeneBankFileReaderInterface.java
@@ -0,0 +1,22 @@
package cs321.create;

import java.io.IOException;

/**
* Reads sequences of length k from each position from a GBK GeneBank file from NCBI.
* Methods are available to return sequences of specified length as a long.
*
* @author CS321 Instructors
*/
public interface GeneBankFileReaderInterface {

/**
* Gets the next sequence of a given length as a long
* See SequenceUtils for translation utilities
*
* @return the next sequence, formatted as a long
* @throws IOException in case of failed or interrupted I/O
*/
long getNextSequence() throws IOException;

}
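
One way to picture the contract of `getNextSequence()` is a sliding window over an in-memory DNA string, sketched below. This is not a `.gbk` parser. The names `SequenceSource` and `InMemorySequenceReader`, the 2-bit encoding (a=00, c=01, g=10, t=11), the skipping of windows that contain other characters, and the `-1` end-of-input value are all assumptions made for illustration; the actual encoding is defined by `SequenceUtils`.

```java
import java.io.IOException;

// Hypothetical stand-in for GeneBankFileReaderInterface so the sketch compiles on its own.
interface SequenceSource {
    long getNextSequence() throws IOException;
}

// Sliding-window reader over an in-memory DNA string (not a real .gbk parser).
class InMemorySequenceReader implements SequenceSource {
    private final String dna;
    private final int k;
    private int position = 0;

    InMemorySequenceReader(String dna, int k) {
        // With 2 bits per base, k <= 31 keeps the packed value non-negative,
        // so -1 can safely mark the end of input in this sketch.
        if (k < 1 || k > 31) {
            throw new IllegalArgumentException("k must be between 1 and 31");
        }
        this.dna = dna.toLowerCase();
        this.k = k;
    }

    // Assumed 2-bit encoding: a=00, c=01, g=10, t=11; -1 for any other character.
    static int encodeBase(char base) {
        switch (base) {
            case 'a': return 0;
            case 'c': return 1;
            case 'g': return 2;
            case 't': return 3;
            default: return -1;
        }
    }

    // Returns the next window of k bases packed into a long,
    // or -1 once no complete window remains (assumed convention).
    @Override
    public long getNextSequence() {
        while (position + k <= dna.length()) {
            long packed = 0L;
            boolean valid = true;
            for (int i = 0; i < k; i++) {
                int bits = encodeBase(dna.charAt(position + i));
                if (bits < 0) {
                    valid = false;
                    break;
                }
                packed = (packed << 2) | bits;
            }
            position++; // advance one base so every start position is visited
            if (valid) {
                return packed;
            }
        }
        return -1L;
    }
}
```

Under these assumptions, reading `"ACGT"` with `k = 2` yields the windows `ac`, `cg`, and `gt` as the values 1, 6, and 11, followed by -1.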
32 changes: 2 additions & 30 deletions src/test/java/cs321/btree/BTreeTest.java
Expand Up @@ -21,6 +21,8 @@
*
* This is to provide more complicated tests that can be modeled
* after figures in the CLRS textbook.
*
* @author CS321 Instructors
*/
public class BTreeTest {

Expand Down Expand Up @@ -48,36 +50,6 @@ public void cleanUpTests() {
deleteTestFile(testFilename);
}

// HINT:
// instead of checking all intermediate states of constructing a tree
// you can check the final state of the tree and
// assert that the constructed tree has the expected number of nodes and
// assert that some (or all) of the nodes have the expected values
@Test
public void btreeDegree4Test()
{
// //TODO instantiate and populate a bTree object
// int expectedNumberOfNodes = TBD;
//
// // it is expected that these nodes values will appear in the tree when
// // using a level traversal (i.e., root, then level 1 from left to right, then
// // level 2 from left to right, etc.)
// String[] expectedNodesContent = new String[]{
// "TBD, TBD", //root content
// "TBD", //first child of root content
// "TBD, TBD, TBD", //second child of root content
// };
//
// assertEquals(expectedNumberOfNodes, bTree.getNumberOfNodes());
// for (int indexNode = 0; indexNode < expectedNumberOfNodes; indexNode++)
// {
// // root has indexNode=0,
// // first child of root has indexNode=1,
// // second child of root has indexNode=2, and so on.
// assertEquals(expectedNodesContent[indexNode], bTree.getArrayOfNodeContentsForNodeIndex(indexNode).toString());
// }
}

/**
* Test simple creation of an empty BTree.
* An empty BTree has 1 node with no keys and height of 0.
Expand Down