Commit 2a2a4d4

Merge pull request #52 from khoivan88/patch-2
Suggestion for file_parsing documents
2 parents bc78d1c + e354fd3 commit 2a2a4d4

1 file changed

+71
-40
lines changed

_episodes/02-file_parsing.md

Lines changed: 71 additions & 40 deletions
@@ -50,7 +50,7 @@ or similar to this if you are on Windows
5050
~~~
5151
{: .output}
5252

53-
Notice that the file paths are different for these two systems. The Windows system uses a forward slash ('\\'), while Mac and Linux use a backslash ('/') for filepaths.
53+
Notice that the file paths are different for these two systems. The Windows system uses a backslash ('\\'), while Mac and Linux use a forward slash ('/') for filepaths.
5454

5555
When we write a script, we want it to be usable on any operating system, thus we will use a python module called `os.path` that will allow us to define file paths in a general way.
5656

@@ -69,7 +69,7 @@ data/outfiles/ethanol.out
6969
~~~
7070
{:. .output}
7171

72-
Here, we have specified that our filepath contains the 'data' and 'outfiles' directory, and the `os.path` module has made this into a filepath that is usable by our system. If you are on Windows, you will instead see that a forward slash is used.
72+
Here, we have specified that our filepath contains the 'data' and 'outfiles' directories, and the `os.path` module has made this into a filepath that is usable by our system. If you are on Windows, you will instead see that a backslash is used.
7373

7474
> ## Absolute and relative paths
7575
> File paths can be *absolute*, or *relative*.
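The separator behavior described in the two hunks above can be checked on any operating system. This is a minimal sketch, assuming the standard-library modules `ntpath` and `posixpath` (the Windows and Mac/Linux implementations behind `os.path`); the `/home/user/` prefix is a hypothetical location used only to illustrate an absolute path.

```python
import ntpath      # implementation os.path uses on Windows
import posixpath   # implementation os.path uses on Mac and Linux

parts = ("data", "outfiles", "ethanol.out")

# Join the same components with each platform's rules.
windows_path = ntpath.join(*parts)
posix_path = posixpath.join(*parts)

print(windows_path)  # data\outfiles\ethanol.out
print(posix_path)    # data/outfiles/ethanol.out

# A relative path does not start at the filesystem root; an absolute one does.
# "/home/user/" is a made-up prefix for illustration.
print(posixpath.isabs(posix_path))                   # False
print(posixpath.isabs("/home/user/" + posix_path))   # True
```

In a real script you would call `os.path.join` directly, which picks the right implementation for the system the script runs on.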
@@ -108,6 +108,18 @@ outfile.close()
108108
~~~
109109
{: .language-python}
110110

111+
> ## An alternative way to open a file.
112+
> Alternatively, you can open a file using a *context manager*, which automatically closes the file for you. To use one, write the `with` keyword and put everything that should happen while the file is open in an indented block.
113+
> ~~~
114+
> with open(ethanol_file,"r") as outfile:
115+
>     data = outfile.readlines()
116+
> ~~~
117+
> {: .language-python}
118+
>
119+
> This is often the preferred way to deal with files because you do not have to remember to close the file.
120+
{: .callout}
121+
122+
111123
> ## Check Your Understanding
112124
> Check that your file was read in correctly by determining how many lines are in the file.
113125
>> ## Answer
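The claim in the callout added by this hunk, that the context manager closes the file for you, can be verified directly. The sketch below is an illustration only: it writes a small temporary file as a hypothetical stand-in for `ethanol.out` so that it runs without the lesson data.

```python
import os
import tempfile

# Hypothetical three-line stand-in for ethanol.out.
with tempfile.TemporaryDirectory() as tmpdir:
    ethanol_file = os.path.join(tmpdir, "ethanol.out")
    with open(ethanol_file, "w") as outfile:
        outfile.write("line one\nline two\nline three\n")

    # Read the file back; no explicit outfile.close() is needed.
    with open(ethanol_file, "r") as outfile:
        data = outfile.readlines()

# Outside the `with` block the file object is already closed.
print(outfile.closed)  # True
print(len(data))       # 3
```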
@@ -129,7 +141,7 @@ Let's take a look at what's in the file.
129141
130142
~~~
131143
for line in data:
132-
print(line)
144+
    print(line)
133145
~~~
134146
{: .language-python}
135147
@@ -196,12 +208,24 @@ print(words)
196208
197209
From this `print` statement, we now see that we have a list called words, where we have split `energy_line`. The energy is actually the fourth element of this list, so we can now save it as a new variable.
198210
199-
```
211+
```python
200212
energy = words[3]
201213
print(energy)
202214
```
203215
{: .language-python}
204216
217+
> ## Python negative indexing
218+
> The energy is also the last element of the list, so an alternative way to assign `energy` is:
219+
> ```python
220+
> energy = words[-1]
221+
> print(energy)
222+
> ```
223+
>
224+
> In the example above, the index `-1` gives the last element of the list, `-2` gives the second-to-last element, and so on. An excellent tutorial on accessing Python list elements by index can be found [here](https://realpython.com/python-lists-tuples/#list-elements-can-be-accessed-by-index).
225+
{: .callout}
226+
227+
228+
205229
```
206230
-154.09130176573018
207231
```
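The two indexing styles discussed in this hunk select the same element when the list has four items. A minimal sketch, assuming the energy line has the shape shown in the string below (the text of the line is an assumption; only the numeric value appears in the output above):

```python
# Assumed shape of the energy line; the real one is read from ethanol.out.
energy_line = "  @DF-RHF Final Energy:  -154.09130176573018\n"

words = energy_line.split()
print(words)

# In a four-element list, index 3 and index -1 are the same element.
print(words[3] == words[-1])  # True

# split() returns strings, so convert before doing arithmetic.
energy = float(words[-1])
print(energy)
```

Negative indexing is the more robust choice when the number of words before the value might vary, as long as the value is always last on the line.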
@@ -237,48 +261,48 @@ energy = float(words[3])
237261
>## Exercise on File Parsing (should we move this to the end?)
238262
Use the provided sapt.out file. In this output file, the program calculates the interaction energy for an ethene-ethyne complex. The output reports four interaction energy components: electrostatics, induction, exchange, and dispersion. Parse each of these energies, in kcal/mole, from the output file. (Hint: study the file in a text editor to help you decide what to search for.) Calculate the total interaction energy by adding the four components together. Your code's output should look something like this:
239263
> ~~~
240-
> Electrostatics : -2.25850118 kcal/mole
241-
> Exchange : 2.27730198 kcal/mole
242-
> Induction : -0.5216933 kcal/mole
243-
> Dispersion : -0.9446677 kcal/mole
244-
> Total Energy : 1.4475602000000003 kcal/mole
264+
> Electrostatics : -2.25850118 kcal/mol
265+
> Exchange : 2.27730198 kcal/mol
266+
> Induction : -0.5216933 kcal/mol
267+
> Dispersion : -0.9446677 kcal/mol
268+
> Total Energy : -1.4475602000000003 kcal/mol
245269
> ~~~
246270
> {: language.python}
247271
>
248272
> > ## Solution
249273
>>
250274
>> This is one possible solution for the SAPT parsing exercise
251275
>> ~~~
252-
>> saptout = open('SAPT.out','r')
253-
>> saptlines = saptout.readlines()
254-
>> important_lines=[]
255-
>> energies=[]
256-
>> for line in saptlines:
257-
>>     if 'Electrostatics ' in line:
258-
>>         electro_line = line
259-
>>         important_lines.append(electro_line)
260-
>>     if 'Exchange ' in line:
261-
>>         exchange_line = line
262-
>>         important_lines.append(exchange_line)
263-
>>     if 'Induction ' in line:
264-
>>         induction_line = line
265-
>>         important_lines.append(induction_line)
266-
>>     if 'Dispersion ' in line:
267-
>>         dispersion_line = line
268-
>>         important_lines.append(dispersion_line)
269-
>>
270-
>> #print(important_lines)
271-
>>
276+
>> important_lines = []
277+
>> energies = []
278+
>>
279+
>> with open('SAPT.out','r') as saptout:
280+
>>     for line in saptout:
281+
>>         if 'Electrostatics ' in line:
282+
>>             electro_line = line
283+
>>             important_lines.append(electro_line)
284+
>>         if 'Exchange ' in line:
285+
>>             exchange_line = line
286+
>>             important_lines.append(exchange_line)
287+
>>         if 'Induction ' in line:
288+
>>             induction_line = line
289+
>>             important_lines.append(induction_line)
290+
>>         if 'Dispersion ' in line:
291+
>>             dispersion_line = line
292+
>>             important_lines.append(dispersion_line)
293+
>>
294+
>> # print(important_lines)
295+
>>
272296
>> for line in important_lines:
273-
>>     words = line.split()
274-
>>     #print(words)
275-
>>     energy_type = words[0]
276-
>>     energy_kcal = float(words[3])
277-
>>     energies.append(energy_kcal)
278-
>>     print(energy_type, ':', energy_kcal, 'kcal/mole')
279-
>>
280-
>> total_energy=energies[0]+energies[1]+energies[2]+energies[3]
281-
>> print('Total Energy', ':', total_energy, 'kcal/mole')
297+
>>     words = line.split()
298+
>>     # print(words)
299+
>>     energy_type = words[0]
300+
>>     energy_kcal = float(words[3])
301+
>>     energies.append(energy_kcal)
302+
>>     print('{} : {} kcal/mol'.format(energy_type, energy_kcal))
303+
>>
304+
>> total_energy = sum(energies)
305+
>> print('Total Energy : {} kcal/mol'.format(total_energy))
282306
>> ~~~
283307
>> {: .language-python}
284308
> {: .solution}
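The parsing logic in the solution can be exercised without the SAPT output file. The sketch below is a compact variant, not the lesson's solution: `sapt_lines` is a hypothetical excerpt whose kcal/mol values are taken from the expected output in this hunk, while the other columns are `0.0` placeholders rather than real SAPT results. Note that adding the four components gives a negative total, about -1.4476 kcal/mol.

```python
import math

# Hypothetical excerpt: only the fourth column (kcal/mol) holds values
# from the exercise; the 0.0 entries are placeholders.
sapt_lines = [
    "    Electrostatics    0.0 [mEh]   -2.25850118 [kcal/mol]   0.0 [kJ/mol]\n",
    "    Exchange          0.0 [mEh]    2.27730198 [kcal/mol]   0.0 [kJ/mol]\n",
    "    Induction         0.0 [mEh]   -0.5216933 [kcal/mol]    0.0 [kJ/mol]\n",
    "    Dispersion        0.0 [mEh]   -0.9446677 [kcal/mol]    0.0 [kJ/mol]\n",
]

# Map each component name to its energy in kcal/mol.
energies = {}
for line in sapt_lines:
    words = line.split()
    energies[words[0]] = float(words[3])

total_energy = sum(energies.values())

for name, value in energies.items():
    print('{} : {} kcal/mol'.format(name, value))
print('Total Energy : {} kcal/mol'.format(total_energy))

# The attractive terms outweigh exchange, so the total is negative.
print(math.isclose(total_energy, -1.4475602))  # True
```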
@@ -300,6 +324,13 @@ for linenum, line in enumerate(list_name):
300324
301325
In this notation, there are now *two* variables you can use in your loop commands, `linenum` (which can be named something else) will keep up with what iteration you are on in the loop, in this case what line you are on in the file. The variable `line` (which could be named something else) functions exactly as it did before, holding the actual information from the list. Finally, instead of just giving the list name you use `enumerate(list_name)`.
302326
327+
> ## `enumerate` with a starting index other than 0
328+
> `enumerate(list_name)` starts counting at 0, so the first line is labeled 0. To change this behavior, pass the `start` argument to `enumerate`. For example, to start counting at 1:
329+
> ```python
330+
> for linenum, line in enumerate(data, start=1):
331+
>     # do something with 'linenum' and 'line'
> ```
332+
{: .callout}
333+
303334
This block of code searches our file for the line that contains "Center" and reports the line number.
304335
```
305336
for linenum, line in enumerate(data):
@@ -313,7 +344,7 @@ for linenum, line in enumerate(data):
313344
Center X Y Z Mass
314345
```
315346
{: .output}
316-
Now we know that this is line 77 in our file (remember that you start counting at zero!).
347+
Now we know that this is line 77 in our file (remember that you start counting at zero!).
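The difference between zero-based and one-based line numbers can be seen on a small list. This is a sketch only: `data` below is a hypothetical three-line stand-in for the list returned by `readlines()`, so the reported numbers differ from the 77 found in the real file.

```python
# Hypothetical stand-in for the lines of the output file.
data = [
    "Geometry (in Angstrom)\n",
    "\n",
    "    Center     X     Y     Z     Mass\n",
]

# Default: counting starts at 0, matching list indexes.
for linenum, line in enumerate(data):
    if "Center" in line:
        zero_based = linenum

# start=1: counting matches what a text editor shows.
for linenum, line in enumerate(data, start=1):
    if "Center" in line:
        one_based = linenum

print(zero_based)  # 2
print(one_based)   # 3

# Only the zero-based number can be used directly as a list index.
print(data[zero_based] == data[one_based - 1])  # True
```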
317348
318349
>## Check Your Understanding
319350
>What would be printed if you entered the following:
@@ -345,6 +376,6 @@ Now we know that this is line 77 in our file (remember that you start counting a
345376
{: .challenge}
346377
347378
## A final note about regular expressions
348-
Sometimes you will need to match something more complex than just a particular word or phrase in your output file. Sometimes you will need to match a particular word, but only if it is found at the beginning of a line. Or perhaps you will need to match a particular pattern of data, like a capital letter followed by a number, but you won't know the exact letter and number you are looking for. These types of matching situations are handled with something called *regular expressions* which is accessed through the python module `re`. While using regular expressions is outside the scope of this tutorial, they are very useful and you might want to learn more about them in the future. A tutorial can be found at _______.
379+
Sometimes you will need to match something more complex than just a particular word or phrase in your output file. Sometimes you will need to match a particular word, but only if it is found at the beginning of a line. Or perhaps you will need to match a particular pattern of data, like a capital letter followed by a number, but you won't know the exact letter and number you are looking for. These types of matching situations are handled with something called *regular expressions*, which are accessed through the Python module `re`. While using regular expressions (regex) is outside the scope of this tutorial, they are very useful and you might want to learn more about them in the future. A tutorial can be found in the book [Automate the Boring Stuff with Python](https://automatetheboringstuff.com/2e/chapter7/). A good site for testing regular expressions is [regex101](https://regex101.com/).
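The two matching situations named in this paragraph can be sketched with the standard `re` module. The sample `lines` are hypothetical and exist only to show the patterns; they are not taken from the lesson's output files.

```python
import re

# Hypothetical sample lines.
lines = [
    "Center of mass\n",
    "    Center  X  Y  Z\n",
    "Atoms: C1 C2 O3 H4\n",
]

# re.match only succeeds at the very start of the string, so the
# indented "Center" on the second line is not matched.
starts_with_center = [line for line in lines if re.match(r"Center", line)]
print(starts_with_center)

# [A-Z] is any capital letter and \d is any digit, so this finds
# labels without knowing the exact letter or number in advance.
labels = re.findall(r"[A-Z]\d", lines[2])
print(labels)  # ['C1', 'C2', 'O3', 'H4']
```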
349380
350381
{% include links.md %}
