WEBVTT

1
00:00:00.090 --> 00:00:01.230
<v Instructor>In this lesson,</v>

2
00:00:01.230 --> 00:00:03.660
we will learn about reverse engineering.

3
00:00:03.660 --> 00:00:06.240
Reverse engineering breaks down software

4
00:00:06.240 --> 00:00:09.930
or hardware components to understand their structure,

5
00:00:09.930 --> 00:00:13.320
functionality, and potential vulnerabilities.

6
00:00:13.320 --> 00:00:16.980
Reverse engineering concepts include byte code,

7
00:00:16.980 --> 00:00:21.870
binary code, as well as disassembly and decompilation.

8
00:00:21.870 --> 00:00:26.070
Byte code is a low-level representation of code

9
00:00:26.070 --> 00:00:29.490
that can be executed by virtual machines.

10
00:00:29.490 --> 00:00:32.610
Binary refers to machine-level code

11
00:00:32.610 --> 00:00:35.250
that the computer directly executes

12
00:00:35.250 --> 00:00:38.070
and is made up of ones and zeros.

13
00:00:38.070 --> 00:00:42.270
Disassembly is the process of converting binary code

14
00:00:42.270 --> 00:00:47.070
into assembly language to analyze how software operates.

15
00:00:47.070 --> 00:00:51.180
Decompilation goes a step further than disassembly

16
00:00:51.180 --> 00:00:54.600
and attempts to translate executable code

17
00:00:54.600 --> 00:00:58.620
into a higher-level language for easier understanding.

18
00:00:58.620 --> 00:01:02.250
Let's learn more about byte code, binary code,

19
00:01:02.250 --> 00:01:05.850
as well as disassembly and decompilation.

20
00:01:05.850 --> 00:01:08.190
First, we have byte code.

21
00:01:08.190 --> 00:01:12.510
Byte code is a low-level representation of code

22
00:01:12.510 --> 00:01:15.450
that is executed by a virtual machine.

23
00:01:15.450 --> 00:01:18.120
It acts as an intermediate form

24
00:01:18.120 --> 00:01:20.730
between high-level programming languages

25
00:01:20.730 --> 00:01:24.510
and machine code, making it platform-independent

26
00:01:24.510 --> 00:01:28.530
and easier to execute across different systems.

27
00:01:28.530 --> 00:01:31.530
Unlike assembly language, which is specific

28
00:01:31.530 --> 00:01:35.430
to a processor architecture and directly corresponds

29
00:01:35.430 --> 00:01:38.490
to machine instructions, byte code is designed

30
00:01:38.490 --> 00:01:42.060
to be interpreted or compiled by a virtual machine.

31
00:01:42.060 --> 00:01:44.010
A quick example of byte code

32
00:01:44.010 --> 00:01:46.650
might look like what's on the screen.

33
00:01:46.650 --> 00:01:49.650
In this example, the byte code represents

34
00:01:49.650 --> 00:01:53.220
basic instructions like loading a constant

35
00:01:53.220 --> 00:01:55.560
and storing it in a variable.
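
NOTE
Editorial sketch, not part of the spoken lesson: the on-screen example is not reproduced in this transcript, so the snippet below is only an assumed stand-in. It uses CPython's standard dis module to list the byte code for a one-line assignment; the variable name greeting is hypothetical, and exact opcode names vary between Python versions.
```python
import dis
# Compile a single assignment at module level, then list the opcode names
# that the CPython virtual machine would execute for it.
code = compile('greeting = "hello"', "<example>", "exec")
ops = [ins.opname for ins in dis.get_instructions(code)]
for name in ops:
    print(name)
```
On typical CPython builds the listing includes an instruction that loads the constant and one that stores it under the name.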

36
00:01:55.560 --> 00:01:58.110
Byte code is generally more abstract

37
00:01:58.110 --> 00:02:00.930
and portable compared to assembly language,

38
00:02:00.930 --> 00:02:04.860
which would look more like what's coming onto the screen now

39
00:02:04.860 --> 00:02:07.800
when conducting a similar operation.

40
00:02:07.800 --> 00:02:12.800
So while both byte code and assembly language are low-level,

41
00:02:12.900 --> 00:02:17.340
assembly is directly tied to the CPU architecture

42
00:02:17.340 --> 00:02:21.120
and byte code remains architecture-neutral,

43
00:02:21.120 --> 00:02:24.510
relying on a virtual machine to execute it.

44
00:02:24.510 --> 00:02:27.960
In forensic analysis, byte code is important

45
00:02:27.960 --> 00:02:32.520
because it allows analysts to examine software or malware

46
00:02:32.520 --> 00:02:35.040
without needing the original source code,

47
00:02:35.040 --> 00:02:38.700
or worrying about platform-specific details.

48
00:02:38.700 --> 00:02:42.870
For example, in the case of malware written in Java,

49
00:02:42.870 --> 00:02:46.680
a forensic investigator can analyze the byte code

50
00:02:46.680 --> 00:02:50.820
to identify malicious routines or patterns.

51
00:02:50.820 --> 00:02:54.960
In forensic analysis, finding byte code for analysis

52
00:02:54.960 --> 00:02:58.920
typically involves locating compiled byte code files

53
00:02:58.920 --> 00:03:03.920
such as .class files for Java, .pyc files for Python,

54
00:03:05.550 --> 00:03:09.540
or .dll files for .NET.

55
00:03:09.540 --> 00:03:13.590
These files may be extracted from various sources

56
00:03:13.590 --> 00:03:18.590
such as disk images or memory dumps during an investigation.

57
00:03:18.870 --> 00:03:21.480
Additionally, byte code may be stored

58
00:03:21.480 --> 00:03:25.200
inside bundled archives like JAR files

59
00:03:25.200 --> 00:03:27.120
for Java applications,

60
00:03:27.120 --> 00:03:31.320
or ZIP files containing Python .pyc files.
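
NOTE
Editorial sketch, not part of the spoken lesson: one possible way to list byte code entries inside a bundled archive, assuming the archive uses the ZIP container format (JAR files do). The helper name find_bytecode_entries and the entry names in the demo archive are hypothetical.
```python
import io
import zipfile
BYTECODE_SUFFIXES = (".class", ".pyc")
def find_bytecode_entries(archive) -> list[str]:
    """Return names of entries that look like compiled byte code files."""
    # archive may be a filesystem path or a binary file-like object
    with zipfile.ZipFile(archive) as zf:
        return [name for name in zf.namelist() if name.lower().endswith(BYTECODE_SUFFIXES)]
# Build a small in-memory archive to exercise the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("com/example/Loader.class", b"\xca\xfe\xba\xbe")
    zf.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
    zf.writestr("tools/__pycache__/helper.cpython-312.pyc", b"")
buffer.seek(0)
entries = find_bytecode_entries(buffer)
print(entries)
```
Matching on file extension alone is a first pass; entries can be renamed, so checking file headers is a sensible follow-up.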

61
00:03:31.320 --> 00:03:34.320
Often, once the byte code is located,

62
00:03:34.320 --> 00:03:37.650
it can be decompiled to reveal its structure

63
00:03:37.650 --> 00:03:41.940
and functionality, allowing investigators to analyze

64
00:03:41.940 --> 00:03:45.450
the software or malware's behavior and intent.

65
00:03:45.450 --> 00:03:48.180
Second, we have binary code.

66
00:03:48.180 --> 00:03:51.810
Binary code consists of machine-level instructions

67
00:03:51.810 --> 00:03:54.390
that a computer directly executes,

68
00:03:54.390 --> 00:03:57.360
represented as ones and zeros.

69
00:03:57.360 --> 00:04:00.360
Binary analysis plays an important role

70
00:04:00.360 --> 00:04:01.800
in reverse engineering,

71
00:04:01.800 --> 00:04:05.220
especially when the source code is unavailable.

72
00:04:05.220 --> 00:04:08.880
This process allows investigators to understand

73
00:04:08.880 --> 00:04:12.690
how software, firmware, or malware operates

74
00:04:12.690 --> 00:04:15.750
by examining its raw binary form.

75
00:04:15.750 --> 00:04:19.710
In forensic analysis, binary analysis is essential

76
00:04:19.710 --> 00:04:23.730
for evaluating executable files, firmware,

77
00:04:23.730 --> 00:04:25.710
or even operating systems

78
00:04:25.710 --> 00:04:28.650
to uncover potential vulnerabilities,

79
00:04:28.650 --> 00:04:32.220
malicious behaviors, or hidden functionality.

80
00:04:32.220 --> 00:04:35.580
Several tools assist with binary analysis,

81
00:04:35.580 --> 00:04:40.477
including binwalk, hexdump, strace, and ldd.

82
00:04:41.400 --> 00:04:46.080
Binwalk inspects firmware images to identify and extract

83
00:04:46.080 --> 00:04:50.130
embedded components such as compressed archives,

84
00:04:50.130 --> 00:04:53.790
file systems, and executable code,

85
00:04:53.790 --> 00:04:57.390
making it important for reverse engineering firmware.
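
NOTE
Editorial sketch, not part of the spoken lesson: a much-simplified illustration of signature scanning, the general idea behind locating embedded components in a firmware image. This is not binwalk's implementation; the signature table is a small assumed subset and the sample blob is synthetic.
```python
# Well-known magic byte sequences mapped to a short description.
SIGNATURES = {
    b"\x1f\x8b\x08": "gzip data",
    b"PK\x03\x04": "ZIP archive entry",
    b"\x7fELF": "ELF executable",
    b"hsqs": "SquashFS filesystem (little-endian)",
}
def scan_signatures(blob: bytes) -> list[tuple[int, str]]:
    """Return (offset, description) pairs for every signature match, sorted by offset."""
    hits = []
    for magic, label in SIGNATURES.items():
        start = 0
        while (pos := blob.find(magic, start)) != -1:
            hits.append((pos, label))
            start = pos + 1
    return sorted(hits)
# Synthetic image: padding, an ELF header marker, more padding, a SquashFS marker.
firmware = b"\x00" * 16 + b"\x7fELF" + b"\x00" * 12 + b"hsqs" + b"\xff" * 8
for offset, label in scan_signatures(firmware):
    print(f"0x{offset:08x}  {label}")
```
Short signatures can match by coincidence, so each hit is a lead to verify rather than a confirmed component.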

86
00:04:57.390 --> 00:05:01.650
Next, hexdump is used to display binary files

87
00:05:01.650 --> 00:05:04.800
in a human-readable hexadecimal format,

88
00:05:04.800 --> 00:05:07.830
allowing analysts to examine the structure

89
00:05:07.830 --> 00:05:09.753
and content of a file.
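
NOTE
Editorial sketch, not part of the spoken lesson: a minimal Python rendering of the kind of view a hex dump gives (offset, hexadecimal bytes, printable characters). The column layout is an assumption modeled loosely on common hex viewers, not the exact output of the hexdump utility.
```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as lines of: offset, hex values, and an ASCII column."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable bytes are shown as dots in the ASCII column.
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  |{ascii_part}|")
    return "\n".join(lines)
# The first bytes of a 64-bit little-endian ELF file make a recognizable header.
dump = hexdump(b"\x7fELF\x02\x01\x01\x00")
print(dump)
```
Recognizable headers such as the ELF marker stand out in the ASCII column, which is what makes manual inspection practical.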

90
00:05:10.767 --> 00:05:14.043
For example, during a forensic investigation

91
00:05:14.043 --> 00:05:16.050
of an embedded device's firmware,

92
00:05:16.050 --> 00:05:19.898
an analyst may use binwalk to extract

93
00:05:19.898 --> 00:05:21.570
and inspect binary components,

94
00:05:21.570 --> 00:05:25.080
potentially locating embedded malicious code.

95
00:05:25.080 --> 00:05:29.310
Then, hexdump can help investigators manually inspect

96
00:05:29.310 --> 00:05:34.020
the content of these binary files to identify patterns,

97
00:05:34.020 --> 00:05:37.410
headers, or anomalies that suggest tampering

98
00:05:37.410 --> 00:05:39.150
or malware insertion.

99
00:05:39.150 --> 00:05:42.870
The next tool, strace, tracks system calls

100
00:05:42.870 --> 00:05:46.860
made by a running binary during execution,

101
00:05:46.860 --> 00:05:50.160
revealing how it interacts with the system

102
00:05:50.160 --> 00:05:53.250
and whether it performs suspicious actions

103
00:05:53.250 --> 00:05:55.860
like unauthorized file access.

104
00:05:55.860 --> 00:06:00.240
And finally, ldd identifies the shared libraries

105
00:06:00.240 --> 00:06:03.300
a binary relies on, helping analysts

106
00:06:03.300 --> 00:06:05.910
understand the program's dependencies

107
00:06:05.910 --> 00:06:10.110
and detect any unusual or malicious modifications

108
00:06:10.110 --> 00:06:11.820
linked to libraries.
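
NOTE
Editorial sketch, not part of the spoken lesson: one way to drive strace and ldd from a script, assuming a Linux host where both tools may or may not be installed. The helper names and the sample path are hypothetical. Both tools can cause the target to run code (strace executes it, and ldd may do so on some systems), so an isolated analysis environment is assumed for untrusted binaries.
```python
import shutil
import subprocess
from typing import Optional
def build_commands(binary_path: str) -> dict[str, list[str]]:
    """Build argument lists for tracing file-related system calls and listing shared libraries."""
    return {
        # -f follows child processes; -e trace=file limits output to file-related calls
        "strace": ["strace", "-f", "-e", "trace=file", binary_path],
        "ldd": ["ldd", binary_path],
    }
def run_if_available(command: list[str]) -> Optional[str]:
    """Run the command and return its combined output, or None if the tool is not installed."""
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.stdout + result.stderr
commands = build_commands("/path/to/sample")
print(commands["strace"])
print(commands["ldd"])
```
Calling run_if_available(commands["ldd"]) would return the dependency listing as text when ldd is present.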

109
00:06:11.820 --> 00:06:14.250
Third, we have disassembly.

110
00:06:14.250 --> 00:06:18.780
Disassembly is the process of converting binary code

111
00:06:18.780 --> 00:06:22.560
into assembly language, which provides a low-level

112
00:06:22.560 --> 00:06:26.760
human-readable representation of machine instructions.
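
NOTE
Editorial sketch, not part of the spoken lesson: a toy disassembler that maps a handful of 32-bit x86 opcodes to mnemonics, only to illustrate the byte-to-instruction translation. Real disassemblers handle the full instruction set, prefixes, and operand encodings; the opcode table here is a deliberately tiny assumed subset.
```python
import struct
# Single-byte instructions with no operands (32-bit x86 encodings).
SIMPLE = {0x90: "nop", 0xC3: "ret", 0xCC: "int3", 0x55: "push ebp", 0x5D: "pop ebp"}
def disassemble(code: bytes) -> list[str]:
    """Translate raw bytes into 'offset: mnemonic' lines for the supported opcodes."""
    out, i = [], 0
    while i < len(code):
        op = code[i]
        if op in SIMPLE:
            out.append(f"{i:04x}: {SIMPLE[op]}")
            i += 1
        elif op == 0xB8 and i + 5 <= len(code):
            # B8 is followed by a 32-bit little-endian immediate: mov eax, imm32
            (imm,) = struct.unpack_from("<I", code, i + 1)
            out.append(f"{i:04x}: mov eax, 0x{imm:x}")
            i += 5
        else:
            # Unknown byte: emit it as raw data, as disassemblers commonly do
            out.append(f"{i:04x}: db 0x{op:02x}")
            i += 1
    return out
listing = disassemble(bytes.fromhex("55b82a0000005dc3"))
print("\n".join(listing))
```
The eight input bytes decode to four instructions, which shows why assembly is easier to read than the raw byte stream.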

113
00:06:26.760 --> 00:06:29.340
This allows investigators to understand

114
00:06:29.340 --> 00:06:33.090
how software operates at a fundamental level,

115
00:06:33.090 --> 00:06:36.390
without access to the original source code.

116
00:06:36.390 --> 00:06:40.200
Disassembling malware or proprietary software

117
00:06:40.200 --> 00:06:43.800
can reveal hidden features, vulnerabilities,

118
00:06:43.800 --> 00:06:45.660
or malicious behavior.

119
00:06:45.660 --> 00:06:49.950
In forensic analysis, disassembly provides insights

120
00:06:49.950 --> 00:06:53.550
into how an application interacts with the system,

121
00:06:53.550 --> 00:06:56.310
executes instructions, and performs

122
00:06:56.310 --> 00:06:58.950
potentially harmful operations.

123
00:06:58.950 --> 00:07:03.780
Popular tools for disassembly include IDA Pro, Ghidra,

124
00:07:03.780 --> 00:07:07.260
and OllyDbg, or Olly Debug.

125
00:07:07.260 --> 00:07:10.650
These tools help analysts break down binaries

126
00:07:10.650 --> 00:07:14.010
into assembly language, revealing the sequence

127
00:07:14.010 --> 00:07:16.830
of operations a program performs.

128
00:07:16.830 --> 00:07:20.550
For example, in a forensic investigation,

129
00:07:20.550 --> 00:07:23.490
if an unknown binary is discovered,

130
00:07:23.490 --> 00:07:27.300
an analyst might use Ghidra or IDA Pro

131
00:07:27.300 --> 00:07:32.280
to disassemble the file and investigate its system calls,

132
00:07:32.280 --> 00:07:36.000
memory usage, and external communication

133
00:07:36.000 --> 00:07:39.150
to uncover any suspicious activity,

134
00:07:39.150 --> 00:07:42.420
such as attempts to steal sensitive information

135
00:07:42.420 --> 00:07:45.360
or bypass security mechanisms.

136
00:07:45.360 --> 00:07:48.930
In addition to pure disassembly tools,

137
00:07:48.930 --> 00:07:52.770
debuggers like OllyDbg for Windows applications

138
00:07:52.770 --> 00:07:56.250
and GDB, the GNU Project Debugger

139
00:07:56.250 --> 00:07:58.410
for Linux and Unix systems,

140
00:07:58.410 --> 00:08:02.550
also play a critical role in analyzing binaries.

141
00:08:02.550 --> 00:08:05.760
These tools allow investigators to step through

142
00:08:05.760 --> 00:08:10.380
the execution of a disassembled program in real time,

143
00:08:10.380 --> 00:08:14.460
making it easier to identify specific behaviors,

144
00:08:14.460 --> 00:08:17.790
vulnerabilities, or malicious actions.

145
00:08:17.790 --> 00:08:20.640
A key feature in these debuggers

146
00:08:20.640 --> 00:08:24.210
is the use of breakpoints, which allow analysts

147
00:08:24.210 --> 00:08:27.210
to pause the execution of a program

148
00:08:27.210 --> 00:08:29.820
at specific points in the code.

149
00:08:29.820 --> 00:08:33.750
By setting breakpoints, investigators can closely examine

150
00:08:33.750 --> 00:08:37.890
the state of the program, including memory, variables,

151
00:08:37.890 --> 00:08:41.160
and register values at critical moments.

152
00:08:41.160 --> 00:08:44.760
This helps pinpoint where suspicious behavior occurs

153
00:08:44.760 --> 00:08:47.910
or where vulnerabilities may be exploited.
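
NOTE
Editorial sketch, not part of the spoken lesson: an analogy in pure Python for what a breakpoint offers, namely capturing program state at a chosen point. It uses the standard sys.settrace hook and is not how OllyDbg or GDB work internally; the function name accumulate is hypothetical.
```python
import sys
captured = []
def accumulate(n):
    total = 0
    for i in range(n):
        total += i
    return total
def tracer(frame, event, arg):
    # Act like a breakpoint placed on the function's return: snapshot its local variables.
    if frame.f_code is accumulate.__code__ and event == "return":
        captured.append(dict(frame.f_locals))
    return tracer
sys.settrace(tracer)
result = accumulate(4)
sys.settrace(None)
print(result, captured)
```
A real debugger would pause execution here and let the analyst inspect memory and registers interactively; the snapshot stands in for that inspection step.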

154
00:08:47.910 --> 00:08:51.780
Fourth and last, we have decompilation.

155
00:08:51.780 --> 00:08:54.900
Decompilation is the process of translating

156
00:08:54.900 --> 00:08:58.050
executable binary or byte code

157
00:08:58.050 --> 00:09:01.771
back into a high-level programming language

158
00:09:01.771 --> 00:09:04.020
such as C, Python, or Java.

159
00:09:04.020 --> 00:09:07.020
Unlike disassembly, which converts binary

160
00:09:07.020 --> 00:09:10.950
to assembly language, decompilation aims to produce

161
00:09:10.950 --> 00:09:13.680
higher-level human-readable code

162
00:09:13.680 --> 00:09:17.370
that closely resembles the original source code.

163
00:09:17.370 --> 00:09:21.180
This is especially useful in forensic analysis

164
00:09:21.180 --> 00:09:23.970
when the goal is to understand the logic

165
00:09:23.970 --> 00:09:26.940
and flow of a program in a way that is

166
00:09:26.940 --> 00:09:31.380
easier to interpret than raw machine or assembly code.

167
00:09:31.380 --> 00:09:35.070
Languages like Java, Python, and JavaScript

168
00:09:35.070 --> 00:09:39.060
are among the most likely to be successfully decompiled

169
00:09:39.060 --> 00:09:42.330
because they compile to intermediate byte code,

170
00:09:42.330 --> 00:09:45.660
which retains significant structural information

171
00:09:45.660 --> 00:09:47.730
from the original source.
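
NOTE
Editorial sketch, not part of the spoken lesson: a small demonstration, assuming CPython, of how much structure survives compilation to byte code. The sample source, including the names exfiltrate, send, and secret_key, is made up for illustration.
```python
# Hypothetical sample source; it is compiled but never executed.
source = (
    "def exfiltrate(host, payload):\n"
    "    secret_key = 'hunter2'\n"
    "    return send(host, payload + secret_key)\n"
)
module_code = compile(source, "<sample>", "exec")
# The function body is stored as a nested code object among the module's constants.
func_code = next(c for c in module_code.co_consts if hasattr(c, "co_code"))
print("function:", func_code.co_name)
print("locals:  ", func_code.co_varnames)
print("globals: ", func_code.co_names)
print("consts:  ", func_code.co_consts)
```
Function names, variable names, referenced globals, and string constants are all still present, which is the kind of information a decompiler builds on.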

172
00:09:47.730 --> 00:09:51.210
For example, Java compiles to byte code

173
00:09:51.210 --> 00:09:54.900
for the Java virtual machine, and tools like

174
00:09:54.900 --> 00:09:58.830
Class File Reader can effectively reverse engineer

175
00:09:58.830 --> 00:10:01.890
Java byte code back to Java source code.

176
00:10:01.890 --> 00:10:06.870
Similarly, Python byte code produced in .pyc files

177
00:10:06.870 --> 00:10:09.900
is often successfully decompiled

178
00:10:09.900 --> 00:10:13.050
using tools like Uncompyle6.

179
00:10:13.050 --> 00:10:17.130
Next, JavaScript, though not traditionally compiled,

180
00:10:17.130 --> 00:10:21.000
can often be restored from obfuscated forms

181
00:10:21.000 --> 00:10:23.220
using deobfuscators.

182
00:10:23.220 --> 00:10:27.600
Finally, tools like the Java Executable Compiler

183
00:10:27.600 --> 00:10:31.620
and Ghidra are frequently used for decompilation

184
00:10:31.620 --> 00:10:34.230
in forensic investigations

185
00:10:34.230 --> 00:10:38.370
because they help reverse engineer suspicious applications,

186
00:10:38.370 --> 00:10:40.860
uncover hidden malicious functions,

187
00:10:40.860 --> 00:10:43.650
or reveal potential exploits.

188
00:10:43.650 --> 00:10:48.600
Overall, decompilation simplifies the analysis process,

189
00:10:48.600 --> 00:10:52.260
allowing investigators to identify patterns

190
00:10:52.260 --> 00:10:54.900
or reconstruct the original intent

191
00:10:54.900 --> 00:10:57.120
of the program more easily.

192
00:10:57.120 --> 00:11:01.560
So remember, reverse engineering is all about

193
00:11:01.560 --> 00:11:04.890
breaking down software or hardware components

194
00:11:04.890 --> 00:11:08.580
to understand their structure, functionality,

195
00:11:08.580 --> 00:11:10.860
and potential vulnerabilities.

196
00:11:10.860 --> 00:11:14.820
It involves analyzing concepts like byte code,

197
00:11:14.820 --> 00:11:19.320
binary code, disassembly, and decompilation.

198
00:11:19.320 --> 00:11:23.700
Byte code is an intermediate representation of code

199
00:11:23.700 --> 00:11:27.660
that virtual machines execute, while binary code

200
00:11:27.660 --> 00:11:30.660
consists of machine-level instructions

201
00:11:30.660 --> 00:11:33.480
executed directly by the computer.

202
00:11:33.480 --> 00:11:37.710
Disassembly converts binary code into human-readable

203
00:11:37.710 --> 00:11:41.850
assembly language, which helps investigators understand

204
00:11:41.850 --> 00:11:45.120
how software operates at a fundamental level.

205
00:11:45.120 --> 00:11:49.110
Finally, decompilation goes beyond disassembly

206
00:11:49.110 --> 00:11:52.320
by translating executable code back into

207
00:11:52.320 --> 00:11:54.780
high-level programming languages,

208
00:11:54.780 --> 00:11:57.840
making it easier to understand and read.

209
00:11:57.840 --> 00:12:01.350
These techniques are used in forensic analysis

210
00:12:01.350 --> 00:12:05.160
for uncovering malicious behavior, hidden functions,

211
00:12:05.160 --> 00:12:07.503
or vulnerabilities in software.

