The full name of this repository is Distributed Matrix Computation Playground. As the name say, this repository will be used to record some methods of distributed matrix computation.
We will focus on distributed matrix multiplication at first. We employ the matrix organization from SystemDS.
We have simplified the struct in DistributedMacPlayground from SystemDS. We only focus on the matrix computation in SystemDS rather than the whole SystemDS ( included instruction parse, compiler optimization, etc.)
If you only want to run it in the local machine, you can find test code which run in local from src/test/java
.
We use the JUnit to write some test code for every distributed matrix multiplication.
Before you run test code, you should make sure your project have the path src/test/java/cache/Cpmm
, src/test/java/cache/Mapmm
, src/test/java/cache/PMapmm
, etc. (We will fix it if we have time.)
You should package the project by using maven
if you want to run the code in spark cluster. We can run the project from Main class and change the data and different distributed matrix multiplication by changing the corresponded parameter.
You can run the next command to submit the application to the spark cluster. --deploy-mode
can be cluster
or client
. The value of --master
can be found in the website of your spark cluster.
spark-submit --deploy-mode cluster\
--master spark://6e1929967e39:7077 \
--class com.distributedMacPlayground.Main \
[packaged .jar file (you need to make sure it has enough dependencies.)] \
[main parameters]
The example of full command:
bin/spark-submit --deploy-mode cluster\
--conf spark.driver.host=hadoop101 \
--master spark://hadoop101:7077 \
--class com.distributedMacPlayground.Main \
DistributeMacPlayground-1.0-SNAPSHOT-jar-with-dependencies.jar \
-mmtype CPMM \
-datatype index \
-in1 hdfs://hadoop101:8020/user/hadoop/100x300x10_matrix_index.csv \
-in2 hdfs://hadoop101:8020/user/hadoop/300x100x10_matrix_index.csv
bin/spark-submit --deploy-mode cluster\
--conf spark.driver.host=hadoop101 \
--master spark://hadoop101:7077 \
--class com.distributedMacPlayground.Main \
DistributeMacPlayground-1.0-SNAPSHOT-jar-with-dependencies.jar \
-mmtype RMM \
-datatype data \
-row 10000 \
-col 10000 \
-middle 100000 \
-blockSize 1000 \
-in1 hdfs://hadoop101:8020/user/hadoop/10000x100000x0.1.csv \
-in2 hdfs://hadoop101:8020/user/hadoop/100000x10000x0.1.csv \
-outputempty true \
-CACHETYPE left
some bug fix Modify the conf file of spark,add
spark.local.dir /tmp/spark-temp
——resovle the problem about out of disk.spark.local.dir /opt/module/spark/tmp spark.executor.memory 30925M spark.driver.memory 30925M spark.driver.maxResultSize 45G spark.driver.cores 4 spark.driver.host hadoop101 spark.ui.enabled true
local submit test:
bin/spark-submit --deploy-mode cluster \
--master spark://230153b45b98:7077 \
--class com.distributedMacPlayground.Main \
DistributeMacPlayground-1.0-SNAPSHOT-jar-with-dependencies.jar \
-mmtype cpmm \
-datatype data \
-row 30000 \
-col 20000 \
-middle 1000 \
-blockSize 1000 \
-in1 hdfs://230153b45b98:9000/user/root/10Kx100K.csv \
-in2 hdfs://230153b45b98:9000/user/root/100Kx10K.csv \
-save hdfs://230153b45b98:9000/user/root
GNMF submit test:
bin/spark-submit --deploy-mode cluster\
--master spark://hadoop101:7077 \
--class com.distributedMacPlayground.GNMF \
DistributeMacPlayground-1.0-SNAPSHOT-jar-with-dependencies.jar \
-type RMM \
-path hdfs://hadoop101:8020/user/hadoop/soc-buzznet.mtx \
-row 101163 \
-col 101163 \
-middle 1000 \
-BLOCKSIZE 1000 \
-iter 1
The main parameters can write by considering the follow table.
Parameter Name | Type | Default | Necessary | Description |
---|---|---|---|---|
-mmType | String | NULL | True | The method of distributed matrix multiplication. You can choose CpMM , MapMM , PMapMM and RMM . |
-dataType | String | NULL | True | The type of input data. You should make sure your input file are .csv files. The value of this parameter can be data or index . |
-in1 | String | NULL | True | The file path of left matrix. You must be the hdfs type like hdfs://localhost:9000/[your hdfs path] |
-in2 | String | NULL | True | The file path of right matrix. You must be the hdfs type like hdfs://localhost:9000/[your hdfs path] |
-row | int | NULL | False if dataType=index ,True if dataType=data |
The output matrix row. |
-col | int | NULL | False if dataType=index ,True if dataType=data |
The output matrix col. |
-middle | int | NULL | False if dataType=index ,True if dataType=data |
The common dimension of the two matrix. |
-blockSize | int | NULL | False if dataType=index ,True if dataType=data |
The size of block partitioned from large distributed matrix . |
-cacheType | String | left | False | Use in MapMM. The value of this parameter can be left or right . Broadcast the left matrix if cacheType=left . Broadcast the right matrix if cacheType=right |
-aggType | String | multi | False | Use in MapMM. The value of this parameter can be multi or sigle . It means whether the input matrix is only a single block. The input matrix usually is multi blocks. |
-outputEmpty | boolean | false | False | Use in MapMM. Whether can output a empty matrix. |
-min | double | 5 | False | Use it when dataType=index . The minimum of the generated matrix. |
-max | double | 10 | False | Use it when dataType=index . The maximum of the generated matrix. |
-sparsity | double | 1 | False | Use it when dataType=index . It range in [0,1] . The sparsity of the generated matrix. |
-seed | int | -1 | False | Use it when dataType=index . The random seed of the generated matrix. |
String | uniform | False | Use it when dataType=index . The random distribution of the generated matrix. The value of this parameter can be uniform , normal or poisson . |
If dataType=data
:
Your input file should be a .csv file. For example, the matrix format like
1,2,3
2,3,4
We will use the value as the distributed matrix value.
If dataType=data
:
The name of your input file should be the format like [row length]x[col length]x[block size]_matrix_index.csv
, such as 12345x321x10_matrix_index.csv
.
And the content of the file should be the block index. For example, 4x4x2_matrix_index.csv
can be expressed by follow format: (Row can be split to 2 blocks and col can be split to 2 blocks, so it has 2x2=4 index. )
1,1
1,2
2,1
2,2
We will use the index to generate partitioned distributed matrix.
core-site.xml
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- 指定 NameNode 的地址 -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop101:8020</value>
</property>
<!-- 指定 hadoop 数据的存储目录 -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/module/hadoop/data</value>
</property>
<!-- 配置 HDFS 网页登录使用的静态用户为 atguigu -->
<property>
<name>hadoop.http.staticuser.user</name>
<value>hadoop</value>
</property>
</configuration>
hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- nn web 端访问地址-->
<property>
<name>dfs.namenode.http-address</name>
<value>hadoop101:9870</value>
</property>
<!-- 2nn web 端访问地址-->
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop103:50070</value>
</property>
</configuration>
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<!-- Site specific YARN configuration properties -->
<!-- 指定 MR 走 shuffle -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- 指定 ResourceManager 的地址-->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop102</value>
</property>
<!-- 环境变量的继承 -->
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CO
NF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAP
RED_HOME</value>
</property>
<!--是否启动一个线程检查每个任务正使用的物理内存量,如果任务超出分配值,则直接将其杀掉,默认
是 true -->
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<!--开启日志聚合-->
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<!--日志聚合hdfs存储路径-->
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/tmp/logs</value>
</property>
<property>
<!--hdfs上的日志保留时间-->
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>file:/opt/module/local_log/local</value>
</property>
<property>
<!--应用执行时存储路径-->
<name>yarn.nodemanager.log-dirs</name>
<value>file:/opt/module/local_log/log</value>
</property>
<property>
<!--应用执行完日志保留的时间,默认0,即执行完立刻删除-->
<name>yarn.nodemanager.delete.debug-delay-sec</name>
<value>600</value>
</property>
</configuration>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- 指定 MapReduce 程序运行在 Yarn 上 -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/opt/module/hadoop</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/opt/module/hadoop</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/opt/module/hadoop</value>
</property>
</configuration>