
Commit 9f5325e

Merge branch 'develop' into doc3
2 parents b5e3697 + 765735b

32 files changed: +238 -180 lines changed

demo/quick_start/data/README.md

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+This dataset consists of electronics product reviews associated with
+binary labels (positive/negative) for sentiment classification.
+
+The preprocessed data can be downloaded by script `get_data.sh`.
+The data was derived from reviews_Electronics_5.json.gz at
+
+http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz
+
+If you want to process the raw data, you can use the script `proc_from_raw_data/get_data.sh`.
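Taken together with the download script changed in the next file, using this dataset from a fresh checkout reduces to a couple of commands; a minimal sketch, assuming the repository layout referenced above:

```bash
# Fetch the preprocessed review data described in this README.
cd demo/quick_start
./data/get_data.sh

# Alternatively, rebuild from the raw reviews (slower; downloads the raw
# Amazon archive and the moses tokenizer):
# ./data/proc_from_raw_data/get_data.sh
```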

demo/quick_start/data/get_data.sh

Lines changed: 6 additions & 9 deletions
@@ -17,14 +17,11 @@ set -e
 DIR="$( cd "$(dirname "$0")" ; pwd -P )"
 cd $DIR

-echo "Downloading Amazon Electronics reviews data..."
-# http://jmcauley.ucsd.edu/data/amazon/
-wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz
+# Download the preprocessed data
+wget http://paddlepaddle.bj.bcebos.com/demo/quick_start_preprocessed_data/preprocessed_data.tar.gz

-echo "Downloading mosesdecoder..."
-#https://github.com/moses-smt/mosesdecoder
-wget https://github.com/moses-smt/mosesdecoder/archive/master.zip
+# Extract package
+tar zxvf preprocessed_data.tar.gz

-unzip master.zip
-rm master.zip
-echo "Done."
+# Remove compressed package
+rm preprocessed_data.tar.gz
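Read end to end, the slimmed-down downloader now amounts to the following; this is a sketch that assumes the unchanged file header above the hunk (the usual shebang, license comment, and `set -e`) stays as before:

```bash
#!/bin/bash
# Sketch of demo/quick_start/data/get_data.sh after this change (header assumed).
set -e

# Work relative to the script's own directory.
DIR="$( cd "$(dirname "$0")" ; pwd -P )"
cd $DIR

# Download the preprocessed data
wget http://paddlepaddle.bj.bcebos.com/demo/quick_start_preprocessed_data/preprocessed_data.tar.gz

# Extract package
tar zxvf preprocessed_data.tar.gz

# Remove compressed package
rm preprocessed_data.tar.gz
```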

demo/quick_start/data/pred.list

Lines changed: 0 additions & 1 deletion
This file was deleted.

demo/quick_start/data/pred.txt

Lines changed: 0 additions & 2 deletions
This file was deleted.

demo/quick_start/preprocess.sh renamed to demo/quick_start/data/proc_from_raw_data/get_data.sh

Lines changed: 27 additions & 11 deletions
@@ -16,10 +16,26 @@
 # 1. size of pos : neg = 1:1.
 # 2. size of testing set = min(25k, len(all_data) * 0.1), others is traning set.
 # 3. distinct train set and test set.
-# 4. build dict

 set -e

+DIR="$( cd "$(dirname "$0")" ; pwd -P )"
+cd $DIR
+
+# Download data
+echo "Downloading Amazon Electronics reviews data..."
+# http://jmcauley.ucsd.edu/data/amazon/
+wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz
+echo "Downloading mosesdecoder..."
+# https://github.com/moses-smt/mosesdecoder
+wget https://github.com/moses-smt/mosesdecoder/archive/master.zip
+
+unzip master.zip
+rm master.zip
+
+##################
+# Preprocess data
+echo "Preprocess data..."
 export LC_ALL=C
 UNAME_STR=`uname`

@@ -29,11 +45,11 @@ else
 SHUF_PROG='gshuf'
 fi

-mkdir -p data/tmp
-python preprocess.py -i data/reviews_Electronics_5.json.gz
+mkdir -p tmp
+python preprocess.py -i reviews_Electronics_5.json.gz
 # uniq and shuffle
-cd data/tmp
-echo 'uniq and shuffle...'
+cd tmp
+echo 'Uniq and shuffle...'
 cat pos_*|sort|uniq|${SHUF_PROG}> pos.shuffed
 cat neg_*|sort|uniq|${SHUF_PROG}> neg.shuffed

@@ -53,11 +69,11 @@ cat train.pos train.neg | ${SHUF_PROG} >../train.txt
 cat test.pos test.neg | ${SHUF_PROG} >../test.txt

 cd -
-echo 'data/train.txt' > data/train.list
-echo 'data/test.txt' > data/test.list
+echo 'train.txt' > train.list
+echo 'test.txt' > test.list

 # use 30k dict
-rm -rf data/tmp
-mv data/dict.txt data/dict_all.txt
-cat data/dict_all.txt | head -n 30001 > data/dict.txt
-echo 'preprocess finished'
+rm -rf tmp
+mv dict.txt dict_all.txt
+cat dict_all.txt | head -n 30001 > dict.txt
+echo 'Done.'
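With the rename, the raw-data pipeline is self-contained in `proc_from_raw_data/` and writes its outputs (`train.txt`, `test.txt`, the `.list` files, and the 30k-word `dict.txt`) next to the script instead of under `data/`. A usage sketch, assuming a checkout with network access and a `shuf`/`gshuf` binary available:

```bash
# Rebuild the quick_start dataset from the raw Amazon reviews (slow: downloads
# the raw review archive and the moses tokenizer before preprocessing).
cd demo/quick_start/data/proc_from_raw_data
./get_data.sh

# Outputs land next to the script:
ls train.txt test.txt train.list test.list dict.txt
```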

demo/quick_start/preprocess.py renamed to demo/quick_start/data/proc_from_raw_data/preprocess.py

Lines changed: 7 additions & 3 deletions
@@ -14,7 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-1. (remove HTML before or not)tokensizing
+1. Tokenize the words and punctuation
 2. pos sample : rating score 5; neg sample: rating score 1-2.

 Usage:

@@ -76,7 +76,11 @@ def tokenize(sentences):
     sentences : a list of input sentences.
     return: a list of processed text.
     """
-    dir = './data/mosesdecoder-master/scripts/tokenizer/tokenizer.perl'
+    dir = './mosesdecoder-master/scripts/tokenizer/tokenizer.perl'
+    if not os.path.exists(dir):
+        sys.exit(
+            "The ./mosesdecoder-master/scripts/tokenizer/tokenizer.perl does not exists."
+        )
     tokenizer_cmd = [dir, '-l', 'en', '-q', '-']
     assert isinstance(sentences, list)
     text = "\n".join(sentences)

@@ -104,7 +108,7 @@ def tokenize_batch(id):
         num_batch, instance, pre_fix = parse_queue.get()
         if num_batch == -1: ### parse_queue finished
             tokenize_queue.put((-1, None, None))
-            sys.stderr.write("tokenize theread %s finish\n" % (id))
+            sys.stderr.write("Thread %s finish\n" % (id))
             break
         tokenize_instance = tokenize(instance)
         tokenize_queue.put((num_batch, tokenize_instance, pre_fix))
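The path check added above means `preprocess.py` now fails fast when the moses tokenizer has not been unpacked next to it. A hand-run invocation sketch, matching the call shown in `proc_from_raw_data/get_data.sh` (the `-i` flag and file names are taken from that script):

```bash
# Run from demo/quick_start/data/proc_from_raw_data.
# Requires mosesdecoder-master/ (unzipped by get_data.sh) and the raw
# reviews_Electronics_5.json.gz in the same directory, or the script exits
# with the error message added in this commit.
python preprocess.py -i reviews_Electronics_5.json.gz
```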

demo/semantic_role_labeling/data/get_data.sh

Lines changed: 4 additions & 4 deletions
@@ -14,10 +14,10 @@
 # limitations under the License.
 set -e
 wget http://www.cs.upc.edu/~srlconll/conll05st-tests.tar.gz
-wget https://www.googledrive.com/host/0B7Q8d52jqeI9ejh6Q1RpMTFQT1k/semantic_role_labeling/verbDict.txt --no-check-certificate
-wget https://www.googledrive.com/host/0B7Q8d52jqeI9ejh6Q1RpMTFQT1k/semantic_role_labeling/targetDict.txt --no-check-certificate
-wget https://www.googledrive.com/host/0B7Q8d52jqeI9ejh6Q1RpMTFQT1k/semantic_role_labeling/wordDict.txt --no-check-certificate
-wget https://www.googledrive.com/host/0B7Q8d52jqeI9ejh6Q1RpMTFQT1k/semantic_role_labeling/emb --no-check-certificate
+wget http://paddlepaddle.bj.bcebos.com/demo/srl_dict_and_embedding/verbDict.txt
+wget http://paddlepaddle.bj.bcebos.com/demo/srl_dict_and_embedding/targetDict.txt
+wget http://paddlepaddle.bj.bcebos.com/demo/srl_dict_and_embedding/wordDict.txt
+wget http://paddlepaddle.bj.bcebos.com/demo/srl_dict_and_embedding/emb
 tar -xzvf conll05st-tests.tar.gz
 rm conll05st-tests.tar.gz
 cp ./conll05st-release/test.wsj/words/test.wsj.words.gz .

doc/demo/quick_start/index_en.md

Lines changed: 1 addition & 2 deletions
@@ -59,12 +59,11 @@ To build your text classification system, your code will need to perform five steps:
 ## Preprocess data into standardized format
 In this example, you are going to use [Amazon electronic product review dataset](http://jmcauley.ucsd.edu/data/amazon/) to build a bunch of deep neural network models for text classification. Each text in this dataset is a product review. This dataset has two categories: “positive” and “negative”. Positive means the reviewer likes the product, while negative means the reviewer does not like the product.

-`demo/quick_start` in the [source code](https://github.com/baidu/Paddle) provides scripts for downloading data and preprocessing data as shown below. The data process takes several minutes (about 3 minutes in our machine).
+`demo/quick_start` in the [source code](https://github.com/PaddlePaddle/Paddle) provides script for downloading the preprocessed data as shown below. (If you want to process the raw data, you can use the script `demo/quick_start/data/proc_from_raw_data/get_data.sh`).

 ```bash
 cd demo/quick_start
 ./data/get_data.sh
-./preprocess.sh
 ```

 ## Transfer Data to Model

doc_cn/build_and_install/install/docker_install.rst

Lines changed: 4 additions & 6 deletions
@@ -1,9 +1,7 @@
 安装PaddlePaddle的Docker镜像
 ============================

-PaddlePaddle提供了Docker的使用镜像。PaddlePaddle推荐使用Docker进行PaddlePaddle的部署和
-运行。Docker是一个基于容器的轻量级虚拟环境。具有和宿主机相近的运行效率,并提供
-了非常方便的二进制分发手段。
+PaddlePaddle项目提供官方 `Docker <https://www.docker.com/>`_ 镜像。Docker镜像是我们目前唯一官方支持的部署和运行方式。

 下述内容将分为如下几个类别描述。

@@ -41,7 +39,7 @@ PaddlePaddle提供的Docker镜像版本
 * CPU WITHOUT AVX: CPU版本,不支持AVX指令集的CPU也可以运行
 * GPU WITHOUT AVX: GPU版本,不需要AVX指令集的CPU也可以运行。

-用户可以选择对应版本的docker image。使用如下脚本可以确定本机的CPU知否支持 :code:`AVX` 指令集\:
+用户可以选择对应版本的docker image。使用如下脚本可以确定本机的CPU是否支持 :code:`AVX` 指令集\:

 .. code-block:: bash

@@ -67,7 +65,7 @@ mac osx或者是windows机器,请参考

 .. code-block:: bash

-$ docker run -it paddledev/paddlepaddle:cpu-latest
+$ docker run -it paddledev/paddle:cpu-latest

 即可启动和进入PaddlePaddle的container。如果运行GPU版本的PaddlePaddle,则需要先将
 cuda相关的Driver和设备映射进container中,脚本类似于

@@ -76,7 +74,7 @@ cuda相关的Driver和设备映射进container中,脚本类似于

 $ export CUDA_SO="$(\ls /usr/lib64/libcuda* | xargs -I{} echo '-v {}:{}') $(\ls /usr/lib64/libnvidia* | xargs -I{} echo '-v {}:{}')"
 $ export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}')
-$ docker run ${CUDA_SO} ${DEVICES} -it paddledev/paddlepaddle:latest-gpu
+$ docker run ${CUDA_SO} ${DEVICES} -it paddledev/paddle:gpu-latest

 进入Docker container后,运行 :code:`paddle version` 即可打印出PaddlePaddle的版本和构建
 信息。安装完成的PaddlePaddle主体包括三个部分, :code:`paddle` 脚本, python的
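The Chinese install guide above points readers at a script for checking whether the local CPU supports the :code:`AVX` instruction set before picking an image, but the script body sits outside this diff's context. A typical Linux check (a sketch, not necessarily the exact snippet in the docs) would be:

```bash
# Report whether the CPU advertises AVX (Linux; reads /proc/cpuinfo).
if grep -q avx /proc/cpuinfo; then
    echo "AVX supported: the regular CPU/GPU images will work."
else
    echo "No AVX: use the 'WITHOUT AVX' image variants."
fi
```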

doc_cn/demo/quick_start/index.md

Lines changed: 2 additions & 4 deletions
@@ -32,13 +32,11 @@

 ## 数据格式准备(Data Preparation)
 在本问题中,我们使用[Amazon电子产品评论数据](http://jmcauley.ucsd.edu/data/amazon/)
-将评论分为好评(正样本)和差评(负样本)两类。[源码](https://github.com/baidu/Paddle)`demo/quick_start`里提供了数据下载脚本
-和预处理脚本。
+将评论分为好评(正样本)和差评(负样本)两类。[源码](https://github.com/PaddlePaddle/Paddle)`demo/quick_start`里提供了下载已经预处理数据的脚本(如果想从最原始的数据处理,可以使用脚本 `./demo/quick_start/data/proc_from_raw_data/get_data.sh`)。

 ```bash
 cd demo/quick_start
 ./data/get_data.sh
-./preprocess.sh
 ```

 ## 数据向模型传送(Transfer Data to Model)

@@ -143,7 +141,7 @@ PyDataProvider2</a>。

 我们将以基本的逻辑回归网络作为起点,并逐渐展示更加深入的功能。更详细的网络配置
 连接请参考<a href = "../../../doc/layer.html">Layer文档</a>。
-所有配置在[源码](https://github.com/baidu/Paddle)`demo/quick_start`目录,首先列举逻辑回归网络。
+所有配置在[源码](https://github.com/PaddlePaddle/Paddle)`demo/quick_start`目录,首先列举逻辑回归网络。

 ### 逻辑回归模型(Logistic Regression)
