MongoDB pipeline for Scrapy. This module supports both standalone MongoDB instances and replica sets, and inserts items into MongoDB as soon as your spider finds data to extract.
scrapy-mongodb can also buffer objects if you prefer to write chunks of data to MongoDB rather than one write per document. See the MONGODB_BUFFER_DATA option for details.
Install via pip:
pip install scrapy-mongodb
Add scrapy-mongodb to your project's settings.py file.
ITEM_PIPELINES = [
'scrapy_mongodb.MongoDBPipeline',
]
MONGODB_URI = 'mongodb://localhost:27017'
MONGODB_DATABASE = 'scrapy'
MONGODB_COLLECTION = 'my_items'
If you want a unique key in your database, add the key to the configuration like this:
MONGODB_UNIQUE_KEY = 'url'
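To illustrate what a unique key buys you, here is a minimal sketch that emulates upsert-by-key semantics with a plain dict standing in for the MongoDB collection. The function name upsert_by_key and the in-memory store are hypothetical and only for illustration; this is not scrapy-mongodb's own code, and it assumes that an item whose key value already exists replaces the stored document instead of creating a duplicate.

```python
def upsert_by_key(collection, item, unique_key):
    """Insert item, or replace the stored item with the same unique_key value.

    `collection` is a plain dict standing in for a MongoDB collection
    (an assumption made so the sketch runs without a database).
    """
    collection[item[unique_key]] = dict(item)
    return collection


store = {}
upsert_by_key(store, {"url": "http://example.com/a", "title": "first"}, "url")
# Same url again: the stored document is replaced, not duplicated.
upsert_by_key(store, {"url": "http://example.com/a", "title": "updated"}, "url")
upsert_by_key(store, {"url": "http://example.com/b", "title": "other"}, "url")
```

After these three calls the store holds two documents, one per distinct url.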
You can configure scrapy-mongodb to support MongoDB replica sets by adding the MONGODB_REPLICA_SET config option and listing the replica set hosts in MONGODB_URI:
MONGODB_REPLICA_SET = 'myReplicaSetName'
MONGODB_URI = 'mongodb://host1.example.com:27017,host2.example.com:27017,host3.example.com:27017'
If you need to ensure that your data has been replicated, use the MONGODB_REPLICA_SET_W option. It is an implementation of the w parameter in pymongo. Details from the pymongo documentation:
Write operations will block until they have been replicated to the specified number or tagged set of servers. w=<int> always includes the replica set primary (e.g. w=3 means write to the primary and wait until replicated to two secondaries). Passing w=0 disables write acknowledgement and all other write concern options.
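As a sketch of how the replica set options fit together, the settings.py fragment below combines the three settings named in this document. The host names and replica set name are the placeholder values used above, and the value 2 is only an example choice.

```python
# settings.py (fragment): replica set with write acknowledgement
MONGODB_URI = 'mongodb://host1.example.com:27017,host2.example.com:27017,host3.example.com:27017'
MONGODB_REPLICA_SET = 'myReplicaSetName'

# w=2: block until the primary and one secondary have the write.
MONGODB_REPLICA_SET_W = 2
```

A higher value waits for more members and so trades write latency for stronger replication guarantees.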
To ease the load on MongoDB, scrapy-mongodb has a buffering feature. Enable it by setting MONGODB_BUFFER_DATA to the buffer size you want. If you set it to 10, scrapy-mongodb will write 10 documents at a time to MongoDB.
MONGODB_BUFFER_DATA = 10
It is not possible to combine this feature with MONGODB_UNIQUE_KEY, because the update method in pymongo does not support multi-document updates.
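The buffering idea can be sketched in a few lines of plain Python. The ItemBuffer class and its write_many callback are hypothetical names used for illustration, not scrapy-mongodb's internals; in a real pipeline the callback could be a bulk insert such as pymongo's Collection.insert_many, and the final flush would typically happen when the spider closes.

```python
class ItemBuffer:
    """Collect items and hand them to `write_many` in batches of `size`."""

    def __init__(self, size, write_many):
        self.size = size
        self.write_many = write_many  # hypothetical bulk-writer callback
        self.items = []

    def add(self, item):
        self.items.append(item)
        # Write one batch as soon as the buffer is full.
        if len(self.items) >= self.size:
            self.flush()

    def flush(self):
        # Write whatever is buffered, if anything, and empty the buffer.
        if self.items:
            self.write_many(list(self.items))
            self.items.clear()


batches = []  # stands in for the database: one entry per bulk write
buf = ItemBuffer(size=10, write_many=batches.append)
for n in range(25):
    buf.add({"n": n})
buf.flush()  # write the leftover items that did not fill a batch
```

With 25 items and a buffer size of 10, the writer is called three times: two full batches and one final partial batch.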
scrapy-mongodb can append a timestamp to your item when inserting it into the database. Enable this feature like this:
MONGODB_ADD_TIMESTAMP = True
This will modify the document to look something like this:
{
...
'scrapy-mongodb': {
'ts': ISODate("2013-01-10T07:43:56.797Z")
}
...
}
The timestamp is in UTC.
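A minimal sketch of how such a timestamp could be attached to an item before it is written. The helper name add_timestamp is hypothetical, and the key layout follows the example document above; this is an illustration, not the module's own code.

```python
from datetime import datetime, timezone


def add_timestamp(item):
    """Return a copy of item with a UTC timestamp under 'scrapy-mongodb'."""
    stamped = dict(item)
    stamped["scrapy-mongodb"] = {"ts": datetime.now(timezone.utc)}
    return stamped


doc = add_timestamp({"url": "http://example.com/a"})
```

Copying the item first leaves the original object untouched for any pipelines that run afterwards.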
The following configuration options are available. Put these in your settings.py file.
Parameter | Default | Required? | Description |
---|---|---|---|
MONGODB_DATABASE | scrapy-mongodb | No | Database name to use. Does not need to exist. |
MONGODB_COLLECTION | items | No | Collection within the database to use. Does not need to exist. |
MONGODB_URI | mongodb://localhost:27017 | No | Add the URI to the MongoDB instance or replica set you want to connect to. It must start with mongodb://. See more in the MongoDB docs. Some example strings: mongodb://user:pass@host:port and mongodb://user:pass@host:port,host2:port2 |
MONGODB_UNIQUE_KEY | None | No | If you want to have a unique key in the database, enter the key name here. scrapy-mongodb will ensure the key is properly indexed. |
MONGODB_BUFFER_DATA | None | No | To ease the load on MongoDB you might want to buffer data in the client before sending it to MongoDB. Set this option to the number of items you want to buffer in the client before sending them to MongoDB. Setting a MONGODB_UNIQUE_KEY together with MONGODB_BUFFER_DATA is not supported. |
MONGODB_ADD_TIMESTAMP | False | No | If this is set to True, scrapy-mongodb will add a timestamp key to the documents. The document will look like this: { 'scrapy-mongodb': { 'ts': ISODate("2013-01-10T07:43:56.797Z") } } |
MONGODB_FSYNC | False | No | If this is set to True it forces MongoDB to wait for all files to be synced before returning. |
MONGODB_REPLICA_SET | None | Yes, for replica sets | Set this if you want to enable replica set support. The option should be given the name of the replica set you want to connect to. MONGODB_HOST and MONGODB_PORT should point at your config server. |
MONGODB_REPLICA_SET_W | 0 | No | Best described in the pymongo documentation: Write operations will block until they have been replicated to the specified number or tagged set of servers. w=<int> always includes the replica set primary (e.g. w=3 means write to the primary and wait until replicated to two secondaries). Passing w=0 disables write acknowledgement and all other write concern options. |
MONGODB_HOST | localhost | No | DEPRECATED since scrapy-mongodb 0.5.0, use MONGODB_URI instead. MongoDB host name to connect to. |
MONGODB_PORT | 27017 | No | DEPRECATED since scrapy-mongodb 0.5.0, use MONGODB_URI instead. MongoDB port number to connect to. |
MONGODB_REPLICA_SET_HOSTS | None | No | DEPRECATED since scrapy-mongodb 0.5.0, use MONGODB_URI instead. Host string to use to connect to the replica set. See the hosts_or_uri option in the pymongo documentation. |
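If you are migrating from the deprecated MONGODB_HOST and MONGODB_PORT options, the equivalent MONGODB_URI value can be derived as sketched below. The helper uri_from_host_port is a hypothetical name for illustration; its defaults mirror the defaults listed in the table.

```python
def uri_from_host_port(host="localhost", port=27017):
    """Build a mongodb:// URI from a host name and port number."""
    return "mongodb://{}:{}".format(host, port)


# Equivalent of MONGODB_HOST = 'localhost' and MONGODB_PORT = 27017
MONGODB_URI = uri_from_host_port()
```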
0.6.2 (2013-08-23)
0.6.1 (2013-07-14)
0.6.0 (2013-06-04)
0.5.1 (2013-06-03)
0.5.0 (2013-01-10)
0.4.0 (2013-01-07)
0.3.0 (2013-01-06)
0.2.0 (2013-01-06)
0.1.0 (2013-01-06)
scrapy-mongodb pipeline module
make release
This project is maintained by Sebastian Dahlgren.
APACHE LICENSE 2.0
Copyright 2013 Sebastian Dahlgren
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.