Matt Wells
|
53ee1039b8
|
import final hung crawl fix into old gb.
xor the firstip into the doledb key this time.
seems to avoid all collisions now so we don't
overwrite nodes in the doledb tree.
|
2016-11-07 09:11:27 -08:00 |
|
Matt Wells
|
616bfeea86
|
show corrupt collection numbers for spiderdb corrupted recs.
|
2016-08-15 12:55:22 -07:00 |
|
Zak Betz
|
0ee1e6164c
|
Add input validation to regexes before crawlbot collections are created.
Add ./gb egrep command to test regexes.
|
2016-05-11 14:03:45 -06:00 |
|
Matt
|
5072e851b7
|
fix misspelling
|
2016-03-28 17:26:40 -06:00 |
|
Matt Wells
|
1d2dfe1456
|
bring back max doc len parms.
index gbssIsContentTruncated field.
fix 30-day wait for >= 3 errors.
fix gbss formatting some more.
|
2016-02-08 14:10:04 -08:00 |
|
Matt Wells
|
becc244e12
|
add new link to page crawlbot to see spider attempt
gbss docs.
|
2015-12-15 16:22:53 -08:00 |
|
Matt Wells
|
ddd5cc711e
|
show hosts that say a collection has urls ready to spider
on page crawlbot.
|
2015-09-09 16:56:01 -07:00 |
|
Matt Wells
|
cbf01ab77c
|
add download new urls.csv link to crawlbot page
|
2015-08-29 14:28:01 -07:00 |
|
sam
|
f3d35b557f
|
should solve defect #3002
|
2015-07-13 18:08:25 -07:00 |
|
Matt Wells
|
eb10c62303
|
fix support for _html.json
|
2015-04-25 14:37:16 -07:00 |
|
Matt Wells
|
fc6d9631c5
|
fix for _html.json download
|
2015-04-24 18:02:46 -06:00 |
|
Matt Wells
|
a2feab9a4a
|
tap in some fixes for running the newly updated smokes
for dealing with the new urls.csv format
|
2015-04-21 15:20:57 -07:00 |
|
Matt
|
11ea50935d
|
use new urls.csv only for GET /v3/crawl/download/token-collname_urls.csv
version 3
|
2015-04-15 17:48:55 -06:00 |
|
Matt
|
ef42a9cf28
|
new urls.csv polish. moved columns around. added
some new gbss fields, like spidered time.
|
2015-04-15 17:42:56 -06:00 |
|
Matt Wells
|
61af961dfd
|
use m_sentToDiffbotThisTime in SpiderReply now too
|
2015-04-14 15:23:12 -07:00 |
|
Matt
|
90456222b6
|
now we add the spider status docs as json documents.
so you can facet/sortby the various fields, etc.
|
2015-03-19 16:17:36 -06:00 |
|
mwells
|
dfc069aaa1
|
do away with filtered/banned spider priorities.
add checkbox to signify force deletes to remove urls from index
if in the index, or not allow them in.
|
2015-03-17 20:27:23 -06:00 |
|
Matt
|
dfd6d8b2cf
|
fix critical spider bug that was deleting pages
because of bogus SpiderReply::m_langId values!
|
2015-03-05 08:49:39 -08:00 |
|
mwells
|
87285ba3cd
|
use gbmemcpy not memcpy so we can get profiler working again
since memcpy can't be interrupted and backtrace() called.
|
2015-01-13 12:25:42 -07:00 |
|
Matt
|
adcef39376
|
Merge branch 'diffbot-testing' into diffbot-matt
Conflicts:
Collectiondb.cpp
Collectiondb.h
Conf.cpp
Conf.h
Msg39.cpp
PageEvents.cpp
PageResults.cpp
PageTurk.cpp
Pages.cpp
Parms.cpp
Posdb.cpp
Proxy.cpp
Query.cpp
Query.h
RdbBase.cpp
RdbMap.cpp
Repair.cpp
Repair.h
SafeBuf.cpp
Spider.cpp
Tagdb.cpp
TopTree.cpp
XmlDoc.cpp
main.cpp
|
2014-11-20 16:53:07 -08:00 |
|
Matt
|
4e8a42e024
|
text replacements for bad int32_t substitutions
|
2014-11-17 18:24:38 -08:00 |
|
Matt
|
931a1c4bc6
|
good checkpoint. quite a few fixes.
|
2014-11-17 18:13:36 -08:00 |
|
Matt
|
96b8197ad3
|
now it compiles with -m32
|
2014-11-10 14:45:11 -08:00 |
|
Matt Wells
|
23c565afc8
|
fix a couple of cores.
reduce memory usage significantly by not
pre-allocating some per-collection hashtables.
|
2014-11-05 09:36:42 -08:00 |
|
emmanuelcharon
|
790c525820
|
rename diffbotHopcount to maxHops
|
2014-11-04 16:05:20 -08:00 |
|
emmanuelcharon
|
c29dedd714
|
added diffbotHopcount parameter for diffbot crawl and bulk jobs, also updated PageCrawlbot.cpp
|
2014-10-31 16:34:31 -07:00 |
|
Matt Wells
|
dbd3898cf0
|
fix a couple cores
|
2014-10-31 13:36:07 -07:00 |
|
Matt Wells
|
e7dd8f7956
|
replace long long with int64_t
|
2014-10-30 13:36:39 -06:00 |
|
Matt Wells
|
b13f3d24d7
|
replaced unsigned long long with uint64_t
|
2014-10-30 13:30:39 -06:00 |
|
mwells
|
0409571262
|
Merge branch 'diffbot-testing' into testing
Conflicts:
Spider.cpp
|
2014-07-28 14:37:44 -07:00 |
|
Matt Wells
|
3a32301f99
|
minor ui update
|
2014-07-28 13:26:17 -07:00 |
|
Matt Wells
|
d2b1196a85
|
Merge branch 'diffbot-testing' into testing
|
2014-07-22 10:47:33 -07:00 |
|
Matt Wells
|
46ca3fceeb
|
fix oops
|
2014-07-22 10:18:03 -07:00 |
|
Matt Wells
|
248b02ea9e
|
fix another spiderdb corruption core
|
2014-07-22 06:34:34 -07:00 |
|
mwells
|
d5805733e5
|
more api updates
|
2014-07-13 09:35:44 -07:00 |
|
mwells
|
5f26918910
|
lots of bug fixes. more qa fixes.
|
2014-07-11 08:00:30 -07:00 |
|
Matt Wells
|
1361e5728c
|
show actual diffbot error in urls.csv.
do not stop indexing page and harvesting links on diffbot error.
|
2014-07-02 11:53:24 -07:00 |
|
mwells
|
2b321efb1e
|
debug log msgs
|
2014-07-01 16:57:25 -07:00 |
|
Matt Wells
|
e9ff8c48d8
|
try to remove the sluggishness from
all hosts... should really reduce load.
|
2014-06-25 17:46:28 -07:00 |
|
Matt Wells
|
e36d9d1f3a
|
turn off dup removal for all download queries now,
not just bulk jobs. it is confusing people too much
|
2014-06-17 18:50:42 -07:00 |
|
mwells
|
4e5cf747dc
|
cygwin fixes
|
2014-06-07 16:29:39 -07:00 |
|
Matt Wells
|
fcc8bc85cc
|
update bulk job restart
|
2014-06-04 09:36:26 -07:00 |
|
Matt Wells
|
b534ac5812
|
do not print completed time if spidering is going on
|
2014-06-03 20:30:10 -07:00 |
|
Daniel Steinberg
|
79b2d4859b
|
printCrawlDetailsInJson signature without version
|
2014-05-28 10:41:32 -07:00 |
|
Daniel Steinberg
|
c06f9fde36
|
gigablast now has a notion of version based on the request
|
2014-05-27 20:11:12 -07:00 |
|
Matt Wells
|
b8886c399c
|
show start/end job times on pagecrawlbot.
|
2014-05-21 13:55:01 -07:00 |
|
Daniel Steinberg
|
6afa3f2561
|
save spots to disk as space separated
|
2014-05-14 14:40:46 -07:00 |
|
Matt Wells
|
20a2729827
|
added jobCreationTimeUTC and jobCompletionTimeUTC
to json api
|
2014-04-25 14:12:18 -07:00 |
|
mwells
|
ac5cf7971b
|
more misc updates.
|
2014-04-05 18:09:04 -07:00 |
|
Daniel Steinberg
|
0efac8c156
|
Defect #2080: seed URLs duplicated
|
2014-03-25 17:25:55 -07:00 |
|