263 Commits

Author SHA1 Message Date
Matt Wells
53ee1039b8 import final hung crawl fix into old gb.
xor the firstip into the doledb key this time.
seems to avoid all collisions now so we don't
overwrite nodes in the doledb tree.
2016-11-07 09:11:27 -08:00
Matt Wells
616bfeea86 show corrupt collection numbers for spiderdb corrupted recs. 2016-08-15 12:55:22 -07:00
Zak Betz
0ee1e6164c Add input validation to regexes before crawlbot collections are created.
Add ./gb egrep command to test regexes.
2016-05-11 14:03:45 -06:00
Matt
5072e851b7 fix misspelling 2016-03-28 17:26:40 -06:00
Matt Wells
1d2dfe1456 bring back max doc len parms.
index gbssIsContentTruncated field.
fix 30-day wait for >= 3 errors.
fix gbss formatting some more.
2016-02-08 14:10:04 -08:00
Matt Wells
becc244e12 add new link to page crawlbot to see spider attempt
gbss docs.
2015-12-15 16:22:53 -08:00
Matt Wells
ddd5cc711e show hosts that say a collection has urls ready to spider
on page crawlbot.
2015-09-09 16:56:01 -07:00
Matt Wells
cbf01ab77c add download new urls.csv link to crawlbot page 2015-08-29 14:28:01 -07:00
sam
f3d35b557f should solve defect 2015-07-13 18:08:25 -07:00
Matt Wells
eb10c62303 fix support for _html.json 2015-04-25 14:37:16 -07:00
Matt Wells
fc6d9631c5 fix for _html.json download 2015-04-24 18:02:46 -06:00
Matt Wells
a2feab9a4a tap in some fixes for running the newly updated smokes
for dealing with the new urls.csv format
2015-04-21 15:20:57 -07:00
Matt
11ea50935d use new urls.csv only for GET /v3/crawl/download/token-collname_urls.csv
version 3
2015-04-15 17:48:55 -06:00
Matt
ef42a9cf28 new urls.csv polish. moved columns around. added
some new gbss fields, like spidered time.
2015-04-15 17:42:56 -06:00
Matt Wells
61af961dfd use m_sentToDiffbotThisTime in SpiderReply now too 2015-04-14 15:23:12 -07:00
Matt
90456222b6 now we add the spider status docs as json documents.
so you can facet/sortby the various fields, etc.
2015-03-19 16:17:36 -06:00
mwells
dfc069aaa1 do away with filtered/banned spider priorities.
add checkbox to signify force deletes to remove urls from index
if in the index, or not allow them in.
2015-03-17 20:27:23 -06:00
Matt
dfd6d8b2cf fix critical spider bug that was deleting pages
because of bogus SpiderReply::m_langId values!
2015-03-05 08:49:39 -08:00
mwells
87285ba3cd use gbmemcpy not memcpy so we can get profiler working again
since memcpy can't be interrupted and backtrace() called.
2015-01-13 12:25:42 -07:00
Matt
adcef39376 Merge branch 'diffbot-testing' into diffbot-matt
Conflicts:
	Collectiondb.cpp
	Collectiondb.h
	Conf.cpp
	Conf.h
	Msg39.cpp
	PageEvents.cpp
	PageResults.cpp
	PageTurk.cpp
	Pages.cpp
	Parms.cpp
	Posdb.cpp
	Proxy.cpp
	Query.cpp
	Query.h
	RdbBase.cpp
	RdbMap.cpp
	Repair.cpp
	Repair.h
	SafeBuf.cpp
	Spider.cpp
	Tagdb.cpp
	TopTree.cpp
	XmlDoc.cpp
	main.cpp
2014-11-20 16:53:07 -08:00
Matt
4e8a42e024 text replacements for bad int32_t substitutions 2014-11-17 18:24:38 -08:00
Matt
931a1c4bc6 good checkpoint. quite a few fixes. 2014-11-17 18:13:36 -08:00
Matt
96b8197ad3 now it compiles with -m32 2014-11-10 14:45:11 -08:00
Matt Wells
23c565afc8 fix a couple of cores.
reduce memory usage significantly by not
pre-allocating some per-collection hashtables.
2014-11-05 09:36:42 -08:00
emmanuelcharon
790c525820 rename diffbotHopcount to maxHops 2014-11-04 16:05:20 -08:00
emmanuelcharon
c29dedd714 added diffbotHopcount parameter for diffbot crawl and bulk jobs, also updated PageCrawlbot.cpp 2014-10-31 16:34:31 -07:00
Matt Wells
dbd3898cf0 fix a couple cores 2014-10-31 13:36:07 -07:00
Matt Wells
e7dd8f7956 replace long long with int64_t 2014-10-30 13:36:39 -06:00
Matt Wells
b13f3d24d7 replaced unsigned long long with uint64_t 2014-10-30 13:30:39 -06:00
mwells
0409571262 Merge branch 'diffbot-testing' into testing
Conflicts:
	Spider.cpp
2014-07-28 14:37:44 -07:00
Matt Wells
3a32301f99 minor ui update 2014-07-28 13:26:17 -07:00
Matt Wells
d2b1196a85 Merge branch 'diffbot-testing' into testing 2014-07-22 10:47:33 -07:00
Matt Wells
46ca3fceeb fix oops 2014-07-22 10:18:03 -07:00
Matt Wells
248b02ea9e fix another spiderdb corruption core 2014-07-22 06:34:34 -07:00
mwells
d5805733e5 more api updates 2014-07-13 09:35:44 -07:00
mwells
5f26918910 lots of bug fixes. more qa fixes. 2014-07-11 08:00:30 -07:00
Matt Wells
1361e5728c show actual diffbot error in urls.csv.
do not stop indexing page and harvesting links on diffbot error.
2014-07-02 11:53:24 -07:00
mwells
2b321efb1e debug log msgs 2014-07-01 16:57:25 -07:00
Matt Wells
e9ff8c48d8 try to remove the sluggishness from
all hosts... should really reduce load.
2014-06-25 17:46:28 -07:00
Matt Wells
e36d9d1f3a turn off dup removal for all download queries now,
not just bulk jobs. it is confusing people too much
2014-06-17 18:50:42 -07:00
mwells
4e5cf747dc cygwin fixes 2014-06-07 16:29:39 -07:00
Matt Wells
fcc8bc85cc update bulk job restart 2014-06-04 09:36:26 -07:00
Matt Wells
b534ac5812 do not print completed time if spidering is going on 2014-06-03 20:30:10 -07:00
Daniel Steinberg
79b2d4859b printCrawlDetailsInJson signature without version 2014-05-28 10:41:32 -07:00
Daniel Steinberg
c06f9fde36 gigablast now has a notion of version based on the request 2014-05-27 20:11:12 -07:00
Matt Wells
b8886c399c show start/end job times on pagecrawlbot. 2014-05-21 13:55:01 -07:00
Daniel Steinberg
6afa3f2561 save spots to disk as space separated 2014-05-14 14:40:46 -07:00
Matt Wells
20a2729827 added jobCreationTimeUTC and jobCompletionTimeUTC
to json api
2014-04-25 14:12:18 -07:00
mwells
ac5cf7971b more misc updates. 2014-04-05 18:09:04 -07:00
Daniel Steinberg
0efac8c156 Defect : seed URLs duplicated 2014-03-25 17:25:55 -07:00