3191980f49the new urls.csv format is ready. added url discovered time to gbssdocs so we know when we first found a url. also added to new urls.csv. fixed spiderdb list deduping so as not to discard the oldest spider request any more so we keep our discovered time intact.
Matt
2015-04-15 12:13:27 -06:00
f0f8f0a967Merge branch 'diffbot' into diffbot-testing
Matt Wells
2015-04-14 16:27:28 -06:00
0c88ebba9bremoved buggy close least used linked list logic. was causing data corruption in reads and writes. go to urgent shutdown mode if on 10th try so gb will actually exit. do not startup if there is critical data corruption.
Matt Wells
2015-04-14 15:26:46 -07:00
61af961dfduse m_sentToDiffbotThisTime in SpiderReply now too
Matt Wells
2015-04-14 15:23:12 -07:00
99454bc8caadded gbssSentToDiffbotThisTime and gbssSentToDiffbotAtSomeTime to gbss docs to clarify if the url was sent to diffbot at this crawl time, or any time. makes it easier to see what is getting processed this crawl round.
Matt Wells
2015-04-14 14:50:39 -07:00
a92b158bc7gbss doc oopsy fixes
Matt
2015-04-13 16:47:41 -06:00
040a604ec6xmldoc back to o2
Matt
2015-04-13 14:49:44 -06:00
497131d359fix gbssdocid bug better
Matt
2015-04-13 14:33:57 -06:00
3e5218c54cfix gbssDocId:123456789, et al, query. will only work for docs indexed after applying this fix.
Matt
2015-04-13 14:13:16 -06:00
31ac1fa2b0quick fix for spider status
Matt Wells
2015-04-13 12:16:55 -07:00
8d1e67be0athe diffbot objects we index as their own separate doc should inherit the hopcount from the parent html doc.
Matt Wells
2015-04-13 12:12:43 -07:00
4a32c8308eMerge branch 'diffbot-testing' into diffbot
Matt Wells
2015-04-13 12:08:06 -07:00
614e9215cdupdate getSpiderStatusMsg() to always set *status. always show diffbotreply when doing crawlbottesting
Matt Wells
2015-04-13 12:06:22 -07:00
9f836dbf75fix corruption of s_vbuf (gb version) in the hosts table.
Matt
2015-04-13 11:13:44 -06:00
4a43e1387ebetter fixes for core from sig alarms
Matt
2015-04-13 10:28:43 -06:00
48ac8bf80ffix udp linked list thing again
Matt
2015-04-13 10:13:59 -06:00
e6696a6937fix some more
Matt
2015-04-13 10:08:01 -06:00
9feb070fe9fix issue of not being able to exit gb when a disk read retry is taking forever.
Matt
2015-04-13 10:06:08 -06:00
f5a7423336fix bug of never calling callback
Matt
2015-04-13 09:56:21 -06:00
47ed2a57eeupdate log msg
Matt Wells
2015-04-13 07:49:57 -07:00
43ced700d0calls NEWS BLOG
Matt
2015-04-12 12:33:09 -06:00
2814e3db37show screenshots
Matt
2015-04-12 11:52:22 -06:00
994ba73007log more when doing crawlbottesting-* tests
Matt Wells
2015-04-12 06:46:33 -07:00
56a46fd294fix printing of facets when &header=0 so diffbot json output is still simple and correct.
mwells
2015-04-10 16:25:38 -06:00
a891fb7bdcturn on indexing spider status docs for all diffbot CRAWLS on startup, whether it was off or on before.
Matt Wells
2015-04-10 14:35:41 -07:00
4dce44c976fix log msg
Matt Wells
2015-04-10 13:31:46 -07:00
02aa138fbbdebug helper msg
Matt Wells
2015-04-10 13:27:06 -07:00
13d0361756try to speed up host #4 on seraph
Matt
2015-04-10 09:20:18 -06:00
6b139e9eeeclarify jam ups
Matt Wells
2015-04-08 18:30:27 -07:00
e38cb8c080Merge branch 'diffbot' into diffbot-testing
Matt Wells
2015-04-08 18:08:10 -07:00
a7e48515fafix core when adding gbss for a force deleted doc.
Matt Wells
2015-04-08 18:07:32 -07:00
64bae224e0fix core on the GI
Matt Wells
2015-04-08 16:05:32 -06:00
97d3b185c1just use INCOMING udp slots/sockets for jam detection. this will highlight the slow nodes better.
Matt Wells
2015-04-08 15:52:43 -06:00
7fd4310106don't include gbss headers if not gbss documents
Matt Wells
2015-04-07 22:05:11 -07:00
2997b3bb28fix for skipping dead shards on tagrec lookup
mwells
2015-04-07 14:36:47 -06:00
53a2d39afdfix for calling callback of timedout udp slots
mwells
2015-04-07 14:18:42 -06:00
2114c40cdafix not calling callback when udp reply times out. like for msg39 replies we need to timeout quickly.
Matt Wells
2015-04-07 12:38:35 -07:00
05a66cc367fix bug of not able to get ip address because peeksize is too big.
Matt Wells
2015-04-07 12:29:19 -07:00
b08d12a11efix cores associated with new spider status docs.
Matt Wells
2015-04-07 10:33:54 -07:00
4ed8231222up max vfds again
Matt Wells
2015-04-07 09:56:04 -06:00
036fb4e0dccrap, another oopsy fix
Matt
2015-04-06 14:43:58 -07:00
bc3335c434more facet counting fixes
Matt
2015-04-06 14:38:38 -07:00
74fe3c5866more facet counting fixes
Matt
2015-04-06 14:02:59 -07:00
1a262c8254fixed oopsy
Matt
2015-04-06 13:51:19 -07:00
8326460e8ffix counting of # docs that have facet field.
Matt Wells
2015-04-06 14:41:44 -06:00
330d9a9dbfreport max rounds reached, not max to process or crawl reached.
Matt Wells
2015-04-06 10:24:33 -06:00
bffaa09599fix for the GI
Matt
2015-04-06 08:24:00 -06:00
de187dbb2bdocumentation fix
Matt
2015-04-03 16:00:04 -06:00
8433c49aa9make sure we index a spider status doc for each diffbot object. that way we can tell if diffbot objects are deduping, how they are changing over time, etc.
Matt
2015-04-03 14:59:09 -06:00
dad1cb15f4fix excessive looping when calling makeCallbacks() on niceness 1 or above when none are available.
Matt
2015-04-03 12:12:58 -06:00
c991a2dcddtry to ameliorate the udp slot jamming issue.
Matt
2015-04-03 10:43:11 -06:00
c2567ad244a hopeful fix for host #0 always crashing from streaming socket timeouts.
Matt
2015-04-02 15:17:49 -06:00
2ce107e4bekeep track of how many times the host exited/cored as an exponent to the 'x' in the hosts table. this way we can detect hosts that have restarted many times and fix them.
Matt
2015-04-01 16:28:58 -06:00
e583850e40fix core when searching bogus collection.
Matt
2015-04-01 15:30:59 -06:00
94a8210586added CSV to output dropdown. show all json fields for spider status doc csv files. support spider status docs in csv output.
Matt
2015-04-01 13:53:03 -06:00
f26c9d609bone more qa test fix for spider status docs
Matt
2015-04-01 12:47:32 -06:00
5e46262cb2more fixes for qa'ing of new spider status docs
Matt
2015-04-01 12:03:17 -06:00
10a31783bbfixes to pass internal qa tests in light of gbss (spider status doc) changes and other things. had to make xmldoc.o -O2 instead of -O3 to fix strange bug.
Matt
2015-04-01 11:20:36 -06:00
6b293f17e6now show "totalDocsWithField" for each facet, so we know how many docs had that field, with any particular value, so we can do tf/idf type things.
Matt
2015-04-01 09:16:42 -06:00
47f6d9f414clean out rebuild trees/buckets too
mwells
2015-03-21 22:42:49 -06:00
e99b2f0a65added RdbBuckets::cleanBuckets() corresponding to RdbTree::cleanTree() to remove keys from deleted collections at startup.
Matt
2015-03-21 22:28:34 -06:00
000c5d67e9do not index xml docs' body for custom crawls or when indexbody is turned off.
Matt
2015-03-21 09:21:44 -06:00
9f42a6d5fffix indexing of spider status docs
Matt Wells
2015-03-20 18:08:39 -07:00
7d82a5ca69try to get diffbot reply info first before making spider status doc
Matt Wells
2015-03-20 17:44:21 -07:00
07d13541edemergency fixes for corrupt tagdb tag id
Matt Wells
2015-03-20 17:21:52 -07:00
62bbede498try to fix strange unknown tagid core
Matt Wells
2015-03-20 14:20:40 -07:00
7343c40e98fix a couple status doc fields.
Matt Wells
2015-03-20 12:51:36 -06:00
1a80cb1b5dMerge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
mwells
2015-03-20 12:45:24 -06:00
7de9f6940bdocumentation for new gbss spider status doc fields.
mwells
2015-03-20 12:43:21 -06:00
70bdc97bfdupdate README.md
Matt
2015-03-19 23:31:09 -06:00
afa14e84a1fixes for generating spider status docs. and displaying them.
mwells
2015-03-19 23:24:36 -06:00
fc0b3e7743url filters fixes.
Matt
2015-03-19 18:17:26 -06:00
ed5fe6d284more bug fixes from 'delete' column addition to url filters.
Matt
2015-03-19 18:05:36 -06:00
4b1dcfa068fix isrssext bug
Matt
2015-03-19 17:28:19 -06:00
90456222b6now we add the spider status docs as json documents. so you can facet/sortby the various fields, etc.
Matt
2015-03-19 16:17:36 -06:00
d5560c3e77final fix for new delete column
mwells
2015-03-17 21:47:31 -06:00
c9b14b1b89fix 'delete' checkbox in url filters. fix reading in of xml conf files that have </> tags.
Matt
2015-03-17 21:20:27 -06:00
dfc069aaa1do away with filtered/banned spider priorities. add checkbox to signify force deletes to remove urls from index if in the index, or not allow them in.
mwells
2015-03-17 20:27:23 -06:00
dea534827elangidbits init bug leftover from searchinput reset memset fix i think.
Matt
2015-03-17 15:04:31 -06:00
f830eb43f7Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2015-03-17 14:31:33 -06:00
a54471849bsitemap.xml support for harvesting loc urls. parse xml docs as pure xml again but set nodeid to TAG_LINK etc. so Linkdb.cpp can get links again. added isparentsitemap url filter to prioritize urls from sitemaps. added isrssext to url filters to prioritize new possible rss feed urls. added numinlinks to url filters to prioritize popular urls for spidering. use those filters in default web filter set. fix filters that delete urls from the index using the 'DELETE' priority. they weren't getting deleted.
Matt
2015-03-17 14:26:16 -06:00
29b2707ad7fix searchinput::clear() bug. final fix for fhtqt memleak bug.
Matt Wells
2015-03-15 07:48:29 -07:00
3b39b1d37afix facet mem leak from QueryTerm::m_facetHashTable and safebuf when doing federated queries over a token.
Matt Wells
2015-03-15 07:18:32 -07:00
427fae7135fix log spam
Matt Wells
2015-03-12 22:31:40 -07:00
e71af6d26crecompute active list every 3 secs. otherwise it seems buggy and drops collections it shouldn't.
Matt Wells
2015-03-12 22:22:19 -07:00
83be5d7d46fix links parser so it harvests outlinks from rss feeds' <link> tags. it was doing this before, now it is doing it again.
Matt
2015-03-12 17:35:47 -07:00
7879537ab6fix json search results formatting.
Matt Wells
2015-03-12 14:25:37 -07:00
5d2a9d6d8cMerge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2015-03-12 14:00:03 -07:00
89b61d95a8fix fix
Matt Wells
2015-03-12 13:59:42 -07:00
3c2b082540gbfacetstr: is case-sensitive.
Matt Wells
2015-03-12 13:54:11 -07:00
a8dfa56098Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2015-03-12 13:19:16 -07:00
485c600c7cfix for changing maxtocrawl/process/rounds. fix for thinking a crawl is done when it is just taking a while to populate doledb from the waiting tree for that SpiderColl. we just call populateDoledbFromWaiting in doneSleepingWrapper every 50ms. it loops over every coll so it could be more efficient.
Matt Wells
2015-03-12 13:15:52 -07:00
a4e95899bdmakefile updates for building pkgs
Matt
2015-03-10 21:43:37 -07:00